* [PATCH 0/5] IO-less dirty throttling v8
From: Wu Fengguang @ 2011-08-06  8:44 UTC
  To: linux-fsdevel
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Dave Chinner,
	Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm,
	LKML, Wu Fengguang

Hi all,

This series contains the _core_ bits of the IO-less balance_dirty_pages(),
heavily simplified and re-commented to make them easier to review.

	git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8

Only the bare minimum algorithms are presented, so you will find some rough
edges in the graphs linked below. But it's usable :)

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/

And an introduction to the (more complete) algorithms:

	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf

Questions and reviews are highly appreciated!

shortlog:

	Wu Fengguang (5):
	      writeback: account per-bdi accumulated dirtied pages
	      writeback: dirty position control
	      writeback: dirty rate control
	      writeback: per task dirty rate limit
	      writeback: IO-less balance_dirty_pages()

	The last 4 patches form one single logical change, but are split up
	here to make it easier to review the different parts of the algorithm.

diffstat:

	 include/linux/backing-dev.h      |    8 +
	 include/linux/sched.h            |    7 +
	 include/trace/events/writeback.h |   24 --
	 mm/backing-dev.c                 |    3 +
	 mm/memory_hotplug.c              |    3 -
	 mm/page-writeback.c              |  459 ++++++++++++++++++++++----------------
	 6 files changed, 290 insertions(+), 214 deletions(-)

Thanks,
Fengguang


* [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages
From: Wu Fengguang @ 2011-08-06  8:44 UTC
  To: linux-fsdevel
  Cc: Andrew Morton, Jan Kara, Michael Rubin, Peter Zijlstra,
	Wu Fengguang, Christoph Hellwig, Dave Chinner, Greg Thelen,
	Minchan Kim, Vivek Goyal, Andrea Righi, linux-mm, LKML


Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.
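
To illustrate how the accumulated counter is meant to be consumed, here
is a minimal userspace sketch (not kernel code; HZ and the sample
numbers are assumptions for the example):

    #include <stdio.h>

    #define HZ 1000 /* assumed tick rate for this sketch */

    /* dirty bandwidth in pages/s, from two BDI_DIRTIED snapshots
     * taken `elapsed` ticks apart */
    static unsigned long dirty_bw(unsigned long dirtied,
                                  unsigned long dirtied_stamp,
                                  unsigned long elapsed)
    {
            return (dirtied - dirtied_stamp) * HZ / elapsed;
    }

    int main(void)
    {
            /* 5120 pages dirtied over 200ms => 25600 pages/s (~100MB/s) */
            printf("%lu pages/s\n", dirty_bw(105120, 100000, HZ / 5));
            return 0;
    }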

CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    2 ++
 mm/page-writeback.c         |    1 +
 3 files changed, 4 insertions(+)

--- linux-next.orig/include/linux/backing-dev.h	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-12 20:58:40.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_DIRTIED,
 	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
--- linux-next.orig/mm/page-writeback.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-12 20:58:40.000000000 +0800
@@ -1530,6 +1530,7 @@ void account_page_dirtied(struct page *p
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
 	}
--- linux-next.orig/mm/backing-dev.c	2011-06-12 20:58:31.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:58:55.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:     %10lu kB\n"
 		   "DirtyThresh:        %10lu kB\n"
 		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiDirtied:         %10lu kB\n"
 		   "BdiWritten:         %10lu kB\n"
 		   "BdiWriteBandwidth:  %10lu kBps\n"
 		   "b_dirty:            %10lu\n"
@@ -109,6 +110,7 @@ static int bdi_debug_stats_show(struct s
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
 		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,



* [PATCH 2/5] writeback: dirty position control
From: Wu Fengguang @ 2011-08-06  8:44 UTC
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML


The old scheme is:
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

The new scheme is:

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

For simplicity, only the global/bdi setpoint control lines are
implemented here, so the [*] curve is straighter than the ideal one
shown in the figure above.

bdi_position_ratio() provides a scale factor for bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty page counts back to
the global/bdi setpoints.
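
To make the control line concrete, here is a minimal userspace sketch of
just the global setpoint line (DIRTY_SCOPE = 8 is an assumed value, the
bdi and auxiliary lines are ignored, and pos_ratio/CALC_SHIFT mirror the
bdi_position_ratio()/BANDWIDTH_CALC_SHIFT names used below):

    #include <stdio.h>

    #define CALC_SHIFT  10
    #define DIRTY_SCOPE 8 /* assumed value for this sketch */

    /* ~1024 (i.e. 1.0) at the setpoint, 0 at the origin, linear between */
    static unsigned long pos_ratio(unsigned long thresh, unsigned long dirty)
    {
            unsigned long goal = thresh - thresh / DIRTY_SCOPE;
            unsigned long origin = 4 * thresh;

            if (dirty >= origin)
                    return 0;
            return ((origin - dirty) << CALC_SHIFT) / (origin - goal + 1);
    }

    int main(void)
    {
            unsigned long thresh = 1000, dirty;

            /* >1024 below the setpoint (speed up), <1024 above (slow down) */
            for (dirty = 700; dirty <= 1300; dirty += 200)
                    printf("dirty=%4lu pos_ratio=%4lu/1024\n",
                           dirty, pos_ratio(thresh, dirty));
            return 0;
    }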

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define BANDWIDTH_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ *  When the number of dirty pages goes higher/lower than the setpoint, the dirty
+ *  position ratio (and hence dirty rate limit) will be decreased/increased to
+ *  bring the dirty pages back to the setpoint.
+ *
+ *                              setpoint
+ *                                 v
+ * |-------------------------------*-------------------------------|-----------|
+ * ^                               ^                               ^           ^
+ * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
+ *
+ *                          bdi setpoint
+ *                                 v
+ * |-------------------------------*-------------------------------------------|
+ * ^                               ^                                           ^
+ * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
+ *
+ * (o) pseudo code
+ *
+ *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
+ *
+ *     if (dirty < thresh) scale up   pos_ratio
+ *     if (dirty > thresh) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
+ *
+ * (o) global/bdi control lines
+ *
+ * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
+ * several control lines in turn.
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * If any control line drops below Y=0 before reaching @limit, an auxiliary
+ * line will be set up to connect them. The figure below illustrates the main
+ * bdi control line with an auxiliary line extending it to @limit.
+ *
+ * This allows smoothly throttling bdi_dirty down to normal if it starts high
+ * in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
+ * - the bdi dirty thresh goes down quickly due to change of JBOD workload
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, bw scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, bw scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 bdi setpoint                 bdi origin            limit
+ *
+ * The bdi control line: if (origin < limit), an auxiliary control line (*)
+ * will be set up to extend the main control line (o) to @limit.
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long origin;
+	unsigned long goal;
+	unsigned long long span;
+	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 */
+	goal = thresh - thresh / DIRTY_SCOPE;
+	origin = 4 * thresh;
+
+	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
+		origin = limit;			/* auxiliary control line */
+		goal = (goal + origin) / 2;
+		pos_ratio >>= 1;
+	}
+	pos_ratio = origin - dirty;
+	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
+	do_div(pos_ratio, origin - goal + 1);
+
+	/*
+	 * bdi setpoint
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
+	/*
+	 * Use span=(4*bw) in single disk case and transit to bdi_thresh in
+	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
+	 * Otherwise the bdi write bandwidth is good for limiting the floating
+	 * area, which makes the bdi control line a good backup when the global
+	 * control line is too flat/weak in large memory systems.
+	 */
+	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
+		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
+	do_div(span, thresh + 1);
+	origin = goal + 2 * span;
+
+	if (unlikely(bdi_dirty > goal + span)) {
+		if (bdi_dirty > limit)
+			return 0;
+		if (origin < limit) {
+			origin = limit;		/* auxiliary control line */
+			goal += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= origin - bdi_dirty;
+	do_div(pos_ratio, origin - goal + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)



* [PATCH 3/5] writeback: dirty rate control
From: Wu Fengguang @ 2011-08-06  8:44 UTC
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML


It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / pos_bw;
        sleep(pause);
    }
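
For example (numbers assumed for illustration): a task that dirtied
pages_dirtied = 32 pages while throttled at pos_bw = 6400 pages/s sleeps

         pause = 32 pages / 6400 pages/s = 5ms

before its write() returns.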

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        bw = bdi->dirty_ratelimit;
        ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
        if (dirty pages unbalanced)
             bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

When N dd tasks are started, throttle each dd at

         task_ratelimit = pos_bw (any non-zero initial value is OK)

After 200ms, we get

         dirty_bw = # of pages dirtied by app / 200ms
         write_bw = # of pages written to disk / 200ms

For aggressive dirtiers, the following equality holds

         dirty_bw == N * task_ratelimit
                  == N * pos_bw                      	(1)

The balanced throttle bandwidth can be estimated by

         ref_bw = pos_bw * write_bw / dirty_bw       	(2)

From (1) and (2), we get the equality

         ref_bw == write_bw / N                      	(3)

If the N dd's are all throttled at ref_bw, the dirty/writeback rates
will match. So ref_bw is the balanced dirty rate.
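
A quick sanity check with made-up numbers: N = 2 tasks throttled at
pos_bw = 20000 pages/s dirty at dirty_bw = 40000 pages/s by (1); with
write_bw = 25600 pages/s, (2) yields

         ref_bw = 20000 * 25600 / 40000 = 12800 pages/s

which is exactly write_bw / 2, as (3) predicts.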

In practice, the ref_bw calculated by (2) may fluctuate and carry
estimation errors. So the bdi->dirty_ratelimit update policy is to follow
it only when pos_bw and ref_bw point in the same direction (indicating
that the dirty position has not only deviated from the global/bdi
setpoints, but is still moving away from them).
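
The "same side" rule condenses into a small userspace sketch (a
simplification of the update step in bdi_update_dirty_ratelimit() below;
the numbers are assumptions):

    #include <stdio.h>

    /* follow pos_bw/ref_bw only when both sit on the same side of bw */
    static unsigned long update_ratelimit(unsigned long bw,     /* current */
                                          unsigned long pos_bw,
                                          unsigned long ref_bw)
    {
            if (pos_bw < bw) {
                    if (ref_bw < bw)
                            bw = ref_bw > pos_bw ? ref_bw : pos_bw; /* max */
            } else {
                    if (ref_bw > bw)
                            bw = ref_bw < pos_bw ? ref_bw : pos_bw; /* min */
            }
            return bw;
    }

    int main(void)
    {
            /* both signals say "lower": step conservatively down to 60 */
            printf("%lu\n", update_ratelimit(80, 60, 50));
            /* signals disagree: leave the limit at 80 untouched */
            printf("%lu\n", update_ratelimit(80, 90, 50));
            return 0;
    }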

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    7 +++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   69 +++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base throttle bandwidth, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 09:08:35.000000000 +0800
@@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long bw = bdi->dirty_ratelimit;
+	unsigned long dirty_bw;
+	unsigned long pos_bw;
+	unsigned long ref_bw;
+	unsigned long long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
+	 */
+	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+	pos_bw++;  /* avoids bdi->dirty_ratelimit getting stuck at 0 */
+
+	/*
+	 * ref_bw = pos_bw * write_bw / dirty_bw
+	 *
+	 * It's a linear estimation of the "balanced" throttle bandwidth.
+	 */
+	pos_ratio *= bdi->avg_write_bandwidth;
+	do_div(pos_ratio, dirty_bw | 1);
+	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+
+	/*
+	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
+	 * are on the same side of dirty_ratelimit, which not only makes it
+	 * more stable, but is also essential for preventing it from being
+	 * driven away by possible systematic errors in ref_bw.
+	 */
+	if (pos_bw < bw) {
+		if (ref_bw < bw)
+			bw = max(ref_bw, pos_bw);
+	} else {
+		if (ref_bw > bw)
+			bw = min(ref_bw, pos_bw);
+	}
+
+	bdi->dirty_ratelimit = bw;
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
+					   bdi_dirty, dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 6718 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / pos_bw;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        bw = bdi->dirty_ratelimit;
        ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
        if (dirty pages unbalanced)
             bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

When started N dd, throttle each dd at

         task_ratelimit = pos_bw (any non-zero initial value is OK)

After 200ms, we got

         dirty_bw = # of pages dirtied by app / 200ms
         write_bw = # of pages written to disk / 200ms

For aggressive dirtiers, the equality holds

         dirty_bw == N * task_ratelimit
                  == N * pos_bw                      	(1)

The balanced throttle bandwidth can be estimated by

         ref_bw = pos_bw * write_bw / dirty_bw       	(2)

>From (1) and (2), we get equality

         ref_bw == write_bw / N                      	(3)

If the N dd's are all throttled at ref_bw, the dirty/writeback rates
will match. So ref_bw is the balanced dirty rate.

In practice, the ref_bw calculated by (2) may fluctuate and have
estimation errors. So the bdi->dirty_ratelimit update policy is to
follow it only when both pos_bw and ref_bw point to the same direction
(indicating not only the dirty position has deviated from the global/bdi
setpoints, but also it's still departing away).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    7 +++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   69 +++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base throttle bandwidth, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 09:08:35.000000000 +0800
@@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long bw = bdi->dirty_ratelimit;
+	unsigned long dirty_bw;
+	unsigned long pos_bw;
+	unsigned long ref_bw;
+	unsigned long long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
+	 */
+	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
+
+	/*
+	 * ref_bw = pos_bw * write_bw / dirty_bw
+	 *
+	 * It's a linear estimation of the "balanced" throttle bandwidth.
+	 */
+	pos_ratio *= bdi->avg_write_bandwidth;
+	do_div(pos_ratio, dirty_bw | 1);
+	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+
+	/*
+	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
+	 * are on the same side of dirty_ratelimit. Which not only makes it
+	 * more stable, but also is essential for preventing it being driven
+	 * away by possible systematic errors in ref_bw.
+	 */
+	if (pos_bw < bw) {
+		if (ref_bw < bw)
+			bw = max(ref_bw, pos_bw);
+	} else {
+		if (ref_bw > bw)
+			bw = min(ref_bw, pos_bw);
+	}
+
+	bdi->dirty_ratelimit = bw;
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
+					   bdi_dirty, dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 3/5] writeback: dirty rate control
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: dirty-ratelimit --]
[-- Type: text/plain, Size: 6718 bytes --]

It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.

On write() syscall, use bdi->dirty_ratelimit
============================================

    balance_dirty_pages(pages_dirtied)
    {
        pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / pos_bw;
        sleep(pause);
    }

On every 200ms, update bdi->dirty_ratelimit
===========================================

    bdi_update_dirty_ratelimit()
    {
        bw = bdi->dirty_ratelimit;
        ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
        if (dirty pages unbalanced)
             bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
    }

Estimation of balanced bdi->dirty_ratelimit
===========================================

When started N dd, throttle each dd at

         task_ratelimit = pos_bw (any non-zero initial value is OK)

After 200ms, we got

         dirty_bw = # of pages dirtied by app / 200ms
         write_bw = # of pages written to disk / 200ms

For aggressive dirtiers, the equality holds

         dirty_bw == N * task_ratelimit
                  == N * pos_bw                      	(1)

The balanced throttle bandwidth can be estimated by

         ref_bw = pos_bw * write_bw / dirty_bw       	(2)

>From (1) and (2), we get equality

         ref_bw == write_bw / N                      	(3)

If the N dd's are all throttled at ref_bw, the dirty/writeback rates
will match. So ref_bw is the balanced dirty rate.

In practice, the ref_bw calculated by (2) may fluctuate and have
estimation errors. So the bdi->dirty_ratelimit update policy is to
follow it only when both pos_bw and ref_bw point to the same direction
(indicating not only the dirty position has deviated from the global/bdi
setpoints, but also it's still departing away).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    7 +++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   69 +++++++++++++++++++++++++++++++++-
 3 files changed, 75 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-05 18:05:36.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
 
+	/*
+	 * The base throttle bandwidth, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-05 18:05:36.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
 
--- linux-next.orig/mm/page-writeback.c	2011-08-05 18:05:36.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 09:08:35.000000000 +0800
@@ -736,6 +736,66 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
 
+/*
+ * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
+ *
+ * Normal bdi tasks will be curbed at or below it in long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long bw = bdi->dirty_ratelimit;
+	unsigned long dirty_bw;
+	unsigned long pos_bw;
+	unsigned long ref_bw;
+	unsigned long long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeback rate in long term, except
+	 * when dirty pages are truncated by userspace or re-dirtied by FS.
+	 */
+	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
+	 */
+	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
+
+	/*
+	 * ref_bw = pos_bw * write_bw / dirty_bw
+	 *
+	 * It's a linear estimation of the "balanced" throttle bandwidth.
+	 */
+	pos_ratio *= bdi->avg_write_bandwidth;
+	do_div(pos_ratio, dirty_bw | 1);
+	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
+
+	/*
+	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
+	 * are on the same side of dirty_ratelimit. Which not only makes it
+	 * more stable, but also is essential for preventing it being driven
+	 * away by possible systematic errors in ref_bw.
+	 */
+	if (pos_bw < bw) {
+		if (ref_bw < bw)
+			bw = max(ref_bw, pos_bw);
+	} else {
+		if (ref_bw > bw)
+			bw = min(ref_bw, pos_bw);
+	}
+
+	bdi->dirty_ratelimit = bw;
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 
 	/*
@@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
 
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 
 	/*
@@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
+					   bdi_dirty, dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }



^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 4/5] writeback: per task dirty rate limit
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: per-task-ratelimit --]
[-- Type: text/plain, Size: 7105 bytes --]

Add two fields to task_struct.

1) nr_dirtied, to account dirtied pages in each individual task, for accuracy
2) nr_dirtied_pause, the per-task balance_dirty_pages() call interval, for
   flexibility

The balance_dirty_pages() call interval (i.e. nr_dirtied_pause) will
scale roughly as the square root of the safety gap between the number
of dirty pages and the dirty threshold, as sketched below.
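
As an illustration of that scaling, here is a stand-alone user-space
sketch (not the patch itself; the kernel's ilog2() is open-coded for
user space):

	#include <stdio.h>

	/* integer log2, as the kernel's ilog2() computes for non-zero input */
	static unsigned long ilog2_ul(unsigned long v)
	{
		unsigned long l = 0;

		while (v >>= 1)
			l++;
		return l;
	}

	/* nr_dirtied_pause ~= sqrt(thresh - dirty), as a power of 2 */
	static unsigned long ratelimit_pages(unsigned long dirty,
					     unsigned long thresh)
	{
		if (thresh > dirty)
			return 1UL << (ilog2_ul(thresh - dirty) >> 1);
		return 1;
	}

	int main(void)
	{
		unsigned long gap;

		for (gap = 4; gap <= 65536; gap *= 4)
			printf("gap %6lu pages -> nr_dirtied_pause %4lu\n",
			       gap, ratelimit_pages(0, gap));
		return 0;
	}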

XXX: The main problem with per-task nr_dirtied is that if 10k tasks
start dirtying pages at exactly the same time, each task will be
assigned a large initial nr_dirtied_pause, so the dirty threshold will
be exceeded long before each task has reached its nr_dirtied_pause and
hence called balance_dirty_pages().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 ++
 mm/memory_hotplug.c   |    3 -
 mm/page-writeback.c   |  106 +++++++++-------------------------------
 3 files changed, 32 insertions(+), 84 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-05 15:36:23.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-05 15:39:52.000000000 +0800
@@ -1525,6 +1525,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2011-08-05 15:39:48.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-05 15:39:52.000000000 +0800
@@ -48,26 +48,6 @@
 
 #define BANDWIDTH_CALC_SHIFT	10
 
-/*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
- */
-static long ratelimit_pages = 32;
-
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
-	return dirtied + dirtied / 2;
-}
-
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
@@ -868,6 +848,23 @@ static void bdi_update_bandwidth(struct 
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If ratelimit_pages is too low then big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long ratelimit_pages(unsigned long dirty,
+				     unsigned long thresh)
+{
+	if (thresh > dirty)
+		return 1UL << (ilog2(thresh - dirty) >> 1);
+
+	return 1;
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a
 	if (clear_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	current->nr_dirtied = 0;
+	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr(
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long ratelimit;
-	unsigned long *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
+	ratelimit = current->nr_dirtied_pause;
+	if (bdi->dirty_exceeded)
 		ratelimit = 8;
 
-	/*
-	 * Check the rate limiting. Also, we do not want to throttle real-time
-	 * tasks in balance_dirty_pages(). Period.
-	 */
-	preempt_disable();
-	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = sync_writeback_pages(*p);
-		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
-	}
-	preempt_enable();
+	current->nr_dirtied += nr_pages_dirtied;
+	if (unlikely(current->nr_dirtied >= ratelimit))
+		balance_dirty_pages(mapping, current->nr_dirtied);
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -1166,44 +1151,6 @@ void laptop_sync_completion(void)
 #endif
 
 /*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
-	if (ratelimit_pages < 16)
-		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
-	writeback_set_ratelimit();
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
-	.notifier_call	= ratelimit_handler,
-	.next		= NULL,
-};
-
-/*
  * Called early on to tune the page writeback dirty limits.
  *
  * We used to scale dirty pages according to how total memory
@@ -1225,9 +1172,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	writeback_set_ratelimit();
-	register_cpu_notifier(&ratelimit_nb);
-
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
 	prop_descriptor_init(&vm_dirties, shift);
--- linux-next.orig/mm/memory_hotplug.c	2011-08-05 15:36:23.000000000 +0800
+++ linux-next/mm/memory_hotplug.c	2011-08-05 15:39:52.000000000 +0800
@@ -527,8 +527,6 @@ int __ref online_pages(unsigned long pfn
 
 	vm_total_pages = nr_free_pagecache_pages();
 
-	writeback_set_ratelimit();
-
 	if (onlined_pages)
 		memory_notify(MEM_ONLINE, &arg);
 	unlock_memory_hotplug();
@@ -970,7 +968,6 @@ repeat:
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
-	writeback_set_ratelimit();
 
 	memory_notify(MEM_OFFLINE, &arg);
 	unlock_memory_hotplug();



^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 14513 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. Meanwhile, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, ending up with N different sets of IO being
  issued with potentially zero locality to each other. This results in
  much lower elevator sort/merge efficiency, and hence the disk seeks
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because it could lead to user perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks
  (a 4MB chunk on a 2MB/s stick means a ~2 second stall).
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to the wait time, which makes it hard to pick a suitable IO
  size while keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have had the experience: it often takes a couple of
  seconds, or even longer, to stop a heavily writing dd/cp/tar command
  with Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress at a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing numbers of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpily in the legacy kernel, and its
  throughput is improved by 67% by this patchset (and by 93% together
  with the larger write chunk size).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely that such bursty IO completions can be optimized away, nor the
  resulting large (and tiny) stall times of IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy is to do
~10ms pauses in the 1-dd case, increasing to ~100ms in the 1000-dd
case, as illustrated below.
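
For example, with HZ=1000 and a task ratelimit of 2560 pages/s (~10MB/s
with 4KB pages), the pause derived from the patch's formula

	pause = (HZ * pages_dirtied + bw / 2) / (bw | 1)

scales linearly with the pages dirtied between two calls. A stand-alone
user-space sketch (illustrative values, not kernel code):

	#include <stdio.h>

	#define HZ	1000	/* assumed tick rate: 1 jiffy = 1 ms */

	int main(void)
	{
		unsigned long bw = 2560;	/* task ratelimit, pages/s */
		unsigned long pages_dirtied;

		for (pages_dirtied = 8; pages_dirtied <= 256; pages_dirtied *= 2) {
			/* round to nearest; "| 1" guards against division by 0 */
			unsigned long pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);

			printf("dirtied %3lu pages -> pause %3lu ms\n",
			       pages_dirtied, pause);
		}
		return 0;
	}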

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that applications get throttled once they cross the
global (background + dirty)/2 = 15% threshold (with the default 10%
background and 20% dirty ratios), and are then balanced around 17.5%,
midway between the two. Before this patch, the behavior was to simply
throttle at 20% dirtyable memory in the 1-dd case.

Since tasks will be soft throttled earlier than before, end users may
perceive it as a performance "slow down" if their application happens
to dirty more than 15% of dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  142 +++++++----------------------
 2 files changed, 37 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
@@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long bw;
+	unsigned long base_bw;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
 				     bdi_thresh, bdi_dirty, start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_bw = bdi->dirty_ratelimit;
+		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
+					bdi_thresh, bdi_dirty);
+		if (unlikely(bw == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
+		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-06 11:08:34.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-06 11:17:29.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 14816 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.

RATIONALS
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled start foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, end up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency and hence we seek the disk
  all over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution by the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency.
  Because it could lead to more than 1 second user perceivable stalls.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on itself couples the
  IO size to wait time, which makes it hard to do suitable IO size while
  keeping the wait time under control.

  Now it's possible to increase writeback chunk size proportional to the
  disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
  the larger writeback size dramatically reduces the seek count to 1/10
  (far beyond my expectation) and improves the write throughput by 24%.

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have the experience: it often takes a couple of seconds
  or even long time to stop a heavy writing dd/cp/tar command with
  Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There are a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress in a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with increasing number of dirtiers and/or bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpy in legacy kernel, and throughput is
  improved by 67% by this patchset. (plus the larger write chunk size,
  it will be 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait time and jitters.

- NFS may kill large amount of unstable pages with one single COMMIT.
  Because NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So it's not
  likely to optimize away such kind of bursty IO completions, and the
  resulted large (and tiny) stall times in IO completion based throttling.

So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy will be to do
~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case.

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  142 +++++++----------------------
 2 files changed, 37 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
@@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long bw;
+	unsigned long base_bw;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
 				     bdi_thresh, bdi_dirty, start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_bw = bdi->dirty_ratelimit;
+		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
+					bdi_thresh, bdi_dirty);
+		if (unlikely(bw == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
+		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-06 11:08:34.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-06 11:17:29.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 14816 bytes --]

As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it sleep for some time
to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher
thread to do the background writeback IO.

RATIONALE
=========

- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)

  If every thread doing writes and being throttled starts foreground
  writeback, it leads to N IO submitters from at least N different
  inodes at the same time, ending up with N different sets of IO being
  issued with potentially zero locality to each other, resulting in
  much lower elevator sort/merge efficiency; hence the disk seeks all
  over the place to service the different sets of IO.
  OTOH, if there is only one submission thread, it doesn't jump between
  inodes in the same way when congestion clears - it keeps writing to
  the same inode, resulting in large related chunks of sequential IOs
  being issued to the disk. This is more efficient than the above
  foreground writeback because the elevator works better and the disk
  seeks less.

- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)

  With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
  from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".

  * "CPU usage has dropped by ~55%", "it certainly appears that most of
    the CPU time saving comes from the removal of contention on the
    inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
    cacheline bouncing, because the new code is able to call much less
    frequently into balance_dirty_pages() and hence access the global
    page states)

  * the user space "App overhead" is reduced by 20%, by avoiding the
    cacheline pollution from the complex writeback code path

  * "for a ~5% throughput reduction", "the number of write IOs have
    dropped by ~25%", and the elapsed time reduced from 41:42.17 to
    40:53.23.

  * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
    and improves IO throughput from 38MB/s to 42MB/s.

- IO size too small for fast arrays and too large for slow USB sticks

  The write_chunk used by the current balance_dirty_pages() cannot be
  directly set to some large value (eg. 128MB) for better IO efficiency,
  because it could lead to user perceivable stalls of more than 1 second.
  Even the current 4MB write size may be too large for slow USB sticks.
  The fact that balance_dirty_pages() starts IO on its own couples the
  IO size to the wait time, which makes it hard to choose a suitable IO
  size while keeping the wait time under control.

  Now it's possible to increase the writeback chunk size proportionally
  to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB
  RAM, the larger writeback size dramatically reduces the seek count to
  1/10 (far beyond my expectation) and improves the write throughput by
  24%. (A sketch of this scaling follows at the end of this list.)

- long block time in balance_dirty_pages() hurts desktop responsiveness

  Many of us may have had the experience: it often takes a couple of
  seconds or even longer to stop a heavily writing dd/cp/tar command
  with Ctrl-C or "kill -9".

- IO pipeline broken by bumpy write() progress

  There is a broad class of "loop {read(buf); write(buf);}" applications
  whose read() pipeline will be under-utilized or even come to a stop if
  the write()s have long latencies _or_ don't progress at a constant rate.
  The current threshold based throttling inherently transfers the large
  low level IO completion fluctuations to bumpy application write()s,
  and further deteriorates with an increasing number of dirtiers and/or
  bdi's.

  For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
  the rsync progresses very bumpily in the legacy kernel, and its
  throughput is improved by 67% by this patchset (with the larger write
  chunk size, it becomes a 93% speedup).

  The new rate based throttling can support 1000+ dd's with excellent
  smoothness, low latency and low overheads.
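
As referenced above, a minimal userspace sketch of bandwidth-proportional
chunk sizing (the half-second factor, the 4MB floor and the helper name
are illustrative assumptions, not the patchset's exact code):

	#include <stdio.h>

	#define MIN_CHUNK_PAGES	1024	/* assumed 4MB floor with 4k pages */

	/* aim the flusher at roughly half a second of device throughput */
	static unsigned long write_chunk(unsigned long write_bw_pps)
	{
		unsigned long chunk = write_bw_pps / 2;

		return chunk > MIN_CHUNK_PAGES ? chunk : MIN_CHUNK_PAGES;
	}

	int main(void)
	{
		/* 100MB/s array: 25600 pages/s -> 12800 page (50MB) chunks */
		printf("%lu pages\n", write_chunk(25600));
		/* 2MB/s USB stick: 512 pages/s -> clamps to the 4MB floor */
		printf("%lu pages\n", write_chunk(512));
		return 0;
	}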

For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().

Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:

- in large NUMA systems, the per-cpu counters may have big accounting
  errors, leading to big throttle wait times and jitter.

- NFS may kill a large amount of unstable pages with one single COMMIT.
  Because the NFS server serves COMMIT with expensive fsync() IOs, it is
  desirable to delay and reduce the number of COMMITs. So such bursty IO
  completions are not likely to be optimized away, and neither are the
  resulting large (and tiny) stall times in IO completion based
  throttling.

So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:

- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than   4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times

It can control pause times at will. The default policy will be to do
~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case.
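
For illustration, here is a minimal userspace sketch of the pause
arithmetic (HZ=1000, the 200ms cap and the example numbers are all
assumptions for illustration; the kernel version is in the patch below):

	#include <stdint.h>
	#include <stdio.h>

	#define HZ		1000		/* assumed jiffies per second */
	#define MAX_PAUSE	(HZ / 5)	/* the 200ms bound above */

	/*
	 * base_bw plays the role of bdi->dirty_ratelimit (pages/second);
	 * pos_ratio stands in for bdi_position_ratio(), a 10-bit fixed
	 * point factor taken to be 1024 right at the setpoint.
	 */
	static unsigned long pause_for(unsigned long pages_dirtied,
				       unsigned long base_bw,
				       unsigned long pos_ratio)
	{
		unsigned long bw = (uint64_t)base_bw * pos_ratio >> 10;
		/* round to the nearest jiffy; "| 1" avoids a zero divide */
		unsigned long pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);

		return pause < MAX_PAUSE ? pause : MAX_PAUSE;
	}

	int main(void)
	{
		/* 1 dd, 256 pages dirtied, 25600 pages/s (100MB/s): ~10 jiffies */
		printf("%lu\n", pause_for(256, 25600, 1024));
		/* same task squeezed to 1/10 of the ratelimit: ~100 jiffies */
		printf("%lu\n", pause_for(256, 25600, 102));
		return 0;
	}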

BEHAVIOR CHANGE
===============

(1) dirty threshold

Users will notice that the applications get throttled once they cross
the global (background + dirty)/2=15% threshold, and are then balanced
around 17.5%. Before this patch, the behavior was to just throttle at
20% of dirtyable memory in the 1-dd case.

Since the task will be soft throttled earlier than before, it may be
perceived by end users as a performance "slow down" if their application
happens to dirty more than 15% of dirtyable memory.
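
(Concretely, assuming the default vm.dirty_background_ratio=10 and
vm.dirty_ratio=20: throttling starts at (10+20)/2 = 15% of dirtyable
memory, and the 17.5% balance point sits midway between that boundary
and the 20% hard limit.)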

(2) smoothness/responsiveness

Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   24 ----
 mm/page-writeback.c              |  142 +++++++----------------------
 2 files changed, 37 insertions(+), 129 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
@@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
 				numerator, denominator);
 }
 
-static inline void task_dirties_fraction(struct task_struct *tsk,
-		long *numerator, long *denominator)
-{
-	prop_fraction_single(&vm_dirties, &tsk->dirties,
-				numerator, denominator);
-}
-
-/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- *   dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-#define TASK_LIMIT_FRACTION 8
-static unsigned long task_dirty_limit(struct task_struct *tsk,
-				       unsigned long bdi_dirty)
-{
-	long numerator, denominator;
-	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty / TASK_LIMIT_FRACTION;
-
-	task_dirties_fraction(tsk, &numerator, &denominator);
-	inv *= numerator;
-	do_div(inv, denominator);
-
-	dirty -= inv;
-
-	return max(dirty, bdi_dirty/2);
-}
-
-/* Minimum limit for any task */
-static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
-{
-	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
-}
-
 /*
  *
  */
@@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
  * perform some writeout.
  */
 static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
+				unsigned long pages_dirtied)
 {
-	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_reclaimable;
 	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
 	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long task_bdi_thresh;
-	unsigned long min_task_bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
-	bool clear_dirty_exceeded = true;
+	unsigned long bw;
+	unsigned long base_bw;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		/*
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 */
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
 		 * actually dirty; with m+n sitting in the percpu
 		 * deltas.
 		 */
-		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		if (bdi_thresh < 2 * bdi_stat_error(bdi))
+			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_dirty = bdi_nr_reclaimable +
+		else
+			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
 				    bdi_stat(bdi, BDI_WRITEBACK);
-		}
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
 				  (nr_dirty > dirty_thresh);
-		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
-					(nr_dirty <= dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
-
-		if (!bdi->dirty_exceeded)
+		if (dirty_exceeded && !bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
 				     bdi_thresh, bdi_dirty, start_time);
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_balance_dirty_start(bdi);
-		if (bdi_nr_reclaimable > task_bdi_thresh) {
-			pages_written += writeback_inodes_wb(&bdi->wb,
-							     write_chunk);
-			trace_balance_dirty_written(bdi, pages_written);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+		if (unlikely(!writeback_in_progress(bdi)))
+			bdi_start_background_writeback(bdi);
+
+		base_bw = bdi->dirty_ratelimit;
+		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
+					bdi_thresh, bdi_dirty);
+		if (unlikely(bw == 0)) {
+			pause = MAX_PAUSE;
+			goto pause;
 		}
+		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
+		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = min(pause, MAX_PAUSE);
+
+pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
-		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
@@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a
 		 * (b) the pause time limit makes the dirtiers more responsive.
 		 */
 		if (nr_dirty < dirty_thresh +
-			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
-		    time_after(jiffies, start_time + MAX_PAUSE))
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
 			break;
 		/*
 		 * pass-good area. When some bdi gets blocked (eg. NFS server
@@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a
 			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
 		    bdi_dirty < bdi_thresh)
 			break;
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (clear_dirty_exceeded && bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
@@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (laptop_mode)
+		return;
+
+	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
 }
 
--- linux-next.orig/include/trace/events/writeback.h	2011-08-06 11:08:34.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-08-06 11:17:29.000000000 +0800
@@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
-DEFINE_WRITEBACK_EVENT(balance_dirty_start);
-DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
-
-TRACE_EVENT(balance_dirty_written,
-
-	TP_PROTO(struct backing_dev_info *bdi, int written),
-
-	TP_ARGS(bdi, written),
-
-	TP_STRUCT__entry(
-		__array(char,	name, 32)
-		__field(int,	written)
-	),
-
-	TP_fast_assign(
-		strncpy(__entry->name, dev_name(bdi->dev), 32);
-		__entry->written = written;
-	),
-
-	TP_printk("bdi %s written %d",
-		  __entry->name,
-		  __entry->written
-	)
-);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-06 14:35     ` Andrea Righi
  -1 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-06 14:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote:
> Add two fields to task_struct.
> 
> 1) account dirtied pages in the individual tasks, for accuracy
> 2) per-task balance_dirty_pages() call intervals, for flexibility
> 
> The balance_dirty_pages() call interval (i.e. nr_dirtied_pause) will
> scale near-sqrt with the safety gap between the dirty page count and
> the threshold.
> 
> XXX: The main problem with per-task nr_dirtied is that, if 10k tasks
> start dirtying pages at exactly the same time, each task will be
> assigned a large initial nr_dirtied_pause, so the dirty threshold will
> be exceeded long before each task reaches its nr_dirtied_pause and
> hence calls balance_dirty_pages().
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

A minor nitpick below.

Reviewed-by: Andrea Righi <andrea@betterlinux.com>

> ---
>  include/linux/sched.h |    7 ++
>  mm/memory_hotplug.c   |    3 -
>  mm/page-writeback.c   |  106 +++++++++-------------------------------
>  3 files changed, 32 insertions(+), 84 deletions(-)
> 
> --- linux-next.orig/include/linux/sched.h	2011-08-05 15:36:23.000000000 +0800
> +++ linux-next/include/linux/sched.h	2011-08-05 15:39:52.000000000 +0800
> @@ -1525,6 +1525,13 @@ struct task_struct {
>  	int make_it_fail;
>  #endif
>  	struct prop_local_single dirties;
> +	/*
> +	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
> +	 * balance_dirty_pages() for some dirty throttling pause
> +	 */
> +	int nr_dirtied;
> +	int nr_dirtied_pause;
> +
>  #ifdef CONFIG_LATENCYTOP
>  	int latency_record_count;
>  	struct latency_record latency_record[LT_SAVECOUNT];
> --- linux-next.orig/mm/page-writeback.c	2011-08-05 15:39:48.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-05 15:39:52.000000000 +0800
> @@ -48,26 +48,6 @@
>  
>  #define BANDWIDTH_CALC_SHIFT	10
>  
> -/*
> - * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
> - * will look to see if it needs to force writeback or throttling.
> - */
> -static long ratelimit_pages = 32;
> -
> -/*
> - * When balance_dirty_pages decides that the caller needs to perform some
> - * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than dirtied pages to ensure that reasonably
> - * large amounts of I/O are submitted.
> - */
> -static inline long sync_writeback_pages(unsigned long dirtied)
> -{
> -	if (dirtied < ratelimit_pages)
> -		dirtied = ratelimit_pages;
> -
> -	return dirtied + dirtied / 2;
> -}
> -
>  /* The following parameters are exported via /proc/sys/vm */
>  
>  /*
> @@ -868,6 +848,23 @@ static void bdi_update_bandwidth(struct 
>  }
>  
>  /*
> + * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
> + * will look to see if it needs to start dirty throttling.
> + *
> + * If ratelimit_pages is too low then big NUMA machines will call the expensive
> + * global_page_state() too often. So scale it near-sqrt to the safety margin
> + * (the number of pages we may dirty without exceeding the dirty limits).
> + */
> +static unsigned long ratelimit_pages(unsigned long dirty,
> +				     unsigned long thresh)
> +{
> +	if (thresh > dirty)
> +		return 1UL << (ilog2(thresh - dirty) >> 1);
> +
> +	return 1;
> +}
> +
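
(As a quick sanity check of the near-sqrt scaling: a safety margin of
thresh - dirty = 16384 pages gives ilog2(16384) = 14, hence a
1 << 7 = 128 page interval between balance_dirty_pages() calls, while a
256 page margin gives 1 << 4 = 16 pages -- exactly sqrt() of the margin
whenever the margin is a power of 4.)
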
> +/*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
>   * the caller to perform writeback if the system is over `vm_dirty_ratio'.

I think we should also fix the comment of balance_dirty_pages(), now
that it's IO-less for the caller. Maybe something like:

/*
 * balance_dirty_pages() must be called by processes which are generating dirty
 * data.  It looks at the number of dirty pages in the machine and will force
 * the caller to wait once crossing the dirty threshold. If we're over
 * `background_thresh' then the writeback threads are woken to perform some
 * writeout.
 */

> @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a
>  	if (clear_dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> +	current->nr_dirtied = 0;
> +	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
> +
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page 
>  	}
>  }
>  
> -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> -
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
>   * @mapping: address_space which was dirtied
> @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long ratelimit;
> -	unsigned long *p;
>  
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
>  
> -	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> +	ratelimit = current->nr_dirtied_pause;
> +	if (bdi->dirty_exceeded)
>  		ratelimit = 8;
>  
> -	/*
> -	 * Check the rate limiting. Also, we do not want to throttle real-time
> -	 * tasks in balance_dirty_pages(). Period.
> -	 */
> -	preempt_disable();
> -	p =  &__get_cpu_var(bdp_ratelimits);
> -	*p += nr_pages_dirtied;
> -	if (unlikely(*p >= ratelimit)) {
> -		ratelimit = sync_writeback_pages(*p);
> -		*p = 0;
> -		preempt_enable();
> -		balance_dirty_pages(mapping, ratelimit);
> -		return;
> -	}
> -	preempt_enable();
> +	current->nr_dirtied += nr_pages_dirtied;
> +	if (unlikely(current->nr_dirtied >= ratelimit))
> +		balance_dirty_pages(mapping, current->nr_dirtied);
>  }
>  EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>  
> @@ -1166,44 +1151,6 @@ void laptop_sync_completion(void)
>  #endif
>  
>  /*
> - * If ratelimit_pages is too high then we can get into dirty-data overload
> - * if a large number of processes all perform writes at the same time.
> - * If it is too low then SMP machines will call the (expensive)
> - * get_writeback_state too often.
> - *
> - * Here we set ratelimit_pages to a level which ensures that when all CPUs are
> - * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> - * thresholds before writeback cuts in.
> - *
> - * But the limit should not be set too high.  Because it also controls the
> - * amount of memory which the balance_dirty_pages() caller has to write back.
> - * If this is too large then the caller will block on the IO queue all the
> - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> - * will write six megabyte chunks, max.
> - */
> -
> -void writeback_set_ratelimit(void)
> -{
> -	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
> -	if (ratelimit_pages < 16)
> -		ratelimit_pages = 16;
> -	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> -		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
> -}
> -
> -static int __cpuinit
> -ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
> -{
> -	writeback_set_ratelimit();
> -	return NOTIFY_DONE;
> -}
> -
> -static struct notifier_block __cpuinitdata ratelimit_nb = {
> -	.notifier_call	= ratelimit_handler,
> -	.next		= NULL,
> -};
> -
> -/*
>   * Called early on to tune the page writeback dirty limits.
>   *
>   * We used to scale dirty pages according to how total memory
> @@ -1225,9 +1172,6 @@ void __init page_writeback_init(void)
>  {
>  	int shift;
>  
> -	writeback_set_ratelimit();
> -	register_cpu_notifier(&ratelimit_nb);
> -
>  	shift = calc_period_shift();
>  	prop_descriptor_init(&vm_completions, shift);
>  	prop_descriptor_init(&vm_dirties, shift);
> --- linux-next.orig/mm/memory_hotplug.c	2011-08-05 15:36:23.000000000 +0800
> +++ linux-next/mm/memory_hotplug.c	2011-08-05 15:39:52.000000000 +0800
> @@ -527,8 +527,6 @@ int __ref online_pages(unsigned long pfn
>  
>  	vm_total_pages = nr_free_pagecache_pages();
>  
> -	writeback_set_ratelimit();
> -
>  	if (onlined_pages)
>  		memory_notify(MEM_ONLINE, &arg);
>  	unlock_memory_hotplug();
> @@ -970,7 +968,6 @@ repeat:
>  	}
>  
>  	vm_total_pages = nr_free_pagecache_pages();
> -	writeback_set_ratelimit();
>  
>  	memory_notify(MEM_OFFLINE, &arg);
>  	unlock_memory_hotplug();
> 

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-06  8:44   ` Wu Fengguang
  (?)
@ 2011-08-06 14:48     ` Andrea Righi
  -1 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-06 14:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it sleep for some time
> to throttle the dirtying task. Meanwhile, kick off the per-bdi flusher
> thread to do the background writeback IO.
> 
> RATIONALE
> =========
> 
> - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
> 
>   If every thread doing writes and being throttled starts foreground
>   writeback, it leads to N IO submitters from at least N different
>   inodes at the same time, ending up with N different sets of IO being
>   issued with potentially zero locality to each other, resulting in
>   much lower elevator sort/merge efficiency; hence the disk seeks all
>   over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
> 
> - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
> 
>   With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
>   from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
> 
>   * "CPU usage has dropped by ~55%", "it certainly appears that most of
>     the CPU time saving comes from the removal of contention on the
>     inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
>     cacheline bouncing, because the new code is able to call much less
>     frequently into balance_dirty_pages() and hence access the global
>     page states)
> 
>   * the user space "App overhead" is reduced by 20%, by avoiding the
>     cacheline pollution from the complex writeback code path
> 
>   * "for a ~5% throughput reduction", "the number of write IOs have
>     dropped by ~25%", and the elapsed time reduced from 41:42.17 to
>     40:53.23.
> 
>   * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
>     and improves IO throughput from 38MB/s to 42MB/s.
> 
> - IO size too small for fast arrays and too large for slow USB sticks
> 
>   The write_chunk used by the current balance_dirty_pages() cannot be
>   directly set to some large value (eg. 128MB) for better IO efficiency,
>   because it could lead to user perceivable stalls of more than 1 second.
>   Even the current 4MB write size may be too large for slow USB sticks.
>   The fact that balance_dirty_pages() starts IO on its own couples the
>   IO size to the wait time, which makes it hard to choose a suitable IO
>   size while keeping the wait time under control.
> 
>   Now it's possible to increase the writeback chunk size proportionally
>   to the disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB
>   RAM, the larger writeback size dramatically reduces the seek count to
>   1/10 (far beyond my expectation) and improves the write throughput by
>   24%.
> 
> - long block time in balance_dirty_pages() hurts desktop responsiveness
> 
>   Many of us may have had the experience: it often takes a couple of
>   seconds or even longer to stop a heavily writing dd/cp/tar command
>   with Ctrl-C or "kill -9".
> 
> - IO pipeline broken by bumpy write() progress
> 
>   There is a broad class of "loop {read(buf); write(buf);}" applications
>   whose read() pipeline will be under-utilized or even come to a stop if
>   the write()s have long latencies _or_ don't progress at a constant rate.
>   The current threshold based throttling inherently transfers the large
>   low level IO completion fluctuations to bumpy application write()s,
>   and further deteriorates with an increasing number of dirtiers and/or
>   bdi's.
> 
>   For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
>   the rsync progresses very bumpily in the legacy kernel, and its
>   throughput is improved by 67% by this patchset (with the larger write
>   chunk size, it becomes a 93% speedup).
> 
>   The new rate based throttling can support 1000+ dd's with excellent
>   smoothness, low latency and low overheads.
> 
> For the above reasons, it's much better to do IO-less and low latency
> pauses in balance_dirty_pages().
> 
> Jan Kara, Dave Chinner and I explored the scheme of letting
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. However, it was found to have two problems:
> 
> - in large NUMA systems, the per-cpu counters may have big accounting
>   errors, leading to big throttle wait times and jitter.
> 
> - NFS may kill a large amount of unstable pages with one single COMMIT.
>   Because the NFS server serves COMMIT with expensive fsync() IOs, it is
>   desirable to delay and reduce the number of COMMITs. So such bursty IO
>   completions are not likely to be optimized away, and neither are the
>   resulting large (and tiny) stall times in IO completion based
>   throttling.
> 
> So here is a pause time oriented approach, which tries to control the
> pause time of each balance_dirty_pages() invocation, by controlling
> the number of pages dirtied before calling balance_dirty_pages(), for
> smooth and efficient dirty throttling:
> 
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than   4ms, which burns CPU power)
> - avoid too large pause time (more than 200ms, which hurts responsiveness)
> - avoid big fluctuations of pause times
> 
> It can control pause times at will. The default policy will be to do
> ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case.
> 
> BEHAVIOR CHANGE
> ===============
> 
> (1) dirty threshold
> 
> Users will notice that the applications get throttled once they cross
> the global (background + dirty)/2=15% threshold, and are then balanced
> around 17.5%. Before this patch, the behavior was to just throttle at
> 20% of dirtyable memory in the 1-dd case.
> 
> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as a performance "slow down" if their application
> happens to dirty more than 15% of dirtyable memory.
> 
> (2) smoothness/responsiveness
> 
> Users will notice a more responsive system during heavy writeback.
> "killall dd" will take effect instantly.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---

Another minor nit below.

>  include/trace/events/writeback.h |   24 ----
>  mm/page-writeback.c              |  142 +++++++----------------------
>  2 files changed, 37 insertions(+), 129 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
> @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
>  				numerator, denominator);
>  }
>  
> -static inline void task_dirties_fraction(struct task_struct *tsk,
> -		long *numerator, long *denominator)
> -{
> -	prop_fraction_single(&vm_dirties, &tsk->dirties,
> -				numerator, denominator);
> -}
> -
> -/*
> - * task_dirty_limit - scale down dirty throttling threshold for one task
> - *
> - * task specific dirty limit:
> - *
> - *   dirty -= (dirty/8) * p_{t}
> - *
> - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> - * throttling individual tasks before reaching the bdi dirty limit.
> - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> - * dirty threshold may never get throttled.
> - */
> -#define TASK_LIMIT_FRACTION 8
> -static unsigned long task_dirty_limit(struct task_struct *tsk,
> -				       unsigned long bdi_dirty)
> -{
> -	long numerator, denominator;
> -	unsigned long dirty = bdi_dirty;
> -	u64 inv = dirty / TASK_LIMIT_FRACTION;
> -
> -	task_dirties_fraction(tsk, &numerator, &denominator);
> -	inv *= numerator;
> -	do_div(inv, denominator);
> -
> -	dirty -= inv;
> -
> -	return max(dirty, bdi_dirty/2);
> -}
> -
> -/* Minimum limit for any task */
> -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> -{
> -	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> -}
> -
>  /*
>   *
>   */
> @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
>   * perform some writeout.
>   */
>  static void balance_dirty_pages(struct address_space *mapping,
> -				unsigned long write_chunk)
> +				unsigned long pages_dirtied)
>  {
> -	unsigned long nr_reclaimable, bdi_nr_reclaimable;
> +	unsigned long nr_reclaimable;
>  	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
>  	unsigned long bdi_dirty;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long task_bdi_thresh;
> -	unsigned long min_task_bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
> +	unsigned long pause = 0;
>  	bool dirty_exceeded = false;
> -	bool clear_dirty_exceeded = true;
> +	unsigned long bw;
> +	unsigned long base_bw;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
>  	for (;;) {
> +		/*
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 */
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
>  		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> -		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> -		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
>  
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
>  		 * actually dirty; with m+n sitting in the percpu
>  		 * deltas.
>  		 */
> -		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
> -			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		if (bdi_thresh < 2 * bdi_stat_error(bdi))
> +			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat_sum(bdi, BDI_WRITEBACK);
> -		} else {
> -			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		else
> +			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat(bdi, BDI_WRITEBACK);
> -		}
>  
> -		/*
> -		 * The bdi thresh is somehow "soft" limit derived from the
> -		 * global "hard" limit. The former helps to prevent heavy IO
> -		 * bdi or process from holding back light ones; The latter is
> -		 * the last resort safeguard.
> -		 */
> -		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
> +		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
>  				  (nr_dirty > dirty_thresh);
> -		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
> -					(nr_dirty <= dirty_thresh);
> -
> -		if (!dirty_exceeded)
> -			break;
> -
> -		if (!bdi->dirty_exceeded)
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
>  				     bdi_thresh, bdi_dirty, start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_bw = bdi->dirty_ratelimit;
> +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> +					bdi_thresh, bdi_dirty);
> +		if (unlikely(bw == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> +		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
> +		pause = min(pause, MAX_PAUSE);

Fix this build warning:

 mm/page-writeback.c: In function ‘balance_dirty_pages’:
 mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast
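
(For reference -- quoting the min() macro from include/linux/kernel.h
roughly from memory, so treat this as a sketch: MAX_PAUSE presumably
expands to an int-typed expression while pause is unsigned long, and
the pointer comparison below is exactly what gcc complains about:)

	#define min(x, y) ({				\
		typeof(x) _min1 = (x);			\
		typeof(y) _min2 = (y);			\
		(void) (&_min1 == &_min2);		\
		_min1 < _min2 ? _min1 : _min2; })

min_t(type, x, y) casts both arguments to the given type first, so the
type check never fires.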

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a36f83d..a998931 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -886,7 +886,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
 		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
-		pause = min(pause, MAX_PAUSE);
+		pause = min_t(unsigned long, pause, MAX_PAUSE);
 
 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);

^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
@ 2011-08-06 14:48     ` Andrea Righi
  0 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-06 14:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. In the mean while, kick off the
> per-bdi flusher thread to do background writeback IO.
> 
> RATIONALS
> =========
> 
> - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
> 
>   If every thread doing writes and being throttled start foreground
>   writeback, it leads to N IO submitters from at least N different
>   inodes at the same time, end up with N different sets of IO being
>   issued with potentially zero locality to each other, resulting in
>   much lower elevator sort/merge efficiency and hence we seek the disk
>   all over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
> 
> - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
> 
>   With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
>   from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
> 
>   * "CPU usage has dropped by ~55%", "it certainly appears that most of
>     the CPU time saving comes from the removal of contention on the
>     inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
>     cacheline bouncing, because the new code is able to call much less
>     frequently into balance_dirty_pages() and hence access the global
>     page states)
> 
>   * the user space "App overhead" is reduced by 20%, by avoiding the
>     cacheline pollution by the complex writeback code path
> 
>   * "for a ~5% throughput reduction", "the number of write IOs have
>     dropped by ~25%", and the elapsed time reduced from 41:42.17 to
>     40:53.23.
> 
>   * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
>     and improves IO throughput from 38MB/s to 42MB/s.
> 
> - IO size too small for fast arrays and too large for slow USB sticks
> 
>   The write_chunk used by current balance_dirty_pages() cannot be
>   directly set to some large value (eg. 128MB) for better IO efficiency.
>   Because it could lead to more than 1 second user perceivable stalls.
>   Even the current 4MB write size may be too large for slow USB sticks.
>   The fact that balance_dirty_pages() starts IO on itself couples the
>   IO size to wait time, which makes it hard to do suitable IO size while
>   keeping the wait time under control.
> 
>   Now it's possible to increase writeback chunk size proportional to the
>   disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
>   the larger writeback size dramatically reduces the seek count to 1/10
>   (far beyond my expectation) and improves the write throughput by 24%.
> 
> - long block time in balance_dirty_pages() hurts desktop responsiveness
> 
>   Many of us may have the experience: it often takes a couple of seconds
>   or even long time to stop a heavy writing dd/cp/tar command with
>   Ctrl-C or "kill -9".
> 
> - IO pipeline broken by bumpy write() progress
> 
>   There are a broad class of "loop {read(buf); write(buf);}" applications
>   whose read() pipeline will be under-utilized or even come to a stop if
>   the write()s have long latencies _or_ don't progress in a constant rate.
>   The current threshold based throttling inherently transfers the large
>   low level IO completion fluctuations to bumpy application write()s,
>   and further deteriorates with increasing number of dirtiers and/or bdi's.
> 
>   For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
>   the rsync progresses very bumpy in legacy kernel, and throughput is
>   improved by 67% by this patchset. (plus the larger write chunk size,
>   it will be 93% speedup).
> 
>   The new rate based throttling can support 1000+ dd's with excellent
>   smoothness, low latency and low overheads.
> 
> For the above reasons, it's much better to do IO-less and low latency
> pauses in balance_dirty_pages().
> 
> Jan Kara, Dave Chinner and me explored the scheme to let
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. However it's found to have two problems:
> 
> - in large NUMA systems, the per-cpu counters may have big accounting
>   errors, leading to big throttle wait time and jitters.
> 
> - NFS may kill large amount of unstable pages with one single COMMIT.
>   Because NFS server serves COMMIT with expensive fsync() IOs, it is
>   desirable to delay and reduce the number of COMMITs. So it's not
>   likely to optimize away such kind of bursty IO completions, and the
>   resulted large (and tiny) stall times in IO completion based throttling.
> 
> So here is a pause time oriented approach, which tries to control the
> pause time in each balance_dirty_pages() invocations, by controlling
> the number of pages dirtied before calling balance_dirty_pages(), for
> smooth and efficient dirty throttling:
> 
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than   4ms, which burns CPU power)
> - avoid too large pause time (more than 200ms, which hurts responsiveness)
> - avoid big fluctuations of pause times
> 
> It can control pause times at will. The default policy will be to do
> ~10ms pauses in 1-dd case, and increase to ~100ms in 1000-dd case.
> 
> BEHAVIOR CHANGE
> ===============
> 
> (1) dirty threshold
> 
> Users will notice that the applications will get throttled once crossing
> the global (background + dirty)/2=15% threshold, and then balanced around
> 17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
> memory in 1-dd case.
> 
> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as performance "slow down" if his application
> happens to dirty more than 15% dirtyable memory.
> 
> (2) smoothness/responsiveness
> 
> Users will notice a more responsive system during heavy writeback.
> "killall dd" will take effect instantly.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---

Another minor nit below.

>  include/trace/events/writeback.h |   24 ----
>  mm/page-writeback.c              |  142 +++++++----------------------
>  2 files changed, 37 insertions(+), 129 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
> @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
>  				numerator, denominator);
>  }
>  
> -static inline void task_dirties_fraction(struct task_struct *tsk,
> -		long *numerator, long *denominator)
> -{
> -	prop_fraction_single(&vm_dirties, &tsk->dirties,
> -				numerator, denominator);
> -}
> -
> -/*
> - * task_dirty_limit - scale down dirty throttling threshold for one task
> - *
> - * task specific dirty limit:
> - *
> - *   dirty -= (dirty/8) * p_{t}
> - *
> - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> - * throttling individual tasks before reaching the bdi dirty limit.
> - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> - * dirty threshold may never get throttled.
> - */
> -#define TASK_LIMIT_FRACTION 8
> -static unsigned long task_dirty_limit(struct task_struct *tsk,
> -				       unsigned long bdi_dirty)
> -{
> -	long numerator, denominator;
> -	unsigned long dirty = bdi_dirty;
> -	u64 inv = dirty / TASK_LIMIT_FRACTION;
> -
> -	task_dirties_fraction(tsk, &numerator, &denominator);
> -	inv *= numerator;
> -	do_div(inv, denominator);
> -
> -	dirty -= inv;
> -
> -	return max(dirty, bdi_dirty/2);
> -}
> -
> -/* Minimum limit for any task */
> -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> -{
> -	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> -}
> -
>  /*
>   *
>   */
> @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
>   * perform some writeout.
>   */
>  static void balance_dirty_pages(struct address_space *mapping,
> -				unsigned long write_chunk)
> +				unsigned long pages_dirtied)
>  {
> -	unsigned long nr_reclaimable, bdi_nr_reclaimable;
> +	unsigned long nr_reclaimable;
>  	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
>  	unsigned long bdi_dirty;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long task_bdi_thresh;
> -	unsigned long min_task_bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
> +	unsigned long pause = 0;
>  	bool dirty_exceeded = false;
> -	bool clear_dirty_exceeded = true;
> +	unsigned long bw;
> +	unsigned long base_bw;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
>  	for (;;) {
> +		/*
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 */
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
>  		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> -		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> -		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
>  
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
>  		 * actually dirty; with m+n sitting in the percpu
>  		 * deltas.
>  		 */
> -		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
> -			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		if (bdi_thresh < 2 * bdi_stat_error(bdi))
> +			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat_sum(bdi, BDI_WRITEBACK);
> -		} else {
> -			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		else
> +			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat(bdi, BDI_WRITEBACK);
> -		}
>  
> -		/*
> -		 * The bdi thresh is somehow "soft" limit derived from the
> -		 * global "hard" limit. The former helps to prevent heavy IO
> -		 * bdi or process from holding back light ones; The latter is
> -		 * the last resort safeguard.
> -		 */
> -		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
> +		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
>  				  (nr_dirty > dirty_thresh);
> -		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
> -					(nr_dirty <= dirty_thresh);
> -
> -		if (!dirty_exceeded)
> -			break;
> -
> -		if (!bdi->dirty_exceeded)
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
>  				     bdi_thresh, bdi_dirty, start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_bw = bdi->dirty_ratelimit;
> +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> +					bdi_thresh, bdi_dirty);
> +		if (unlikely(bw == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> +		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
> +		pause = min(pause, MAX_PAUSE);
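
(To make the arithmetic just quoted concrete, here is a minimal
userspace model of the pause computation. It is only a sketch: HZ,
MAX_PAUSE and the sample numbers are assumptions for illustration,
not values taken from the kernel.)

	#include <stdio.h>

	#define HZ        1000              /* assumed jiffies per second */
	#define MAX_PAUSE (HZ / 5)          /* 200ms cap, per the changelog */

	/* pages_dirtied since the last call; bw in pages per second */
	static unsigned long model_pause(unsigned long pages_dirtied,
					 unsigned long bw)
	{
		/* "+ bw / 2" rounds to nearest; "| 1" avoids divide by zero */
		unsigned long pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);

		return pause < MAX_PAUSE ? pause : MAX_PAUSE;
	}

	int main(void)
	{
		/* 32 pages dirtied against 4000 pages/s -> 8 jiffies (~8ms) */
		printf("%lu\n", model_pause(32, 4000));
		/* a stalled bdi (bw == 0) saturates at the 200-jiffy cap */
		printf("%lu\n", model_pause(32, 0));
		return 0;
	}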

Fix this build warning:

 mm/page-writeback.c: In function ‘balance_dirty_pages’:
 mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a36f83d..a998931 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -886,7 +886,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
 		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
-		pause = min(pause, MAX_PAUSE);
+		pause = min_t(unsigned long, pause, MAX_PAUSE);
 
 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);

^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-06 16:46     ` Andrea Righi
  -1 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-06 16:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> As proposed by Chris, Dave and Jan, don't start foreground writeback IO
> inside balance_dirty_pages(). Instead, simply let it idle sleep for some
> time to throttle the dirtying task. Meanwhile, kick off the
> per-bdi flusher thread to do background writeback IO.
> 
> RATIONALE
> =========
> 
> - disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
> 
>   If every thread doing writes and being throttled starts foreground
>   writeback, it leads to N IO submitters from at least N different
>   inodes at the same time, ending up with N different sets of IO being
>   issued with potentially zero locality to each other, resulting in
>   much lower elevator sort/merge efficiency; hence we seek the disk
>   all over the place to service the different sets of IO.
>   OTOH, if there is only one submission thread, it doesn't jump between
>   inodes in the same way when congestion clears - it keeps writing to
>   the same inode, resulting in large related chunks of sequential IOs
>   being issued to the disk. This is more efficient than the above
>   foreground writeback because the elevator works better and the disk
>   seeks less.
> 
> - lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
> 
>   With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
>   from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
> 
>   * "CPU usage has dropped by ~55%", "it certainly appears that most of
>     the CPU time saving comes from the removal of contention on the
>     inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
>     cacheline bouncing, because the new code is able to call much less
>     frequently into balance_dirty_pages() and hence access the global
>     page states)
> 
>   * the user space "App overhead" is reduced by 20%, by avoiding the
>     cacheline pollution by the complex writeback code path
> 
>   * "for a ~5% throughput reduction", "the number of write IOs have
>     dropped by ~25%", and the elapsed time reduced from 41:42.17 to
>     40:53.23.
> 
>   * On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
>     and improves IO throughput from 38MB/s to 42MB/s.
> 
> - IO size too small for fast arrays and too large for slow USB sticks
> 
>   The write_chunk used by the current balance_dirty_pages() cannot be
>   directly set to some large value (eg. 128MB) for better IO efficiency,
>   because it could lead to user-perceivable stalls of more than 1 second.
>   Even the current 4MB write size may be too large for slow USB sticks.
>   The fact that balance_dirty_pages() starts IO by itself couples the
>   IO size to the wait time, which makes it hard to choose a suitable IO
>   size while keeping the wait time under control.
> 
>   Now it's possible to increase writeback chunk size proportional to the
>   disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
>   the larger writeback size dramatically reduces the seek count to 1/10
>   (far beyond my expectation) and improves the write throughput by 24%.
> 
> - long block time in balance_dirty_pages() hurts desktop responsiveness
> 
>   Many of us may have had the experience: it often takes a couple of
>   seconds or even longer to stop a heavily writing dd/cp/tar command
>   with Ctrl-C or "kill -9".
> 
> - IO pipeline broken by bumpy write() progress
> 
>   There is a broad class of "loop {read(buf); write(buf);}" applications
>   whose read() pipeline will be under-utilized or even come to a stop if
>   the write()s have long latencies _or_ don't progress at a constant rate.
>   The current threshold based throttling inherently transfers the large
>   low level IO completion fluctuations into bumpy application write()s,
>   and this further deteriorates with an increasing number of dirtiers
>   and/or bdi's.
> 
>   For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
>   the rsync progresses very bumpily in the legacy kernel, and throughput
>   is improved by 67% by this patchset (with the larger write chunk size
>   on top, it becomes a 93% speedup).
> 
>   The new rate based throttling can support 1000+ dd's with excellent
>   smoothness, low latency and low overheads.
> 
> For the above reasons, it's much better to do IO-less and low latency
> pauses in balance_dirty_pages().
> 
> Jan Kara, Dave Chinner and I explored the scheme of letting
> balance_dirty_pages() wait for enough writeback IO completions to
> safeguard the dirty limit. However, it was found to have two problems:
> 
> - in large NUMA systems, the per-cpu counters may have big accounting
>   errors, leading to big throttle wait time and jitters.
> 
> - NFS may kill a large amount of unstable pages with one single COMMIT.
>   Because the NFS server serves COMMIT with expensive fsync() IOs, it is
>   desirable to delay and reduce the number of COMMITs. So it's not
>   likely that such bursty IO completions can be optimized away, nor the
>   resulting large (and tiny) stall times in IO completion based throttling.
> 
> So here is a pause time oriented approach, which tries to control the
> pause time in each balance_dirty_pages() invocation, by controlling
> the number of pages dirtied before calling balance_dirty_pages(), for
> smooth and efficient dirty throttling:
> 
> - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> - avoid too small pause time (less than   4ms, which burns CPU power)
> - avoid too large pause time (more than 200ms, which hurts responsiveness)
> - avoid big fluctuations of pause times

I definitely agree that too small pauses must be avoided. However, I
don't quite understand from the code how the minimum sleep time is
regulated.

I've added a simple tracepoint (see below) to monitor the pause times in
balance_dirty_pages().

Sometimes I see very small pause times if I set a low dirty threshold
(<=32MB).

Example:

 # echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
 # iozone -A >/dev/null &
 # cat /sys/kernel/debug/tracing/trace_pipe
 ...
          iozone-2075  [001]   380.604961: writeback_dirty_throttle: 1
          iozone-2075  [001]   380.605966: writeback_dirty_throttle: 2
          iozone-2075  [001]   380.608405: writeback_dirty_throttle: 0
          iozone-2075  [001]   380.608980: writeback_dirty_throttle: 1
          iozone-2075  [001]   380.609952: writeback_dirty_throttle: 1
          iozone-2075  [001]   380.610952: writeback_dirty_throttle: 2
          iozone-2075  [001]   380.612662: writeback_dirty_throttle: 0
          iozone-2075  [000]   380.613799: writeback_dirty_throttle: 1
          iozone-2075  [000]   380.614771: writeback_dirty_throttle: 1
          iozone-2075  [000]   380.615767: writeback_dirty_throttle: 2
 ...

BTW, I can see this behavior only in the first minute while iozone is
running. After ~1min things seem to get stable (sleeps are usually
between 50ms and 200ms).

I wonder if we should also add an explicit check for the minimum
sleep time, as sketched below.
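
For concreteness, a minimal sketch of such a floor (the ~4ms value
follows the changelog's stated goal; MIN_PAUSE, HZ and the clamping
helper below are my assumptions, not code from any posted patch):

	#include <stdio.h>

	#define HZ        1000
	#define MIN_PAUSE (HZ / 250 > 0 ? HZ / 250 : 1) /* ~4ms, >= 1 jiffy */
	#define MAX_PAUSE (HZ / 5)                      /* 200ms */

	static unsigned long clamp_pause(unsigned long pause)
	{
		if (pause < MIN_PAUSE)
			pause = MIN_PAUSE;  /* avoid CPU-burning micro-sleeps */
		if (pause > MAX_PAUSE)
			pause = MAX_PAUSE;  /* keep the dirtier responsive */
		return pause;
	}

	int main(void)
	{
		printf("%lu\n", clamp_pause(1));    /* raised to the 4-jiffy floor */
		printf("%lu\n", clamp_pause(500));  /* capped at 200 jiffies */
		return 0;
	}

An alternative to clamping would be to raise nr_dirtied_pause so that
tasks dirty more pages between calls and the computed pause naturally
stays above the floor.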

Thanks,
-Andrea

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
---
 include/trace/events/writeback.h |   12 ++++++++++++
 mm/page-writeback.c              |    1 +
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 9c2cc8a..22b04b9 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -78,6 +78,18 @@ TRACE_EVENT(writeback_pages_written,
 	TP_printk("%ld", __entry->pages)
 );
 
+TRACE_EVENT(writeback_dirty_throttle,
+	TP_PROTO(unsigned long sleep),
+	TP_ARGS(sleep),
+	TP_STRUCT__entry(
+		__field(unsigned long, sleep)
+	),
+	TP_fast_assign(
+		__entry->sleep = sleep;
+	),
+	TP_printk("%u", jiffies_to_msecs(__entry->sleep))
+);
+
 DECLARE_EVENT_CLASS(writeback_class,
 	TP_PROTO(struct backing_dev_info *bdi),
 	TP_ARGS(bdi),
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a998931..e5a2664 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -889,6 +889,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 		pause = min_t(unsigned long, pause, MAX_PAUSE);
 
 pause:
+		trace_writeback_dirty_throttle(pause);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 

> 
> It can control pause times at will. The default policy will be to do
> ~10ms pauses in the 1-dd case, and increase to ~100ms in the 1000-dd case.
> 
> BEHAVIOR CHANGE
> ===============
> 
> (1) dirty threshold
> 
> Users will notice that the applications will get throttled once crossing
> the global (background + dirty)/2=15% threshold, and then balanced around
> 17.5%. Before the patch, the behavior was to just throttle at 20%
> dirtyable memory in the 1-dd case.
> 
> Since the task will be soft throttled earlier than before, it may be
> perceived by end users as a performance "slow down" if their application
> happens to dirty more than 15% of dirtyable memory.
> 
> (2) smoothness/responsiveness
> 
> Users will notice a more responsive system during heavy writeback.
> "killall dd" will take effect instantly.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/trace/events/writeback.h |   24 ----
>  mm/page-writeback.c              |  142 +++++++----------------------
>  2 files changed, 37 insertions(+), 129 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 11:17:26.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 16:16:30.000000000 +0800
> @@ -242,50 +242,6 @@ static void bdi_writeout_fraction(struct
>  				numerator, denominator);
>  }
>  
> -static inline void task_dirties_fraction(struct task_struct *tsk,
> -		long *numerator, long *denominator)
> -{
> -	prop_fraction_single(&vm_dirties, &tsk->dirties,
> -				numerator, denominator);
> -}
> -
> -/*
> - * task_dirty_limit - scale down dirty throttling threshold for one task
> - *
> - * task specific dirty limit:
> - *
> - *   dirty -= (dirty/8) * p_{t}
> - *
> - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> - * throttling individual tasks before reaching the bdi dirty limit.
> - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> - * dirty threshold may never get throttled.
> - */
> -#define TASK_LIMIT_FRACTION 8
> -static unsigned long task_dirty_limit(struct task_struct *tsk,
> -				       unsigned long bdi_dirty)
> -{
> -	long numerator, denominator;
> -	unsigned long dirty = bdi_dirty;
> -	u64 inv = dirty / TASK_LIMIT_FRACTION;
> -
> -	task_dirties_fraction(tsk, &numerator, &denominator);
> -	inv *= numerator;
> -	do_div(inv, denominator);
> -
> -	dirty -= inv;
> -
> -	return max(dirty, bdi_dirty/2);
> -}
> -
> -/* Minimum limit for any task */
> -static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> -{
> -	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> -}
> -
>  /*
>   *
>   */
> @@ -855,24 +811,28 @@ static unsigned long ratelimit_pages(uns
>   * perform some writeout.
>   */
>  static void balance_dirty_pages(struct address_space *mapping,
> -				unsigned long write_chunk)
> +				unsigned long pages_dirtied)
>  {
> -	unsigned long nr_reclaimable, bdi_nr_reclaimable;
> +	unsigned long nr_reclaimable;
>  	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
>  	unsigned long bdi_dirty;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long task_bdi_thresh;
> -	unsigned long min_task_bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
> +	unsigned long pause = 0;
>  	bool dirty_exceeded = false;
> -	bool clear_dirty_exceeded = true;
> +	unsigned long bw;
> +	unsigned long base_bw;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
>  	for (;;) {
> +		/*
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 */
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
>  		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
> @@ -888,8 +848,6 @@ static void balance_dirty_pages(struct a
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> -		min_task_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> -		task_bdi_thresh = task_dirty_limit(current, bdi_thresh);
>  
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -901,56 +859,38 @@ static void balance_dirty_pages(struct a
>  		 * actually dirty; with m+n sitting in the percpu
>  		 * deltas.
>  		 */
> -		if (task_bdi_thresh < 2 * bdi_stat_error(bdi)) {
> -			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		if (bdi_thresh < 2 * bdi_stat_error(bdi))
> +			bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat_sum(bdi, BDI_WRITEBACK);
> -		} else {
> -			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_dirty = bdi_nr_reclaimable +
> +		else
> +			bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
>  				    bdi_stat(bdi, BDI_WRITEBACK);
> -		}
>  
> -		/*
> -		 * The bdi thresh is somehow "soft" limit derived from the
> -		 * global "hard" limit. The former helps to prevent heavy IO
> -		 * bdi or process from holding back light ones; The latter is
> -		 * the last resort safeguard.
> -		 */
> -		dirty_exceeded = (bdi_dirty > task_bdi_thresh) ||
> +		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
>  				  (nr_dirty > dirty_thresh);
> -		clear_dirty_exceeded = (bdi_dirty <= min_task_bdi_thresh) &&
> -					(nr_dirty <= dirty_thresh);
> -
> -		if (!dirty_exceeded)
> -			break;
> -
> -		if (!bdi->dirty_exceeded)
> +		if (dirty_exceeded && !bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
>  				     bdi_thresh, bdi_dirty, start_time);
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_bw = bdi->dirty_ratelimit;
> +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> +					bdi_thresh, bdi_dirty);
> +		if (unlikely(bw == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> +		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
> +		pause = min(pause, MAX_PAUSE);
> +
> +pause:
>  		__set_current_state(TASK_UNINTERRUPTIBLE);
>  		io_schedule_timeout(pause);
> -		trace_balance_dirty_wait(bdi);
>  
>  		dirty_thresh = hard_dirty_limit(dirty_thresh);
>  		/*
> @@ -960,8 +900,7 @@ static void balance_dirty_pages(struct a
>  		 * (b) the pause time limit makes the dirtiers more responsive.
>  		 */
>  		if (nr_dirty < dirty_thresh +
> -			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
> -		    time_after(jiffies, start_time + MAX_PAUSE))
> +			       dirty_thresh / DIRTY_MAXPAUSE_AREA)
>  			break;
>  		/*
>  		 * pass-good area. When some bdi gets blocked (eg. NFS server
> @@ -974,18 +913,9 @@ static void balance_dirty_pages(struct a
>  			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
>  		    bdi_dirty < bdi_thresh)
>  			break;
> -
> -		/*
> -		 * Increase the delay for each loop, up to our previous
> -		 * default of taking a 100ms nap.
> -		 */
> -		pause <<= 1;
> -		if (pause > HZ / 10)
> -			pause = HZ / 10;
>  	}
>  
> -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> -	if (clear_dirty_exceeded && bdi->dirty_exceeded)
> +	if (!dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
>  	current->nr_dirtied = 0;
> @@ -1002,8 +932,10 @@ static void balance_dirty_pages(struct a
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> -	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> +	if (laptop_mode)
> +		return;
> +
> +	if (nr_reclaimable > background_thresh)
>  		bdi_start_background_writeback(bdi);
>  }
>  
> --- linux-next.orig/include/trace/events/writeback.h	2011-08-06 11:08:34.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2011-08-06 11:17:29.000000000 +0800
> @@ -104,30 +104,6 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
>  DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
>  DEFINE_WRITEBACK_EVENT(writeback_thread_start);
>  DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
> -DEFINE_WRITEBACK_EVENT(balance_dirty_start);
> -DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
> -
> -TRACE_EVENT(balance_dirty_written,
> -
> -	TP_PROTO(struct backing_dev_info *bdi, int written),
> -
> -	TP_ARGS(bdi, written),
> -
> -	TP_STRUCT__entry(
> -		__array(char,	name, 32)
> -		__field(int,	written)
> -	),
> -
> -	TP_fast_assign(
> -		strncpy(__entry->name, dev_name(bdi->dev), 32);
> -		__entry->written = written;
> -	),
> -
> -	TP_printk("bdi %s written %d",
> -		  __entry->name,
> -		  __entry->written
> -	)
> -);
>  
>  DECLARE_EVENT_CLASS(wbc_class,
>  	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
> 

^ permalink raw reply related	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-06 14:35     ` Andrea Righi
@ 2011-08-07  6:19       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-07  6:19 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 10:35:31PM +0800, Andrea Righi wrote:
> On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote:
> > Add two fields to task_struct.
> > 
> > 1) account dirtied pages in the individual tasks, for accuracy
> > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > 
> > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > scale near-sqrt with the safety gap between dirty pages and the threshold.
> > 
> > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > dirtying pages at exactly the same time, each task will be assigned a
> > large initial nr_dirtied_pause, so that the dirty threshold will be
> > exceeded long before each task has reached its nr_dirtied_pause and
> > hence called balance_dirty_pages().
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> A minor nitpick below.
> 
> Reviewed-by: Andrea Righi <andrea@betterlinux.com>

Thank you.

> > +/*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> >   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> 
> I think we should also fix the comment of balance_dirty_pages(), now
> that it's IO-less for the caller. Maybe something like:
> 
> /*
>  * balance_dirty_pages() must be called by processes which are generating dirty
>  * data.  It looks at the number of dirty pages in the machine and will force
>  * the caller to wait once crossing the dirty threshold. If we're over
>  * `background_thresh' then the writeback threads are woken to perform some
>  * writeout.
>  */

Good catch! I'll add this change to the next patch:

 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
  */

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-06 14:48     ` Andrea Righi
@ 2011-08-07  6:44       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-07  6:44 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

> > +             bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> > +             pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
> > +             pause = min(pause, MAX_PAUSE);
> 
> Fix this build warning:
> 
>  mm/page-writeback.c: In function ‘balance_dirty_pages’:
>  mm/page-writeback.c:889:11: warning: comparison of distinct pointer types lacks a cast

Thanks! I'll fix it by changing `pause' to "long", since we'll have
negative pause time anyway when considering think time compensation.
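
For reference, the warning comes from the type check inside the
kernel's min(), which is roughly:

	#define min(x, y) ({				\
		typeof(x) _min1 = (x);			\
		typeof(y) _min2 = (y);			\
		(void) (&_min1 == &_min2);	/* warns when the types differ */ \
		_min1 < _min2 ? _min1 : _min2; })

The dummy &_min1 == &_min2 comparison is between pointers to the two
argument types, so gcc warns whenever `pause' and the max pause value
have different types; once both are "long" the warning goes away.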

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-06 16:46     ` Andrea Righi
@ 2011-08-07  7:18     ` Wu Fengguang
  2011-08-07  9:50         ` Andrea Righi
  -1 siblings, 1 reply; 305+ messages in thread
From: Wu Fengguang @ 2011-08-07  7:18 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

[-- Attachment #1: Type: text/plain, Size: 3482 bytes --]

Andrea,

On Sun, Aug 07, 2011 at 12:46:56AM +0800, Andrea Righi wrote:
> On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:

> > So here is a pause time oriented approach, which tries to control the
> > pause time in each balance_dirty_pages() invocations, by controlling
> > the number of pages dirtied before calling balance_dirty_pages(), for
> > smooth and efficient dirty throttling:
> >
> > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > - avoid too small pause time (less than   4ms, which burns CPU power)
> > - avoid too large pause time (more than 200ms, which hurts responsiveness)
> > - avoid big fluctuations of pause times
> 
> I definitely agree that too small pauses must be avoided. However, I
> don't understand very well from the code how the minimum sleep time is
> regulated.

Thanks for pointing this out. Yes, the sleep time regulation is not
here and I should have mentioned that above. Since this is only the
core bits, there will be some followup patches to fix the rough edges.
(attached the two relevant patches)

> I've added a simple tracepoint (see below) to monitor the pause times in
> balance_dirty_pages().
> 
> Sometimes I see very small pause time if I set a low dirty threshold
> (<=32MB).

Yeah, it's definitely possible.

> Example:
> 
>  # echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
>  # iozone -A >/dev/null &
>  # cat /sys/kernel/debug/tracing/trace_pipe
>  ...
>           iozone-2075  [001]   380.604961: writeback_dirty_throttle: 1
>           iozone-2075  [001]   380.605966: writeback_dirty_throttle: 2
>           iozone-2075  [001]   380.608405: writeback_dirty_throttle: 0
>           iozone-2075  [001]   380.608980: writeback_dirty_throttle: 1
>           iozone-2075  [001]   380.609952: writeback_dirty_throttle: 1
>           iozone-2075  [001]   380.610952: writeback_dirty_throttle: 2
>           iozone-2075  [001]   380.612662: writeback_dirty_throttle: 0
>           iozone-2075  [000]   380.613799: writeback_dirty_throttle: 1
>           iozone-2075  [000]   380.614771: writeback_dirty_throttle: 1
>           iozone-2075  [000]   380.615767: writeback_dirty_throttle: 2
>  ...
> 
> BTW, I can see this behavior only in the first minute while iozone is
> running. After ~1min things seem to get stable (sleeps are usually
> between 50ms and 200ms).
> 

Yeah, it's roughly in line with this graph, where the red dots are the
pause time:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/512M/xfs-1dd-4k-8p-438M-20:10-3.0.0-next-20110802+-2011-08-06.11:03/balance_dirty_pages-pause.png

Note that the big change of pattern in the middle is due to a
deliberate disturbance: a dd is started at 100s, _reading_ 1GB of data,
which effectively livelocks the other dd dirtier task under the CFQ io
scheduler.

> I wonder if we shouldn't add an explicit check also for the minimum
> sleep time.
 
With the more complete patchset including the pause time regulation,
the pause time distribution should look much better, falling nicely
into the range (5ms, 20ms):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-pause.png

> +TRACE_EVENT(writeback_dirty_throttle,
> +       TP_PROTO(unsigned long sleep),
> +       TP_ARGS(sleep),

btw, I've just pushed two more tracing patches to the git tree.
Hope it helps :)

Thanks,
Fengguang

[-- Attachment #2: max-pause --]
[-- Type: text/plain, Size: 3065 bytes --]

Subject: writeback: limit max dirty pause time
Date: Sat Jun 11 19:21:43 CST 2011

Apply two policies to scale down the max pause time for

1) small number of concurrent dirtiers
2) small memory system (comparing to storage bandwidth)

MAX_PAUSE=200ms may only be suitable for high-end servers with lots of
concurrent dirtiers, where the large pause time can save considerable
overhead.

Otherwise, a smaller pause time is desirable whenever possible, so as
to get good responsiveness and a smooth user experience. It's actually
required for good disk utilization in the case where all the dirty
pages can be synced to disk within MAX_PAUSE=200ms.
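
As a sanity check with purely illustrative numbers (HZ=1000, 4k pages
assumed):

	write_bandwidth = 8192 pages/s  =>  hi = 13
	dirty_ratelimit = 1024 pages/s  =>  lo = 10	/* ~8 concurrent dirtiers */

	t = HZ/50 + (hi - lo) * (20 * HZ) / 1024
	  = 20 + 58 jiffies  ~= 80ms			/* "N * 20ms on 2^N tasks" */

	bdi_dirty = 2560 pages (10MB):
	t = min(t, 2560 >> (30 - 12 - ilog2(HZ)))	/* ~1ms per MB cap */
	  = min(78, 5) = 5 jiffies

so on such a small dirty pool the memory-size cap wins and the max
pause drops to a few milliseconds.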

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   43 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-07 14:23:45.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-07 14:25:29.000000000 +0800
@@ -856,6 +856,42 @@ static unsigned long ratelimit_pages(uns
 	return 1;
 }
 
+static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
+				   unsigned long bdi_dirty)
+{
+	unsigned long hi = ilog2(bdi->write_bandwidth);
+	unsigned long lo = ilog2(bdi->dirty_ratelimit);
+	unsigned long t;
+
+	/* target for ~10ms pause on 1-dd case */
+	t = HZ / 50;
+
+	/*
+	 * Scale up pause time for concurrent dirtiers in order to reduce CPU
+	 * overheads.
+	 *
+	 * (N * 20ms) on 2^N concurrent tasks.
+	 */
+	if (hi > lo)
+		t += (hi - lo) * (20 * HZ) / 1024;
+
+	/*
+	 * Limit pause time for small memory systems. If sleeping for too
+	 * long, a small pool of dirty/writeback pages may go empty and the
+	 * disk go idle.
+	 *
+	 * 1ms for every 1MB; may further consider bdi bandwidth.
+	 */
+	if (bdi_dirty)
+		t = min(t, bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ)));
+
+	/*
+	 * The pause time will be settled within range (max_pause/4, max_pause).
+	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
+	 */
+	return clamp_val(t, 4, MAX_PAUSE);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -873,6 +909,7 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	long pause = 0;
+	long max_pause;
 	bool dirty_exceeded = false;
 	unsigned long bw;
 	unsigned long base_bw;
@@ -930,16 +967,18 @@ static void balance_dirty_pages(struct a
 		if (unlikely(!writeback_in_progress(bdi)))
 			bdi_start_background_writeback(bdi);
 
+		max_pause = bdi_max_pause(bdi, bdi_dirty);
+
 		base_bw = bdi->dirty_ratelimit;
 		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
 					bdi_thresh, bdi_dirty);
 		if (unlikely(bw == 0)) {
-			pause = MAX_PAUSE;
+			pause = max_pause;
 			goto pause;
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
 		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
-		pause = min(pause, MAX_PAUSE);
+		pause = min(pause, max_pause);
 
 pause:
 		trace_balance_dirty_pages(bdi,

[-- Attachment #3: max-pause-adaption --]
[-- Type: text/plain, Size: 1829 bytes --]

Subject: writeback: control dirty pause time
Date: Sat Jun 11 19:32:32 CST 2011

The dirty pause time shall ultimately be controlled by adjusting
nr_dirtied_pause, since there is relationship

	pause = pages_dirtied / pos_bw

Assuming

	pages_dirtied ~= nr_dirtied_pause
	pos_bw ~= base_bw

We get

	nr_dirtied_pause ~= base_bw * desired_pause

Here base_bw is preferred over pos_bw because it's more stable.

It's also important to limit possible large transitional errors:

- bw is changing quickly
- pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
- pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
  separate fix, but still expect non-trivial errors)

So we end up using the above formula inside clamp_val().

The best test case for this code is to run 100 "dd bs=4M" tasks on
btrfs and check its pause time distribution.
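
Plugging illustrative numbers into the formula: with base_bw = 25600
pages/s (~100MB/s at 4k pages), HZ=1000 and max_pause = 200 jiffies,

	nr_dirtied_pause ~= base_bw * (max_pause/2) / HZ
	                  = 25600 * 100 / 1000
	                  = 2560 pages		/* ~10MB per interval */

so a task dirtying at full speed makes one balance_dirty_pages() call
per ~10MB and sleeps pause = pages_dirtied / pos_bw ~= 2560 / 25600 =
100ms there, right in the middle of the (max_pause/4, max_pause) range.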

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-07 14:51:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-07 15:02:08.000000000 +0800
@@ -1021,7 +1021,19 @@ pause:
 		bdi->dirty_exceeded = 0;
 
 	current->nr_dirtied = 0;
-	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
+	if (pause == 0)
+		current->nr_dirtied_pause =
+				ratelimit_pages(nr_dirty, dirty_thresh);
+	else if (pause < max_pause / 4)
+		current->nr_dirtied_pause = clamp_val(
+						base_bw * (max_pause/2) / HZ,
+						pages_dirtied + pages_dirtied/8,
+						pages_dirtied * 4);
+	else if (pause > max_pause)
+		current->nr_dirtied_pause = 1 | clamp_val(
+						base_bw * (max_pause*3/8) / HZ,
+						current->nr_dirtied_pause / 4,
+						current->nr_dirtied_pause*7/8);
 
 	if (writeback_in_progress(bdi))
 		return;

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-07  7:18     ` Wu Fengguang
@ 2011-08-07  9:50         ` Andrea Righi
  0 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-07  9:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal, linux-mm,
	LKML

On Sun, Aug 07, 2011 at 03:18:57PM +0800, Wu Fengguang wrote:
> Andrea,
> 
> On Sun, Aug 07, 2011 at 12:46:56AM +0800, Andrea Righi wrote:
> > On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> 
> > > So here is a pause time oriented approach, which tries to control the
> > > pause time in each balance_dirty_pages() invocations, by controlling
> > > the number of pages dirtied before calling balance_dirty_pages(), for
> > > smooth and efficient dirty throttling:
> > >
> > > - avoid useless (eg. zero pause time) balance_dirty_pages() calls
> > > - avoid too small pause time (less than   4ms, which burns CPU power)
> > > - avoid too large pause time (more than 200ms, which hurts responsiveness)
> > > - avoid big fluctuations of pause times
> > 
> > I definitely agree that too small pauses must be avoided. However, I
> > don't understand very well from the code how the minimum sleep time is
> > regulated.
> 
> Thanks for pointing this out. Yes, the sleep time regulation is not
> here and I should have mentioned that above. Since this is only the
> core bits, there will be some followup patches to fix the rough edges.
> (attached the two relevant patches)
> 
> > I've added a simple tracepoint (see below) to monitor the pause times in
> > balance_dirty_pages().
> > 
> > Sometimes I see very small pause time if I set a low dirty threshold
> > (<=32MB).
> 
> Yeah, it's definitely possible.
> 
> > Example:
> > 
> >  # echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
> >  # iozone -A >/dev/null &
> >  # cat /sys/kernel/debug/tracing/trace_pipe
> >  ...
> >           iozone-2075  [001]   380.604961: writeback_dirty_throttle: 1
> >           iozone-2075  [001]   380.605966: writeback_dirty_throttle: 2
> >           iozone-2075  [001]   380.608405: writeback_dirty_throttle: 0
> >           iozone-2075  [001]   380.608980: writeback_dirty_throttle: 1
> >           iozone-2075  [001]   380.609952: writeback_dirty_throttle: 1
> >           iozone-2075  [001]   380.610952: writeback_dirty_throttle: 2
> >           iozone-2075  [001]   380.612662: writeback_dirty_throttle: 0
> >           iozone-2075  [000]   380.613799: writeback_dirty_throttle: 1
> >           iozone-2075  [000]   380.614771: writeback_dirty_throttle: 1
> >           iozone-2075  [000]   380.615767: writeback_dirty_throttle: 2
> >  ...
> > 
> > BTW, I can see this behavior only in the first minute while iozone is
> > running. After ~1min things seem to get stable (sleeps are usually
> > between 50ms and 200ms).
> > 
> 
> Yeah, it's roughly in line with this graph, where the red dots are the
> pause time:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/512M/xfs-1dd-4k-8p-438M-20:10-3.0.0-next-20110802+-2011-08-06.11:03/balance_dirty_pages-pause.png
> 
> Note that the big change of pattern in the middle is due to a
> deliberate disturbance: a dd is started at 100s, _reading_ 1GB of data,
> which effectively livelocks the other dd dirtier task under the CFQ io
> scheduler.
> 
> > I wonder if we shouldn't add an explicit check also for the minimum
> > sleep time.
>  
> With the more complete patchset including the pause time regulation,
> the pause time distribution should look much better, falling nicely
> into the range (5ms, 20ms):
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-pause.png
> 
> > +TRACE_EVENT(writeback_dirty_throttle,
> > +       TP_PROTO(unsigned long sleep),
> > +       TP_ARGS(sleep),
> 
> btw, I've just pushed two more tracing patches to the git tree.
> Hope it helps :)

Perfect. Thanks for the clarification and the additional patches, I'm
going to test them right now.

-Andrea

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-08 13:46     ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-08 13:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +                                       unsigned long thresh,
> +                                       unsigned long dirty,
> +                                       unsigned long bdi_thresh,
> +                                       unsigned long bdi_dirty)
> +{
> +       unsigned long limit = hard_dirty_limit(thresh);
> +       unsigned long origin;
> +       unsigned long goal;
> +       unsigned long long span;
> +       unsigned long long pos_ratio;   /* for scaling up/down the rate limit */
> +
> +       if (unlikely(dirty >= limit))
> +               return 0;
> +
> +       /*
> +        * global setpoint
> +        */
> +       goal = thresh - thresh / DIRTY_SCOPE;
> +       origin = 4 * thresh;
> +
> +       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +               origin = limit;                 /* auxiliary control line */
> +               goal = (goal + origin) / 2;
> +               pos_ratio >>= 1; 

use before init?

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-08 13:47     ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-08 13:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> Add two fields to task_struct.
> 
> 1) account dirtied pages in the individual tasks, for accuracy
> 2) per-task balance_dirty_pages() call intervals, for flexibility
> 
> The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> scale near-sqrt to the safety gap between dirty pages and threshold.
> 
> XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> dirtying pages at exactly the same time, each task will be assigned a
> large initial nr_dirtied_pause, so that the dirty threshold will be
> exceeded long before each task reached its nr_dirtied_pause and hence
> call balance_dirty_pages().
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/sched.h |    7 ++
>  mm/memory_hotplug.c   |    3 -
>  mm/page-writeback.c   |  106 +++++++++-------------------------------
>  3 files changed, 32 insertions(+), 84 deletions(-) 

No fork() hooks? This way tasks inherit their parent's dirty count on
clone().

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 13:46     ` Peter Zijlstra
@ 2011-08-08 14:11       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 09:46:33PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +                                       unsigned long thresh,
> > +                                       unsigned long dirty,
> > +                                       unsigned long bdi_thresh,
> > +                                       unsigned long bdi_dirty)
> > +{
> > +       unsigned long limit = hard_dirty_limit(thresh);
> > +       unsigned long origin;
> > +       unsigned long goal;
> > +       unsigned long long span;
> > +       unsigned long long pos_ratio;   /* for scaling up/down the rate limit */
> > +
> > +       if (unlikely(dirty >= limit))
> > +               return 0;
> > +
> > +       /*
> > +        * global setpoint
> > +        */
> > +       goal = thresh - thresh / DIRTY_SCOPE;
> > +       origin = 4 * thresh;
> > +
> > +       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > +               origin = limit;                 /* auxiliary control line */
> > +               goal = (goal + origin) / 2;
> > +               pos_ratio >>= 1; 
> 
> use before init?

Yeah, it's embarrassing: this bug goes all the way back to the initial version...

It's actually dead code because (origin < limit) should never happen.
I feel so good being able to drop 5 more lines of code :)

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-08 22:04:48.000000000 +0800
@@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
 	goal = thresh - thresh / DIRTY_SCOPE;
 	origin = 4 * thresh;
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);
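
With the auxiliary line gone, the control line is purely linear between
the setpoint and origin. For illustration, taking DIRTY_SCOPE as 8:

	thresh = 1000 pages
	goal   = 1000 - 1000/8 = 875		/* global setpoint */
	origin = 4 * 1000     = 4000

	dirty =  875  =>  pos_ratio = (4000 -  875) / (4000 - 875) = 1.0
	dirty = 2437  =>  pos_ratio = (4000 - 2437) / (4000 - 875) ~= 0.5

(before the BANDWIDTH_CALC_SHIFT scaling), so the ratelimit runs at
full speed at the setpoint and is scaled down linearly as dirty pages
rise; dirty >= limit is already caught by the early "return 0".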

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 13:47     ` Peter Zijlstra
@ 2011-08-08 14:21       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 09:47:14PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > Add two fields to task_struct.
> > 
> > 1) account dirtied pages in the individual tasks, for accuracy
> > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > 
> > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > scale near-sqrt to the safety gap between dirty pages and threshold.
> > 
> > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > dirtying pages at exactly the same time, each task will be assigned a
> > large initial nr_dirtied_pause, so that the dirty threshold will be
> > exceeded long before each task reached its nr_dirtied_pause and hence
> > call balance_dirty_pages().
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/sched.h |    7 ++
> >  mm/memory_hotplug.c   |    3 -
> >  mm/page-writeback.c   |  106 +++++++++-------------------------------
> >  3 files changed, 32 insertions(+), 84 deletions(-) 
> 
> No fork() hooks? This way tasks inherit their parent's dirty count on
> clone().

Ah good point. Here is the quick fix.

Thanks,
Fengguang
---

--- linux-next.orig/kernel/fork.c	2011-08-08 22:11:59.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-08 22:18:05.000000000 +0800
@@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
 	p->pdeath_signal = 0;
 	p->exit_state = 0;
 
+	p->nr_dirtied = 0;
+	p->nr_dirtied_pause = 8;
+
 	/*
 	 * Ok, make it visible to the rest of the system.
 	 * We dont wake it up yet.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 13:47     ` Peter Zijlstra
@ 2011-08-08 14:23       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 09:47:14PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > Add two fields to task_struct.
> > 
> > 1) account dirtied pages in the individual tasks, for accuracy
> > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > 
> > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > scale near-sqrt to the safety gap between dirty pages and threshold.
> > 
> > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > dirtying pages at exactly the same time, each task will be assigned a
> > large initial nr_dirtied_pause, so that the dirty threshold will be
> > exceeded long before each task reached its nr_dirtied_pause and hence
> > call balance_dirty_pages().
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/sched.h |    7 ++
> >  mm/memory_hotplug.c   |    3 -
> >  mm/page-writeback.c   |  106 +++++++++-------------------------------
> >  3 files changed, 32 insertions(+), 84 deletions(-) 
> 
> No fork() hooks? This way tasks inherit their parent's dirty count on
> clone().

btw, I do have another patch queued for improving the "leaked dirties
on exit" case :)

Thanks,
Fengguang
---
Subject: writeback: charge leaked page dirties to active tasks
Date: Tue Apr 05 13:21:19 CST 2011

It's a years-long problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) may starve long-running
dirtiers (eg. dd) as well as push the dirty pages to the global hard
limit.

The solution is to charge the pages dirtied by an exiting gcc to the
other random gcc/dd instances. It's not perfect, however it should
behave well enough in practice.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 kernel/exit.c             |    2 ++
 mm/page-writeback.c       |   11 +++++++++++
 3 files changed, 15 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
@@ -7,6 +7,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+DECLARE_PER_CPU(int, dirty_leaks);
+
 /*
  * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
  *
--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:45:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-08 22:21:50.000000000 +0800
@@ -190,6 +190,7 @@ int dirty_ratio_handler(struct ctl_table
 	return ret;
 }
 
+DEFINE_PER_CPU(int, dirty_leaks) = 0;
 
 int dirty_bytes_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
@@ -1150,6 +1151,7 @@ void balance_dirty_pages_ratelimited_nr(
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	int ratelimit;
+	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
 		return;
@@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = 8;
 
+	preempt_disable();
+	p = &__get_cpu_var(dirty_leaks);
+	if (*p > 0 && current->nr_dirtied < ratelimit) {
+		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
+		*p -= nr_pages_dirtied;
+		current->nr_dirtied += nr_pages_dirtied;
+	}
+	preempt_enable();
+
 	if (unlikely(current->nr_dirtied >= ratelimit))
 		balance_dirty_pages(mapping, current->nr_dirtied);
 }
--- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
+++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
@@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	preempt_disable();
+	if (tsk->nr_dirtied)
+		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;
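
To illustrate the intended behaviour with made-up numbers: a cc1 that
exits having dirtied 5 pages since its last balance_dirty_pages() adds
5 to this CPU's dirty_leaks; the next dirtier on that CPU whose
nr_dirtied is still below its ratelimit absorbs those 5 pages into its
own nr_dirtied, so the exited task's dirtying still gets throttled in
somebody's balance_dirty_pages() interval instead of leaking past the
limit.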

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:23       ` Wu Fengguang
@ 2011-08-08 14:26         ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:23 +0800, Wu Fengguang wrote:
> +       preempt_disable();
> +       p = &__get_cpu_var(dirty_leaks);

 p = &get_cpu_var(dirty_leaks);

> +       if (*p > 0 && current->nr_dirtied < ratelimit) {
> +               nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +               *p -= nr_pages_dirtied;
> +               current->nr_dirtied += nr_pages_dirtied;
> +       }
> +       preempt_enable(); 

put_cpu_var(dirty_leaks);
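
i.e. the whole hunk becomes (get_cpu_var() does the preempt_disable()
for you, put_cpu_var() the preempt_enable()):

	p = &get_cpu_var(dirty_leaks);
	if (*p > 0 && current->nr_dirtied < ratelimit) {
		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
		*p -= nr_pages_dirtied;
		current->nr_dirtied += nr_pages_dirtied;
	}
	put_cpu_var(dirty_leaks);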

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:11       ` Wu Fengguang
  (?)
@ 2011-08-08 14:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> It's actually dead code because (origin < limit) should never happen.
> I feel so good being able to drop 5 more lines of code :) 

OK, but that leaves me trying to figure out what origin is, and why it's
4 * thresh.

I'm having a horrible time understanding this stuff.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:11       ` Wu Fengguang
  (?)
@ 2011-08-08 14:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
>         goal = thresh - thresh / DIRTY_SCOPE;
>         origin = 4 * thresh;
>  
> -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> -               origin = limit;                 /* auxiliary control line */
> -               goal = (goal + origin) / 2;
> -               pos_ratio >>= 1;
> -       }
>         pos_ratio = origin - dirty;
>         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>         do_div(pos_ratio, origin - goal + 1); 

So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all. 
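
(For what it's worth, the algebra does check out with DIRTY_SCOPE = 8,
which the 25/8 above implies: goal = 7t/8 and origin = 4t, so

	pos_ratio = (4t - d) / (4t - 7t/8) = 32/25 - 8d/(25t)

which is ~1.28 at d = 0 and exactly 1.0 at d = goal = 7t/8; the line is
anchored at the goal rather than the threshold, as the reply below
illustrates.)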



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:26         ` Peter Zijlstra
@ 2011-08-08 22:38           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 22:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:26:52PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:23 +0800, Wu Fengguang wrote:
> > +       preempt_disable();
> > +       p = &__get_cpu_var(dirty_leaks);
> 
>  p = &get_cpu_var(dirty_leaks);
> 
> > +       if (*p > 0 && current->nr_dirtied < ratelimit) {
> > +               nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > +               *p -= nr_pages_dirtied;
> > +               current->nr_dirtied += nr_pages_dirtied;
> > +       }
> > +       preempt_enable(); 
> 
> put_cpu_var(dirty_leaks);

Good to know these, thanks!

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:31         ` Peter Zijlstra
@ 2011-08-08 22:47           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 22:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:31:49PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > It's actually dead code because (origin < limit) should never happen.
> > I feel so good being able to drop 5 more lines of code :) 
> 
> OK, but that leaves me trying to figure out what origin is, and why it's
> 4 * thresh.

origin is where the control line crosses the X axis (in both the
global/bdi setpoint cases).

"4 * thresh" is merely something larger than max(dirty, thresh)
that yields a reasonably gentle slope. The steeper the slope, the
stronger the "gravity" pulling the dirty pages back to the setpoint.

> I'm having a horrible time understanding this stuff.

Sorry for that. Do you have more questions?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:41         ` Peter Zijlstra
@ 2011-08-08 23:05           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 23:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> >         goal = thresh - thresh / DIRTY_SCOPE;
> >         origin = 4 * thresh;
> >  
> > -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > -               origin = limit;                 /* auxiliary control line */
> > -               goal = (goal + origin) / 2;
> > -               pos_ratio >>= 1;
> > -       }
> >         pos_ratio = origin - dirty;
> >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >         do_div(pos_ratio, origin - goal + 1); 

FYI, I've updated the fix to the one below, so that @limit will be used
as the origin in the rare case of (4*thresh < dirty).

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
@@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
 	 * global setpoint
 	 */
 	goal = thresh - thresh / DIRTY_SCOPE;
-	origin = 4 * thresh;
+	origin = max(4 * thresh, limit);
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);

> So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
> comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all. 

This is the more meaningful view :)

                    origin - dirty
        pos_ratio = --------------
                    origin - goal

which comes from the below [*] control line, so that when (dirty == goal),
pos_ratio == 1.0:

 ^ pos_ratio
 |
 |
 |   *
 |      *
 |         *
 |            *
 |               *
 |                  *
 |                     *
 |                        *
 |                           *
 |                              *
 |                                 *
 .. pos_ratio = 1.0 ..................*
 |                                    .  *
 |                                    .     *
 |                                    .        *
 |                                    .           *
 |                                    .              *
 |                                    .                 *
 |                                    .                    *
 |                                    .                       *
 |                                    .                          *
 |                                    .                             *
 |                                    .                                *
 |                                    .                                   *
 |                                    .                                      *
 |                                    .                                         *
 |                                    .                                            *
 |                                    .                                               *
 +------------------------------------.--------------------------------------------------*---------------------->
 0                                   goal                                              origin         dirty pages
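
As a quick numeric check of this line (a standalone userspace sketch,
assuming DIRTY_SCOPE = 8 and BANDWIDTH_CALC_SHIFT = 10 as in the posted
patch; the thresh/limit values are made up):

	#include <stdio.h>

	int main(void)
	{
		unsigned long thresh = 1000, limit = 1200;	/* example numbers */
		unsigned long goal   = thresh - thresh / 8;	/* 875 */
		unsigned long origin = 4 * thresh > limit ? 4 * thresh : limit;
		unsigned long dirty[] = { goal, thresh, 2 * thresh, origin };
		int i;

		/* pos_ratio falls linearly from ~1.0 at the goal to 0 at the origin */
		for (i = 0; i < 4; i++) {
			unsigned long long pos_ratio =
				((unsigned long long)(origin - dirty[i]) << 10) /
				(origin - goal + 1);
			printf("dirty=%4lu  pos_ratio=%.3f\n",
			       dirty[i], pos_ratio / 1024.0);
		}
		return 0;
	}

It prints ~1.0 at dirty == goal, ~0.96 at dirty == thresh and 0 at
dirty == origin, i.e. tasks at the setpoint run at the full
bdi->dirty_ratelimit and the throttle only fully closes at 4*thresh;
that is the gentle slope mentioned above.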

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:21       ` Wu Fengguang
@ 2011-08-08 23:32         ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-08 23:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> --- linux-next.orig/kernel/fork.c	2011-08-08 22:11:59.000000000 +0800
> +++ linux-next/kernel/fork.c	2011-08-08 22:18:05.000000000 +0800
> @@ -1301,6 +1301,9 @@ static struct task_struct *copy_process(
>  	p->pdeath_signal = 0;
>  	p->exit_state = 0;
>  
> +	p->nr_dirtied = 0;
> +	p->nr_dirtied_pause = 8;

Hmm, it looks better to allow a new task to dirty 128KB without being
throttled, as long as the system is not in the dirty-exceeded state. So
I changed the last line to this:

+	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
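
(128 >> (PAGE_SHIFT - 10) is just 128KB expressed in pages: with the
common 4KB pages PAGE_SHIFT == 12, so the initial allowance works out
to 128 >> 2 = 32 pages.)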

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-06  8:44 ` Wu Fengguang
@ 2011-08-09  2:01   ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09  2:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> Hi all,
> 
> The _core_ bits of the IO-less balance_dirty_pages().
> Heavily simplified and re-commented to make it easier to review.
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8
> 
> Only the bare minimal algorithms are presented, so you will find some rough
> edges in the graphs below. But it's usable :)
> 
> 	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/
> 
> And an introduction to the (more complete) algorithms:
> 
> 	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf
> 
> Questions and reviews are highly appreciated!

Hi Wu,

I am going through slide 39, where you talk about the approach being
future proof and usable for IO control purposes. You have listed the
following merits of this approach.

* per-bdi nature, works on NFS and Software RAID
* no delayed response (working at the right layer)
* no page tracking, hence decoupled from memcg
* no interactions with FS and CFQ
* get proportional IO controller for free
* reuse/inherit all the base facilities/functions

I would say that it would also be a good idea to list the demerits of
this approach in its current form, the main one being that it only
deals with controlling buffered write IO and nothing else. So on the
same block device, direct writes might be going on from the same group,
and in this scheme a user will not have any control over them. Another
disadvantage is that throttling at the page cache level does not take
care of IO spikes at the device level.

Now I think one could probably come up with a more sophisticated scheme
where throttling is done at the bdi level but is also accounted at the
device level by the IO controller. (I had done something similar in the
past, but Dave Chinner did not like it.)

Anyway, keeping track of per-cgroup rates and throttling accordingly
can definitely help implement an algorithm for per-cgroup IO control.
We probably just need to find a reasonable way to account all this IO
to the end device so that we have control of all kinds of IO of a
cgroup.

How do you implement proportional control here? By regularly varying
per-cgroup bandwidth, based on cgroup weight, out of the overall bdi
bandwidth? Again, the issue here is that it controls only buffered
WRITES and nothing else, and in this case coordinating with CFQ will
probably be hard. So I guess proportional IO control just for buffered
WRITES will be of limited use.

Thanks
Vivek




> 
> shortlog:
> 
> 	Wu Fengguang (5):
> 	      writeback: account per-bdi accumulated dirtied pages
> 	      writeback: dirty position control
> 	      writeback: dirty rate control
> 	      writeback: per task dirty rate limit
> 	      writeback: IO-less balance_dirty_pages()
> 
> 	The last 4 patches are one single logical change, but splitted here to
> 	make it easier to review the different parts of the algorithm.
> 
> diffstat:
> 
> 	 include/linux/backing-dev.h      |    8 +
> 	 include/linux/sched.h            |    7 +
> 	 include/trace/events/writeback.h |   24 --
> 	 mm/backing-dev.c                 |    3 +
> 	 mm/memory_hotplug.c              |    3 -
> 	 mm/page-writeback.c              |  459 ++++++++++++++++++++++----------------
> 	 6 files changed, 290 insertions(+), 214 deletions(-)
> 
> Thanks,
> Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09  2:08     ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09  2:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:49PM +0800, Wu Fengguang wrote:
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> For simplicity, only the global/bdi setpoint control lines are
> implemented here, so the [*] curve is straighter than the ideal one
> shown in the above figure.
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 

IMHO, "position_ratio" is not necessarily very intuitive. Can there be
a better name? Based on your slides, it is a scaling factor applied to
the task rate limit depending on how well we are doing in terms of
meeting our dirty limit goal. Would "dirty_rate_scale_factor" or
something like that make sense and be a little more intuitive?

Thanks
Vivek
 

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define BANDWIDTH_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + *  When the number of dirty pages goes higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.
> + *
> + *                              setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------|-----------|
> + * ^                               ^                               ^           ^
> + * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
> + *
> + *                          bdi setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------------------|
> + * ^                               ^                                           ^
> + * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
> + *
> + * (o) pseudo code
> + *
> + *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
> + *
> + *     if (dirty < thresh) scale up   pos_ratio
> + *     if (dirty > thresh) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
> + *
> + * (o) global/bdi control lines
> + *
> + * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
> + * several control lines in turn.
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * If any control line drops below Y=0 before reaching @limit, an auxiliary
> + * line will be setup to connect them. The below figure illustrates the main
> + * bdi control line with an auxiliary line extending it to @limit.
> + *
> + * This allows smoothly throttling bdi_dirty down to normal if it starts high
> + * in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
> + * - the bdi dirty thresh goes down quickly due to change of JBOD workload
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, bw scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, bw scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 bdi setpoint                 bdi origin            limit
> + *
> + * The bdi control line: if (origin < limit), an auxiliary control line (*)
> + * will be setup to extend the main control line (o) to @limit.
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long origin;
> +	unsigned long goal;
> +	unsigned long long span;
> +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 */
> +	goal = thresh - thresh / DIRTY_SCOPE;
> +	origin = 4 * thresh;
> +
> +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +		origin = limit;			/* auxiliary control line */
> +		goal = (goal + origin) / 2;
> +		pos_ratio >>= 1;
> +	}
> +	pos_ratio = origin - dirty;
> +	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	/*
> +	 * bdi setpoint
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> +	/*
> +	 * Use span=(4*bw) in the single disk case and transition to bdi_thresh in
> +	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
> +	 * Otherwise the bdi write bandwidth is good for limiting the floating
> +	 * area, which makes the bdi control line a good backup when the global
> +	 * control line is too flat/weak in large memory systems.
> +	 */
> +	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
> +		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
> +	do_div(span, thresh + 1);
> +	origin = goal + 2 * span;
> +
> +	if (unlikely(bdi_dirty > goal + span)) {
> +		if (bdi_dirty > limit)
> +			return 0;
> +		if (origin < limit) {
> +			origin = limit;		/* auxiliary control line */
> +			goal += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= origin - bdi_dirty;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> 

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-09  2:01   ` Vivek Goyal
@ 2011-08-09  5:55     ` Dave Chinner
  -1 siblings, 0 replies; 305+ messages in thread
From: Dave Chinner @ 2011-08-09  5:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Greg Thelen, Minchan Kim, Andrea Righi,
	linux-mm, LKML

On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> > Hi all,
> > 
> > The _core_ bits of the IO-less balance_dirty_pages().
> > Heavily simplified and re-commented to make it easier to review.
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8
> > 
> > Only the bare minimal algorithms are presented, so you will find some rough
> > edges in the graphs below. But it's usable :)
> > 
> > 	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/
> > 
> > And an introduction to the (more complete) algorithms:
> > 
> > 	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf
> > 
> > Questions and reviews are highly appreciated!
> 
> Hi Wu,
> 
> I am going through slide 39, where you talk about the approach being
> future proof and usable for IO control purposes. You have listed the
> following merits of this approach.
> 
> * per-bdi nature, works on NFS and Software RAID
> * no delayed response (working at the right layer)
> * no page tracking, hence decoupled from memcg
> * no interactions with FS and CFQ
> * get proportional IO controller for free
> * reuse/inherit all the base facilities/functions
> 
> I would say that it would also be a good idea to list the demerits of
> this approach in its current form, the main one being that it only
> deals with controlling buffered write IO and nothing else.

That's not a demerit - that is all it is designed to do.

> So on the same block device, direct writes might be going on from the
> same group, and in this scheme a user will not have any control over
> them.

But it is taken into account by the IO write throttling.

> Another disadvantage is that throttling at the page cache level does
> not take care of IO spikes at the device level.

And that is handled as well.

How? By the indirect effect other IO and IO spikes have on the
writeback rate. That is, other IO reduces the writeback bandwidth,
which then changes the throttling parameters via feedback loops.

The buffered write throttle is designed to reduce the page cache
dirtying rate to the current cleaning rate of the backing device.
Increase the cleaning rate (i.e. the device is otherwise idle) and
it will throttle less. Decrease the cleaning rate (i.e. other IO
spikes or block IO throttle activates) and it will throttle more.

We have to vary buffered write throttling like this to adapt to
changing IO workloads (e.g. someone starting a read-heavy workload
will slow down the writeback rate, so we need to throttle buffered
writes more aggressively), so it has to be independent of any sort
of block layer IO controller.

Simply put: the block IO controller still has direct control over
the rate at which buffered writes drain out of the system. The
IO-less write throttle simply limits the rate at which buffered
writes come into the system to match whatever the IO path allows to
drain out....
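
As a toy illustration of that feedback (a standalone sketch with
made-up numbers, not kernel code): when competing IO halves the
measured cleaning rate, the dirtying ratelimit converges down to match
it, with no explicit coupling between the layers:

	#include <stdio.h>

	int main(void)
	{
		double cleaning_rate = 100.0;	/* pages/s the device cleans */
		double ratelimit = 100.0;	/* pages/s tasks may dirty */
		int t;

		for (t = 0; t < 12; t++) {
			if (t == 4)
				cleaning_rate = 50.0;	/* IO spike / read-heavy load */
			/* feedback: pull the throttle toward the observed rate */
			ratelimit += (cleaning_rate - ratelimit) / 4;
			printf("t=%2d  clean=%5.1f  ratelimit=%5.1f\n",
			       t, cleaning_rate, ratelimit);
		}
		return 0;
	}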

> Now I think one could probably come up with a more sophisticated scheme
> where throttling is done at the bdi level but is also accounted at the
> device level by the IO controller. (I had done something similar in the
> past, but Dave Chinner did not like it.)

I don't like it because it is a solution to a specific problem and
requires complex coupling across multiple layers of the system. We
are trying to move away from that throttling model. More
fundamentally, though, it is not a general solution to the
entire class of "IO writeback rate changed" problems that buffered
write throttling needs to solve.

> Anyway, keeping track of per-cgroup rates and throttling accordingly
> can definitely help implement an algorithm for per-cgroup IO control.
> We probably just need to find a reasonable way to account all this IO
> to the end device so that we have control of all kinds of IO of a cgroup.
> How do you implement proportional control here? By regularly varying
> per-cgroup bandwidth, based on cgroup weight, out of the overall bdi
> bandwidth? Again, the issue here is that it controls only buffered
> WRITES and nothing else, and in this case coordinating with CFQ will
> probably be hard. So I guess proportional IO control just for buffered
> WRITES will be of limited use.

The whole point of doing the throttling this way is that we don't
need any sort of special connection between block IO throttling and
page cache (buffered write) throttling. We significantly reduce the
coupling between the layers by relying on feedback-driven control
loops to determine the buffered write throttling thresholds
adaptively. IOWs, the IO-less write throttling at the page cache
will adjust automatically to whatever throughput the block IO
throttling allows async writes to achieve.

However, before we have a "finished product", there is still another
piece of the puzzle to be put in place - memcg-aware buffered
writeback. That is, having a flusher thread do work on behalf of
memcg in the IO context of the memcg. Then the IO controller just
sees a stream of async writes in the context of the memcg the
buffered writes came from in the first place. The block layer
throttles them just like any other IO in the IO context of the
memcg...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 22:47           ` Wu Fengguang
@ 2011-08-09  9:31             ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09  9:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> origin is where the control line crosses the X axis (in both the
> global/bdi setpoint cases). 

Ah, that's normally called a zero, root, or x-intercept:

http://en.wikipedia.org/wiki/X-intercept

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
@ 2011-08-09 10:32             ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 10:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 07:05 +0800, Wu Fengguang wrote:
> This is the more meaningful view :)
> 
>                     origin - dirty
>         pos_ratio = --------------
>                     origin - goal 

> which comes from the below [*] control line, so that when (dirty == goal),
> pos_ratio == 1.0:

OK, so basically you want a linear function for which:

f(goal) = 1 and has a root somewhere > goal.

(that one line is much more informative than all your graphs put
together, one can start from there and derive your function)
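
Spelled out: the linear function with f(origin) = 0 (the root) and
f(goal) = 1 is

	f(dirty) = (origin - dirty) / (origin - goal)

since f(goal) = (origin - goal) / (origin - goal) = 1, which is exactly
the pos_ratio quoted above.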

That does indeed get you the above function, now what does it mean?

> + *  When the number of dirty pages go higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.

(you seem inconsistent with your terminology, I think goal and setpoint
are interchanged? I looked up set point and it's a term from control
system theory, so I'll chalk that up to my own ignorance..)

Ok, so higher dirty -> lower position ratio -> lower dirty rate (and
the inverse), now what does that do...

/me goes read other patches in search of more clues.. I'm starting to
dislike graphs.. why not simply state where those things come from,
that's much easier.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-09  5:55     ` Dave Chinner
@ 2011-08-09 14:04       ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09 14:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Greg Thelen, Minchan Kim, Andrea Righi,
	linux-mm, LKML

On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote:
> On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote:
> > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> > > Hi all,
> > > 
> > > The _core_ bits of the IO-less balance_dirty_pages().
> > > Heavily simplified and re-commented to make it easier to review.
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8
> > > 
> > > Only the bare minimal algorithms are presented, so you will find some rough
> > > edges in the graphs below. But it's usable :)
> > > 
> > > 	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/
> > > 
> > > And an introduction to the (more complete) algorithms:
> > > 
> > > 	http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf
> > > 
> > > Questions and reviews are highly appreciated!
> > 
> > Hi Wu,
> > 
> > I am going through the slide number 39 where you talk about it being
> > future proof and it can be used for IO control purposes. You have listed
> > following merits of this approach.
> > 
> > * per-bdi nature, works on NFS and Software RAID
> > * no delayed response (working at the right layer)
> > * no page tracking, hence decoupled from memcg
> > * no interactions with FS and CFQ
> > * get proportional IO controller for free
> > * reuse/inherit all the base facilities/functions
> > 
> > I would say that it will also be a good idea to list the demerits of
> > this approach in current form and that is that it only deals with
> > controlling buffered write IO and nothing else.
> 
> That's not a demerit - that is all it is designed to do.

It is designed to improve the existing task throttling functionality and
we are trying to extend the same to cgroups too. So if by design something
does not gel well with the existing pieces, it is a demerit to me. At least
there should be a good explanation of the design intention and how it is
going to be useful.

For example, how is this going to gel with the existing IO controller?
Are you going to create two separate mechanisms: one for controlling
writes while entering the cache and another for controlling the writes
at the device level?

The fact that this mechanism does not know about any other IO in the
system/cgroup is a limiting factor. From a usability point of view, a
user expects to control any kind of IO happening from a group.

So are we planning to create a new controller? Or add additional files
in the existing controller to control the per-cgroup write throttling
behavior? Even if we create additional files, a user is then forced to
set separate write policies for buffered writes and direct writes. I was
hoping a better interface would be that the user puts a policy on
writes, that takes effect, and the user does not have to worry about
whether the applications inside the cgroup are doing buffered writes or
direct writes.

> 
> > So on the same block device, other direct writes might be going on
> > from same group and in this scheme a user will not have any
> > control.
> 
> But it is taken into account by the IO write throttling.

You mean the blkio controller?

It does. But my complaint is that we are controlling two separate
knobs for two kinds of IO, and I am trying to come up with a single
knob.

The current interface for write control in the blkio controller looks like:

blkio.throtl.write_bps_device

One can write to this file, specifying the write limit of a cgroup
on a particular device. I was hoping that buffered write limits would
come out of the same limit, but with these patches it looks like we
shall have to create a new interface altogether which controls just
buffered writes and nothing else, and the user is supposed to know what
his application is doing and configure the limits accordingly.
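
(For reference, what goes into that file is "<major>:<minor> <bytes per
second>" for the block device in question, e.g. "8:16 10485760" to cap
a group's writes to /dev/sdb at 10MB/s.)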

So my concern is how the overall interface would look, how well it
will work with the existing controller, and how a user is supposed to
use it.

In fact the current IO controller does throttling at the device level,
so the interface is device specific. One is supposed to know the major
and minor number of the device to specify. I am not sure what one is
supposed to do in this case, as it is bdi specific, and in the NFS case
there is no device. So is one supposed to specify a bdi, or are the
limits going to be global (system wide, independent of bdi or block
device)?

> 
> > Another disadvantage is that throttling at page cache
> > level does not take care of IO spikes at device level.
> 
> And that is handled as well.
> 
> How? By the indirect effect other IO and IO spikes have on the
> writeback rate. That is, other IO reduces the writeback bandwidth,
> which then changes the throttling parameters via feedback loops.

Actually I was referring to the effect of buffered writes on other IO
going on at the device. With the control being at the device level, one
can tightly control the WRITEs flowing out of a cgroup to the LUN, and
that helps in knowing how bad it will be for other reads going on at
the LUN.

With this scheme, flusher threads can suddenly throw tons of writes
at the LUN and then issue no IO for another few seconds. So basically
IO is bursty at the device level, and doing control at the device level
can smooth it out.

So we have two ways to control buffered writes:

- Throttle them while entering the page cache
- Throttle them at device and feedback loop in turn throttles them at
  page cache level based on dirty ratio.

Andrea and I had implemented the first approach (the same as what Wu is
suggesting now, with a different mechanism), and the following was your
response.

https://lkml.org/lkml/2011/6/28/494

To me it looked like at that point in time you preferred precise
throttling at the device level, and now you seem to prefer precise
throttling at the page cache level?

Again, I am not against cgroup-parameter-based throttling at the page
cache level. It simplifies the implementation and probably is good
enough for lots of people. I am only worried about the interface and
how it works with the existing interfaces.

In absolute throttling one does not have to care about feedback or
what the underlying bdi bandwidth is. So to me these patches are
good for work-conserving IO control, where we want to determine how
fast we can write to the device and then throttle tasks accordingly.
But in absolute throttling one specifies the upper limit, and there we
don't need a mechanism to determine the bdi bandwidth or how many
dirty pages there are in order to throttle tasks.

> 
> The buffered write throttle is designed to reduce the page cache
> dirtying rate to the current cleaning rate of the backing device
> is. Increase the cleaning rate (i.e. device is otherwise idle) and
> it will throttle less. Decrease the cleaning rate (i.e. other IO
> spikes or block IO throttle activates) and it will throttle more.
> 
> We have to do vary buffered write throttling like this to adapt to
> changing IO workloads (e.g.  someone starting a read-heavy workload
> will slow down writeback rate, so we need to throttle buffered
> writes more aggressively), so it has to be independent of any sort
> of block layer IO controller.
> 
> Simply put: the block IO controller still has direct control over
> the rate at which buffered writes drain out of the system. The
> IO-less write throttle simply limits the rate at which buffered
> writes come into the system to match whatever the IO path allows to
> drain out....

Ok, this makes sense. So it goes back to the previous design, where
absolute cgroup-based control happens at the device level and the
IO-less throttle implements the feedback loop to slow down the writes
into the page cache. That makes sense. But Wu's slides suggest that one
can directly implement cgroup-based IO control in the IO-less
throttling, and that's where I have concerns.

Anyway, this stuff shall have to be made cgroup aware, so that tasks
of different groups can see different throttling depending on how
much IO that group is able to do at the device level.

> 
> > Now I think one could probably come up with more sophisticated scheme
> > where throttling is done at bdi level but is also accounted at device
> > level at IO controller. (Something similar I had done in the past but
> > Dave Chinner did not like it).
> 
> I don't like it because it is solution to a specific problem and
> requires complex coupling across multiple layers of the system. We
> are trying to move away from that throttling model. More
> fundamentally, though, is that it is not a general solution to the
> entire class of "IO writeback rate changed" problems that buffered
> write throttling needs to solve.
> 
> > Anyway, keeping track of per cgroup rate and throttling accordingly
> > can definitely help implement an algorithm for per cgroup IO control.
> > We probably just need to find a reasonable way to account all this
> > IO to end device so that we have control of all kind of IO of a cgroup.
> > How do you implement proportional control here? From overall bdi bandwidth
> > vary per cgroup bandwidth regularly based on cgroup weight? Again the
> > issue here is that it controls only buffered WRITES and nothing else and
> > in this case co-ordinating with CFQ will probably be hard. So I guess
> > usage of proportional IO just for buffered WRITES will have limited
> > usage.
> 
> The whole point of doing the throttling this way is that we don't
> need any sort of special connection between block IO throttling and
> page cache (buffered write) throttling. We significantly reduce the
> coupling between the layers by relying on feedback-driven control
> loops to determine the buffered write throttling thresholds
> adaptively. IOWs, the IO-less write throttling at the page cache
> will adjust automatically to whatever throughput the block IO
> throttling allows async writes to achieve.

This is good. But that's not the impression one gets from Wu's slides.

> 
> However, before we have a "finished product", there is still another
> piece of the puzzle to be put in place - memcg-aware buffered
> writeback. That is, having a flusher thread do work on behalf of
> memcg in the IO context of the memcg. Then the IO controller just
> sees a stream of async writes in the context of the memcg the
> buffered writes came from in the first place. The block layer
> throttles them just like any other IO in the IO context of the
> memcg...

Yes, that is still a piece remaining. I was hoping that Greg Thelen would
be able to extend his patches to submit writes in the context of
per-cgroup flusher/worker threads and solve this problem.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 14:54     ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09 14:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:
> It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
> when there are N dd tasks.
> 
> On write() syscall, use bdi->dirty_ratelimit
> ============================================
> 
>     balance_dirty_pages(pages_dirtied)
>     {
>         pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
>         pause = pages_dirtied / pos_bw;
>         sleep(pause);
>     }
> 
> On every 200ms, update bdi->dirty_ratelimit
> ===========================================
> 
>     bdi_update_dirty_ratelimit()
>     {
>         bw = bdi->dirty_ratelimit;
>         ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
>         if (dirty pages unbalanced)
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
>     }
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)
> 
> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms
> 
> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                      	(1)
> 
> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw       	(2)
> 
> From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                      	(3)
> 
> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Hi Fengguang,

So how much work is it to extend all this to handle the case of cgroups?
IOW, I would imagine that you shall have to keep track of per-cgroup/per-bdi
state for many of the variables. For example, write_bw will become a
per-cgroup/per-bdi entity instead of a per-bdi entity only. The same
should be true for the position ratio, dirty_bw, etc.?

I am assuming that if some cgroup has a low weight on the end device,
then the WRITE bandwidth of that cgroup should go down, that should be
accounted for in the per-bdi state, and task throttling should happen
accordingly, so that tasks of a lower-weight cgroup get throttled more
than tasks of a higher-weight cgroup?
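
A sketch of what that per-cgroup bookkeeping might look like (purely
illustrative; no such structure exists in these patches):

	/* hypothetical per-(cgroup, bdi) writeback state */
	struct memcg_bdi_writeback {
		unsigned long	write_bw;	 /* cgroup writeout bandwidth on this bdi */
		unsigned long	dirty_bw;	 /* cgroup dirtying rate on this bdi */
		unsigned long	dirty_ratelimit; /* cgroup base throttle rate */
	};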

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 14:57     ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 14:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> Estimation of balanced bdi->dirty_ratelimit
> ===========================================
> 
> When started N dd, throttle each dd at
> 
>          task_ratelimit = pos_bw (any non-zero initial value is OK)

This is (0), since it makes (1). But it fails to explain what the
difference is between task_ratelimit and pos_bw (and why positional
bandwidth is a good name).

> After 200ms, we got
> 
>          dirty_bw = # of pages dirtied by app / 200ms
>          write_bw = # of pages written to disk / 200ms

Right, so that I get. And our premise for the whole work is to delay
applications so that we match the dirty_bw to the write_bw, right?

> For aggressive dirtiers, the equality holds
> 
>          dirty_bw == N * task_ratelimit
>                   == N * pos_bw                         (1)

So dirty_bw is in pages/s, and task_ratelimit should also be in pages/s,
since N is a unit-less number.

What does task_ratelimit in pages/s mean? Since we make the tasks sleep,
the only thing we can make from this is a measure of pages. So I expect
(in a later patch) we compute the sleep time from the amount of pages we
want written out, using this ratelimit measure, right?
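
A quick worked example with made-up numbers: at task_ratelimit = 25600
pages/s (100MB/s with 4k pages), a task that dirtied 256 pages since it
last slept would pause for

	pause = 256 pages / (25600 pages/s) = 10ms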

> The balanced throttle bandwidth can be estimated by
> 
>          ref_bw = pos_bw * write_bw / dirty_bw          (2)

Here you introduce reference bandwidth, what does it mean and what is
its relation to positional bandwidth. Going by the equation, we got
(pages/s * pages/s) / (pages/s) so we indeed have a bandwidth unit.

write_bw/dirty_bw is the ratio between output and input of dirty pages,
but what is pos_bw, and what does that make ref_bw?

> From (1) and (2), we get equality
> 
>          ref_bw == write_bw / N                         (3)

Somehow this seems like the primary postulate, yet you present it like a
derivation. The whole purpose of your control system is to provide this
fairness between processes, therefore I would expect you start out with
this postulate and reason therefrom.
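
(Plugging in made-up numbers as a sanity check: with N = 2 tasks each
throttled at pos_bw = 100MB/s, dirty_bw = 2 * 100 = 200MB/s; if
write_bw = 80MB/s, then ref_bw = 100 * 80 / 200 = 40MB/s, which is
indeed write_bw / N.)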

> If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> will match. So ref_bw is the balanced dirty rate.

Which does lead to the question why it's not called that instead ;-)

> In practice, the ref_bw calculated by (2) may fluctuate and have
> estimation errors. So the bdi->dirty_ratelimit update policy is to
> follow it only when both pos_bw and ref_bw point to the same direction
> (indicating not only the dirty position has deviated from the global/bdi
> setpoints, but also it's still departing away).

Which is where you introduce the need for pos_bw, yet you have not yet
explained its meaning. In this explanation you allude to it being the
speed (first time derivative) of the deviation from the setpoint.

The set point's measure is in pages, so the measure of its first time
derivative would indeed be pages/s, just like bandwidth, but calling it
a bandwidth seems highly confusing indeed.

I would also like a few more words on your update condition: why you
picked it, and what its full ramifications are.

Also missing in this story is your pos_ratio thing: it is used in the
code, but there is no explanation of how it ties in with the above.


You seem very skilled in control systems (your earlier read-ahead work
was also a very complex system), but the explanations of your systems
are highly confusing. Can you go back to the roots and explain how you
constructed your model and why you did so? (without using graphs please)


PS. I'm not criticizing your work, the results are impressive (as
always), but I find it very hard to understand. 

PPS. If it would help, feel free to refer me to educational material on
control system theory, either online or in books.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 15:50     ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09 15:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Maintain bdi->dirty_ratelimit, the base throttle bandwidth.
> + *
> + * Normal bdi tasks will be curbed at or below it in long term.
> + * Obviously it should be around (write_bw / N) when there are N dd tasks.
> + */

Hi Fengguang,

So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate
limit (based on the position ratio, dirty_bw and write_bw). But this
seems to be an overall bdi limit and does not seem to take into account
the number of tasks doing IO to that bdi (as your comment suggests). So
it probably will track write_bw as opposed to write_bw/N. What am
I missing?
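
(Restating the derivation from the patch description, for reference:
with N aggressive dirtiers each throttled at pos_bw, dirty_bw == N *
pos_bw, so ref_bw = pos_bw * write_bw / dirty_bw == write_bw / N. It
is just not obvious from the code alone where N enters.)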

Thanks
Vivek


> +static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> +				       unsigned long thresh,
> +				       unsigned long dirty,
> +				       unsigned long bdi_thresh,
> +				       unsigned long bdi_dirty,
> +				       unsigned long dirtied,
> +				       unsigned long elapsed)
> +{
> +	unsigned long bw = bdi->dirty_ratelimit;
> +	unsigned long dirty_bw;
> +	unsigned long pos_bw;
> +	unsigned long ref_bw;
> +	unsigned long long pos_ratio;
> +
> +	/*
> +	 * The dirty rate will match the writeback rate in long term, except
> +	 * when dirty pages are truncated by userspace or re-dirtied by FS.
> +	 */
> +	dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> +
> +	pos_ratio = bdi_position_ratio(bdi, thresh, dirty,
> +				       bdi_thresh, bdi_dirty);
> +	/*
> +	 * pos_bw reflects each dd's dirty rate enforced for the past 200ms.
> +	 */
> +	pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +	pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> +
> +	/*
> +	 * ref_bw = pos_bw * write_bw / dirty_bw
> +	 *
> +	 * It's a linear estimation of the "balanced" throttle bandwidth.
> +	 */
> +	pos_ratio *= bdi->avg_write_bandwidth;
> +	do_div(pos_ratio, dirty_bw | 1);
> +	ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +
> +	/*
> +	 * dirty_ratelimit will follow ref_bw/pos_bw conservatively iff they
> +	 * are on the same side of dirty_ratelimit. Which not only makes it
> +	 * more stable, but also is essential for preventing it being driven
> +	 * away by possible systematic errors in ref_bw.
> +	 */
> +	if (pos_bw < bw) {
> +		if (ref_bw < bw)
> +			bw = max(ref_bw, pos_bw);
> +	} else {
> +		if (ref_bw > bw)
> +			bw = min(ref_bw, pos_bw);
> +	}
> +
> +	bdi->dirty_ratelimit = bw;
> +}
> +
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
>  			    unsigned long dirty,
> @@ -745,6 +805,7 @@ void __bdi_update_bandwidth(struct backi
>  {
>  	unsigned long now = jiffies;
>  	unsigned long elapsed = now - bdi->bw_time_stamp;
> +	unsigned long dirtied;
>  	unsigned long written;
>  
>  	/*
> @@ -753,6 +814,7 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed < BANDWIDTH_INTERVAL)
>  		return;
>  
> +	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>  	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>  
>  	/*
> @@ -762,12 +824,15 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>  
> -	if (thresh)
> +	if (thresh) {
>  		global_update_bandwidth(thresh, dirty, now);
> -
> +		bdi_update_dirty_ratelimit(bdi, thresh, dirty, bdi_thresh,
> +					   bdi_dirty, dirtied, elapsed);
> +	}
>  	bdi_update_write_bandwidth(bdi, elapsed, written);
>  
>  snapshot:
> +	bdi->dirtied_stamp = dirtied;
>  	bdi->written_stamp = written;
>  	bdi->bw_time_stamp = now;
>  }
> 

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 15:50     ` Vivek Goyal
@ 2011-08-09 16:16       ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> 
> So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate
> limit (based on position ratio, dirty_bw and write_bw). But this seems
> to be an overall bdi limit and does not seem to take into account the
> number of tasks doing IO to that bdi (as your comment suggests). So
> it probably will track write_bw as opposed to write_bw/N. What am
> I missing?

I think the per task thing comes from him using the pages_dirtied
argument to balance_dirty_pages() to compute the sleep time. Although
I'm not quite sure how he keeps fairness in light of the sleep time
bounding to MAX_PAUSE.
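
Something like the below, I think (rough sketch with invented names,
not the exact patch code):

  /*
   * Each task sleeps long enough that its own dirty rate matches the
   * per-task ratelimit: it dirtied pages_dirtied pages, and at
   * task_ratelimit pages/s that "costs" this much time.
   */
  pause = pages_dirtied * HZ / task_ratelimit;  /* in jiffies */
  __set_current_state(TASK_UNINTERRUPTIBLE);
  io_schedule_timeout(pause);

With N tasks all throttled to the same task_ratelimit, they
collectively dirty at N * task_ratelimit, and the bdi feedback then
adjusts that ratelimit until the aggregate matches write_bw; so each
task ends up near write_bw / N without N ever being computed
explicitly.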



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:16       ` Peter Zijlstra
@ 2011-08-09 16:19         ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate
> > limit (based on position ratio, dirty_bw and write_bw). But this seems
> > to be an overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests). So
> > it probably will track write_bw as opposed to write_bw/N. What am
> > I missing?
> 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time. Although
> I'm not quite sure how he keeps fairness in light of the sleep time
> bounding to MAX_PAUSE.

Furthermore, there's of course the issue that current->nr_dirtied is
computed over all BDIs it dirtied pages from, and the sleep time is
computed for the BDI it happened to do the overflowing write on.

Assuming a task (mostly) writes to a single bdi, or equally to all, it
should all work out.
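
(Concretely: a task that dirties pages on both a fast and a slow bdi
will, whenever it happens to trip its threshold during a write to the
slow one, get charged for *all* of its recently dirtied pages at the
slow bdi's rate; overthrottled there, underthrottled on the fast one,
and the two errors only roughly cancel out over time.)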



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 16:56     ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 16:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
>              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;

I can't actually find this low-pass filter in the code.. could be I'm
blind from staring at it too long though..
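
For reference, that line would be a standard first-order low-pass; as
standalone C it is something like this (a sketch, not from the patch):

  /* move the estimate 1/4 of the way toward its target per update */
  static unsigned long lowpass(unsigned long cur, unsigned long target)
  {
          return (cur * 3 + target) / 4;
  }

  /* i.e.  bdi->dirty_ratelimit = lowpass(bw, ref_bw); */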

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 17:02     ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:02 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:

> +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> +       pos_bw++;  /* this avoids bdi->dirty_ratelimit getting stuck at 0 */
> +

> +       pos_ratio *= bdi->avg_write_bandwidth;
> +       do_div(pos_ratio, dirty_bw | 1);
> +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 

when written out that results in:

           bw * pos_ratio * bdi->avg_write_bandwidth
  ref_bw = -----------------------------------------
                         dirty_bw

which would suggest you write it like:

  ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);

since pos_bw is already bw * pos_ratio per the above.

Or am I missing something?
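
A quick userspace check with made-up numbers suggests the two forms
agree up to integer rounding (modulo the pos_bw++, which the suggested
form would fold in as well):

  #include <assert.h>
  #include <stdint.h>

  #define SHIFT 10  /* stand-in for BANDWIDTH_CALC_SHIFT */

  int main(void)
  {
          uint64_t bw = 100, write_bw = 80, dirty_bw = 159;
          uint64_t pos_ratio = 1 << (SHIFT - 1);          /* 0.5, fixed point */
          uint64_t pos_bw = bw * pos_ratio >> SHIFT;      /* 50 */

          /* patch form: fold write_bw/dirty_bw into pos_ratio first */
          uint64_t r1 = bw * (pos_ratio * write_bw / (dirty_bw | 1)) >> SHIFT;
          /* suggested form: scale pos_bw directly */
          uint64_t r2 = pos_bw * write_bw / (dirty_bw | 1);

          assert(r1 == 25 && r2 == 25);
          return 0;
  }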

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
@ 2011-08-09 17:20             ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> >                     origin - dirty
> >         pos_ratio = --------------
> >                     origin - goal 
> 
> > which comes from the below [*] control line, so that when (dirty == goal),
> > pos_ratio == 1.0:
> 
> OK, so basically you want a linear function for which:
> 
> f(goal) = 1 and has a root somewhere > goal.
> 
> (that one line is much more informative than all your graphs put
> together, one can start from there and derive your function)
> 
> That does indeed get you the above function, now what does it mean? 

So going by:

                                         write_bw
  ref_bw = dirty_ratelimit * pos_ratio * --------
                                         dirty_bw

pos_ratio seems to be the feedback on the deviation of the dirty pages
around its setpoint. So we adjust the reference bw (or rather ratelimit)
to take account of the shift in output vs input capacity as well as the
shift in dirty pages around its setpoint.

From that we derive the condition that:

  pos_ratio(setpoint) := 1

Now in order to create a linear function we need one more condition. We
get one from the fact that once we hit the limit we should hard throttle
our writers. We get that by setting the ratelimit to 0, because, after
all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:

  pos_ratio(limit) := 0

Using these two conditions we can solve the equations and get your:

                        limit - dirty
  pos_ratio(dirty) =  ----------------
                      limit - setpoint
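
In code that would be something like this (illustrative only, invented
name, floating point where the kernel uses fixed point):

  /* the control line: f(setpoint) = 1, f(limit) = 0, linear in between */
  static double pos_ratio_line(double dirty, double setpoint, double limit)
  {
          return (limit - dirty) / (limit - setpoint);
  }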

Now, for some reason you chose not to use limit, but something like
min(limit, 4*thresh); something to do with the slope affecting the rate
of adjustment. This wants a comment someplace.


Now all of the above would seem to suggest:

  dirty_ratelimit := ref_bw

However for that you use:

  if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
	dirty_ratelimit = max(ref_bw, pos_bw);

  if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
	dirty_ratelimit = min(ref_bw, pos_bw);

You have:

  pos_bw = dirty_ratelimit * pos_ratio

Which is ref_bw without the write_bw/dirty_bw factor; this confuses me..
why are you ignoring the shift in output vs input rate there?




^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 17:46     ` Vivek Goyal
  0 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09 17:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote:

[..]
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
>   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a
>  	if (clear_dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> +	current->nr_dirtied = 0;
> +	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
> +
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page 
>  	}
>  }
>  
> -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> -
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
>   * @mapping: address_space which was dirtied
> @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long ratelimit;
> -	unsigned long *p;
>  
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
>  
> -	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> +	ratelimit = current->nr_dirtied_pause;
> +	if (bdi->dirty_exceeded)
>  		ratelimit = 8;

Should we make sure that ratelimit is more than 8 first? It could be that
ratelimit is 1 and we would be setting it higher (just the reverse of what
we wanted).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 18:15     ` Vivek Goyal
  0 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09 18:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:

[..]
> -		trace_balance_dirty_start(bdi);
> -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> -			pages_written += writeback_inodes_wb(&bdi->wb,
> -							     write_chunk);
> -			trace_balance_dirty_written(bdi, pages_written);
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
> +		if (unlikely(!writeback_in_progress(bdi)))
> +			bdi_start_background_writeback(bdi);
> +
> +		base_bw = bdi->dirty_ratelimit;
> +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> +					bdi_thresh, bdi_dirty);

For the sake of consistency in variable naming, how about using

pos_ratio = bdi_position_ratio()?

> +		if (unlikely(bw == 0)) {
> +			pause = MAX_PAUSE;
> +			goto pause;
>  		}
> +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;

So far bw held pos_ratio as its value; now it gets replaced with an actual
bandwidth value. That makes the code confusing, so using pos_ratio will
help.

		bw = (u64)base_bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 18:35     ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 18:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> Add two fields to task_struct.
> 
> 1) account dirtied pages in the individual tasks, for accuracy
> 2) per-task balance_dirty_pages() call intervals, for flexibility
> 
> The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> scale near-sqrt to the safety gap between dirty pages and threshold.
> 
> XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> dirtying pages at exactly the same time, each task will be assigned a
> large initial nr_dirtied_pause, so that the dirty threshold will be
> exceeded long before each task reached its nr_dirtied_pause and hence
> call balance_dirty_pages(). 

Right, so why remove the per-cpu threshold? you can keep that as a bound
on the number of out-standing dirty pages.

Losing that bound is actually a bad thing (TM), since with a tight
configured dirty limit you could lock up your machine this way.
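
For reference, the per-cpu bound in question is roughly of this shape
(reconstructed from the declarations the patch deletes; details are
from memory, so treat it as a sketch):

  static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;

  preempt_disable();
  p = &__get_cpu_var(bdp_ratelimits);
  *p += nr_pages_dirtied;
  if (unlikely(*p >= ratelimit)) {
          *p = 0;         /* this CPU hit its bound: charge the pages */
          preempt_enable();
          balance_dirty_pages(mapping, ratelimit);
          return;
  }
  preempt_enable();

so no single CPU can accumulate more than ratelimit dirtied pages
without passing through balance_dirty_pages().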

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-09 18:15     ` Vivek Goyal
@ 2011-08-09 18:41       ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-09 18:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 14:15 -0400, Vivek Goyal wrote:
> 
> So far bw held pos_ratio as its value; now it gets replaced with an actual
> bandwidth value. That makes the code confusing, so using pos_ratio will
> help.

Agreed on consistency, also I'm not sure bandwidth is the right term
here to begin with, it's a pages/s unit and I think rate would be better
here. But whatever ;-)

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09 19:16     ` Vivek Goyal
  0 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-09 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:

[..]
> -/*
> - * task_dirty_limit - scale down dirty throttling threshold for one task
> - *
> - * task specific dirty limit:
> - *
> - *   dirty -= (dirty/8) * p_{t}
> - *
> - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> - * throttling individual tasks before reaching the bdi dirty limit.
> - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> - * dirty threshold may never get throttled.
> - */

Hi Fengguang,

So we have got rid of the notion of a per-task dirty limit based on each
task's dirty fraction? What replaces it?

I can't see any code which replaces it. If it is indeed gone, I am
wondering how you get fairness among tasks which share this bdi.

Also wondering what this patch series does to make sure that tasks
share the bdi more fairly and each gets write_bw/N bandwidth.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-09 18:41       ` Peter Zijlstra
@ 2011-08-10  3:22         ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10  3:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 02:41:05AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 14:15 -0400, Vivek Goyal wrote:
> > 
> > So far bw held pos_ratio as its value; now it gets replaced with an actual
> > bandwidth value. That makes the code confusing, so using pos_ratio will
> > help.
> 
> Agreed on consistency, also I'm not sure bandwidth is the right term
> here to begin with, it's a pages/s unit and I think rate would be better
> here. But whatever ;-)

Good idea, I'll switch to the name "rate".

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-09 18:15     ` Vivek Goyal
@ 2011-08-10  3:26       ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10  3:26 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Wed, Aug 10, 2011 at 02:15:43AM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> 
> [..]
> > -		trace_balance_dirty_start(bdi);
> > -		if (bdi_nr_reclaimable > task_bdi_thresh) {
> > -			pages_written += writeback_inodes_wb(&bdi->wb,
> > -							     write_chunk);
> > -			trace_balance_dirty_written(bdi, pages_written);
> > -			if (pages_written >= write_chunk)
> > -				break;		/* We've done our duty */
> > +		if (unlikely(!writeback_in_progress(bdi)))
> > +			bdi_start_background_writeback(bdi);
> > +
> > +		base_bw = bdi->dirty_ratelimit;
> > +		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
> > +					bdi_thresh, bdi_dirty);
> 
> For the sake of consistency in variable naming, how about using
> 
> pos_ratio = bdi_position_ratio()?

OK!

> > +		if (unlikely(bw == 0)) {
> > +			pause = MAX_PAUSE;
> > +			goto pause;
> >  		}
> > +		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
> 
> So far bw held pos_ratio as its value; now it gets replaced with an actual
> bandwidth value. That makes the code confusing, so using pos_ratio will
> help.
> 
> 		bw = (u64)base_bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;

Yeah, it makes good sense. I'll change it to:

 		rate = (u64)base_rate * pos_ratio >> BANDWIDTH_CALC_SHIFT;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-09 17:46     ` Vivek Goyal
@ 2011-08-10  3:29       ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10  3:29 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Wed, Aug 10, 2011 at 01:46:21AM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:51PM +0800, Wu Fengguang wrote:
> 
> [..]
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> >   * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> > @@ -1008,6 +1005,9 @@ static void balance_dirty_pages(struct a
> >  	if (clear_dirty_exceeded && bdi->dirty_exceeded)
> >  		bdi->dirty_exceeded = 0;
> >  
> > +	current->nr_dirtied = 0;
> > +	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
> > +
> >  	if (writeback_in_progress(bdi))
> >  		return;
> >  
> > @@ -1034,8 +1034,6 @@ void set_page_dirty_balance(struct page 
> >  	}
> >  }
> >  
> > -static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
> > -
> >  /**
> >   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> >   * @mapping: address_space which was dirtied
> > @@ -1055,30 +1053,17 @@ void balance_dirty_pages_ratelimited_nr(
> >  {
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  	unsigned long ratelimit;
> > -	unsigned long *p;
> >  
> >  	if (!bdi_cap_account_dirty(bdi))
> >  		return;
> >  
> > -	ratelimit = ratelimit_pages;
> > -	if (mapping->backing_dev_info->dirty_exceeded)
> > +	ratelimit = current->nr_dirtied_pause;
> > +	if (bdi->dirty_exceeded)
> >  		ratelimit = 8;
> 
> Should we make sure that ratelimit is more than 8 first? It could be that
> ratelimit is 1 and we would be setting it higher (just the reverse of what
> we wanted).

Good catch! I actually just fixed it in that direction :)

        if (bdi->dirty_exceeded)
-               ratelimit = 8;
+               ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
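
(With 4k pages PAGE_SHIFT is 12, so the cap works out to 32 >> 2 == 8
pages, the same constant as before, but now applied with min() so an
already-small ratelimit is never raised, and written so the cap stays
at roughly 32kB worth of pages on other page sizes.)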

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-09 18:35     ` Peter Zijlstra
@ 2011-08-10  3:40       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > 
> > Add two fields to task_struct.
> > 
> > 1) account dirtied pages in the individual tasks, for accuracy
> > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > 
> > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > scale near-sqrt to the safety gap between dirty pages and threshold.
> > 
> > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > dirtying pages at exactly the same time, each task will be assigned a
> > large initial nr_dirtied_pause, so that the dirty threshold will be
> > exceeded long before each task reached its nr_dirtied_pause and hence
> > call balance_dirty_pages(). 
> 
> Right, so why remove the per-cpu threshold? You can keep that as a bound
> on the number of out-standing dirty pages.

Right, I also have the vague feeling that the per-cpu threshold can
somehow backup the per-task threshold in case there are too many tasks.

> Losing that bound is actually a bad thing (TM), since you could have
> configured a tight dirty limit and lock up your machine this way.

It seems good enough to only remove the 4MB upper limit for
ratelimit_pages, so that the per-cpu limit won't kick in too
frequently in typical machines.

  * Here we set ratelimit_pages to a level which ensures that when all CPUs are
  * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
  * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
 void writeback_set_ratelimit(void)
 {
        ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
        if (ratelimit_pages < 16)
                ratelimit_pages = 16;
-       if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-               ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
 }
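
To put numbers on it, a stand-alone sketch (not kernel code) of the
uncapped value for a hypothetical 4GB, 4-CPU machine with 4KB pages:

	#include <stdio.h>

	int main(void)
	{
		unsigned long vm_total_pages = 4UL << 18;	/* 4GB / 4KB */
		int num_online_cpus = 4;
		unsigned long ratelimit_pages;

		ratelimit_pages = vm_total_pages / (num_online_cpus * 32);
		if (ratelimit_pages < 16)
			ratelimit_pages = 16;
		/* 8192 pages = 32MB per CPU; the removed cap would have
		 * clamped this down to 1024 pages (4MB) */
		printf("%lu pages (%lu MB)\n", ratelimit_pages,
		       ratelimit_pages >> 8);
		return 0;
	}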

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 5/5] writeback: IO-less balance_dirty_pages()
  2011-08-09 19:16     ` Vivek Goyal
  (?)
@ 2011-08-10  4:33     ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10  4:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

[-- Attachment #1: Type: text/plain, Size: 3749 bytes --]

On Wed, Aug 10, 2011 at 03:16:22AM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:52PM +0800, Wu Fengguang wrote:
> 
> [..]
> > -/*
> > - * task_dirty_limit - scale down dirty throttling threshold for one task
> > - *
> > - * task specific dirty limit:
> > - *
> > - *   dirty -= (dirty/8) * p_{t}
> > - *
> > - * To protect light/slow dirtying tasks from heavier/fast ones, we start
> > - * throttling individual tasks before reaching the bdi dirty limit.
> > - * Relatively low thresholds will be allocated to heavy dirtiers. So when
> > - * dirty pages grow large, heavy dirtiers will be throttled first, which will
> > - * effectively curb the growth of dirty pages. Light dirtiers with high enough
> > - * dirty threshold may never get throttled.
> > - */
> 
> Hi Fengguang,
> 
> So we have got rid of the notion of per task dirty limit based on their
> fraction? What replaces it?

It's simply removed :)

> I can't see any code which is replacing it.

The think time compensation feature (patch attached) will provide
the same protection for light/slow dirtiers. With it, the slower
dirtiers won't be throttled at all, because the pause time calculated
by

        period = pages_dirtied / rate
        pause = period - think

will be <= 0.

For example, given write_bw = 100MB/s and

- 2 dd tasks that dirty pages as fast as possible
- 1 scp whose dirty rate is limited by network bandwidth 10MB/s

Then with think time compensation, the real dirty rates will be

- 2 dd tasks: (100-10)/2 = 45MB/s (each)
- 1 scp task: 10MB/s

The scp task won't be throttled by balance_dirty_pages() any more.
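
To make that concrete, a stand-alone sketch (not kernel code) with
illustrative numbers: the scp task dirties at 10MB/s while its computed
throttle rate is ~45MB/s, so period < think and the pause comes out negative:

	#include <stdio.h>

	int main(void)
	{
		double pages = 1024;			/* dirtied since last call */
		double rate = 45 * 256;			/* ~45MB/s in 4KB pages/s */
		double dirty_bw = 10 * 256;		/* scp's own 10MB/s */
		double period = pages / rate;		/* ~0.089s */
		double think = pages / dirty_bw;	/* ~0.4s to dirty them */

		printf("pause = %.3fs\n", period - think);	/* < 0: no sleep */
		return 0;
	}
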
This is a tested feature. In the graph below, the dirty rates (the
slopes of the lines) of the last 3 tasks are 2, 4 and 8 MB/s

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/balance_dirty_pages-task-bw.png

given this fio workload, which started one full-speed dirtier and
four rate-limited dirtiers at 1, 2, 4 and 8 MB/s

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/RATES-2-4-8/btrfs-fio-rates-128k-8p-2975M-2.6.38-rc6-dt6+-2011-03-01-20-45/fio-rates

> If yes, I am wondering how
> do you get fairness among tasks which share this bdi.
> 
> Also wondering what did this patch series to do make sure that tasks
> share bdi more fairly and get write_bw/N bandwidth.

Each of the N dd tasks will be rate limited by

        rate = base_rate * pos_ratio

At any snapshot in time, each bdi task will see almost the same base_rate
and pos_ratio, so all will be throttled at almost the same rate. This is a
strong guarantee of fairness in all situations.

Since pos_ratio is fluctuating (evenly) around 1.0, and
base_rate=bdi->dirty_ratelimit is fluctuating around (write_bw/N),
on average we get

        avg_rate = (write_bw/N) * 1.0

(I'll explain the "dirty_ratelimit = write_bw/N" magic in other emails.)
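
As a toy sketch (not kernel code, made-up numbers) of that claim:

	#include <stdio.h>

	int main(void)
	{
		double write_bw = 100.0;	/* MB/s: bdi cleaning rate */
		int N = 10;			/* dd tasks sharing the bdi */
		double base_rate = write_bw / N;/* dirty_ratelimit hovers here */
		double pos_ratio = 1.05;	/* fluctuates evenly around 1.0 */

		/* every task sees the same product at any snapshot */
		printf("task rate = %.1f MB/s, long term avg = %.1f MB/s\n",
		       base_rate * pos_ratio, base_rate);
		return 0;
	}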

The graphs below demonstrate the dirty progress of the last 3 dd tasks.
The slope of each curve is the dirty rate.

They vividly show three curves progressing at the same pace in all of
the 3 stages

- rampup stage (20-100s) 

- disturbed stage (120s-160s)
  (disturbed by starting a 1GB read dd in the middle of the tests)

- stable stage (after 160s)

And they dirtied almost the same number of pages during the test.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/xfs-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:26/balance_dirty_pages-task-bw.png

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/2G/xfs-10dd-4k-8p-1947M-20:10-3.0.0-next-20110802+-2011-08-06.15:49/balance_dirty_pages-task-bw.png

Thanks,
Fengguang

[-- Attachment #2: think-time-compensation --]
[-- Type: text/plain, Size: 5083 bytes --]

Subject: writeback: dirty ratelimit - think time compensation
Date: Sat Jun 11 19:25:42 CST 2011

Compensate the task's think time when computing the final pause time,
so that ->dirty_ratelimit can be enforced accurately.

In the rare case that the task slept longer than the period time (resulting
in a negative pause time), the extra sleep time will be compensated for in
the next period if it's not too big (<500ms).

Accumulated errors are carefully avoided as long as the max pause area
is not hit.

Pseudo code:

	period = pages_dirtied / bw;
	think = jiffies - dirty_paused_when;
	pause = period - think;

case 1: period > think

                pause = period - think
                dirty_paused_when += pause

                             period time
              |======================================>|
                  think time
              |===============>|
        ------|----------------|----------------------|-----------
        dirty_paused_when   jiffies


case 2: period <= think

                don't pause; reduce future pause time by:
                dirty_paused_when += period

                       period time
              |=========================>|
                             think time
              |======================================>|
        ------|--------------------------+------------|-----------
        dirty_paused_when                          jiffies
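
A stand-alone toy model of the two cases (not the patch itself; HZ and
the -HZ reset threshold mirror the hunk below):

	#include <stdio.h>

	#define HZ	1000

	static long dirty_paused_when;	/* virtual start of current period */

	/* returns jiffies to sleep; 0 means "don't pause" (case 2) */
	static long compute_pause(long now, long period)
	{
		long pause = dirty_paused_when + period - now;

		if (pause <= 0) {
			if (pause < -HZ)	/* slept far too long: reset */
				dirty_paused_when = now;
			else			/* compensate in next period */
				dirty_paused_when += period;
			return 0;
		}
		dirty_paused_when = now + pause;	/* case 1 */
		return pause;
	}

	int main(void)
	{
		printf("%ld\n", compute_pause(100, 250));	/* case 1: 150 */
		printf("%ld\n", compute_pause(600, 100));	/* case 2: 0 */
		return 0;
	}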

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    1 +
 kernel/fork.c         |    1 +
 mm/page-writeback.c   |   34 +++++++++++++++++++++++++++++++---
 3 files changed, 33 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-08-09 07:53:31.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-08-09 07:54:12.000000000 +0800
@@ -1531,6 +1531,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c	2011-08-09 07:53:31.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-09 08:08:11.000000000 +0800
@@ -817,6 +817,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	long period;
 	long pause = 0;
 	bool dirty_exceeded = false;
 	unsigned long bw;
@@ -825,6 +826,8 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		unsigned long now = jiffies;
+
 		/*
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
@@ -842,8 +845,11 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+			current->dirty_paused_when = now;
+			current->nr_dirtied = 0;
 			break;
+		}
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
 
@@ -879,17 +885,40 @@ static void balance_dirty_pages(struct a
 		bw = bdi_position_ratio(bdi, dirty_thresh, nr_dirty,
 					bdi_thresh, bdi_dirty);
 		if (unlikely(bw == 0)) {
+			period = MAX_PAUSE;
 			pause = MAX_PAUSE;
 			goto pause;
 		}
 		bw = (u64)base_bw * bw >> BANDWIDTH_CALC_SHIFT;
-		pause = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
+		pause = current->dirty_paused_when + period - now;
+		/*
+		 * For less than 1s think time (ext3/4 may block the dirtier
+		 * for up to 800ms from time to time on 1-HDD; so does xfs,
+		 * however at much less frequency), try to compensate it in
+		 * future periods by updating the virtual time; otherwise just
+		 * do a reset, as it may be a light dirtier.
+		 */
+		if (unlikely(pause <= 0)) {
+			if (pause < -HZ) {
+				current->dirty_paused_when = now;
+				current->nr_dirtied = 0;
+			} else if (period) {
+				current->dirty_paused_when += period;
+				current->nr_dirtied = 0;
+			}
+			pause = 1; /* avoid resetting nr_dirtied_pause below */
+			break;
+		}
 		pause = min(pause, MAX_PAUSE);
 
 pause:
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
+		current->dirty_paused_when = now + pause;
+		current->nr_dirtied = 0;
+
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
 		 * max-pause area. If dirty exceeded but still within this
@@ -916,7 +945,6 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied = 0;
 	current->nr_dirtied_pause = ratelimit_pages(nr_dirty, dirty_thresh);
 
 	if (writeback_in_progress(bdi))
--- linux-next.orig/kernel/fork.c	2011-08-09 07:53:31.000000000 +0800
+++ linux-next/kernel/fork.c	2011-08-09 07:54:12.000000000 +0800
@@ -1303,6 +1303,7 @@ static struct task_struct *copy_process(
 
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+	p->dirty_paused_when = 0;
 
 	/*
 	 * Ok, make it visible to the rest of the system.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-09 14:04       ` Vivek Goyal
  (?)
@ 2011-08-10  7:41         ` Greg Thelen
  -1 siblings, 0 replies; 305+ messages in thread
From: Greg Thelen @ 2011-08-10  7:41 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Minchan Kim, Wu Fengguang, Dave Chinner, Christoph Hellwig, LKML,
	Andrea Righi, Andrew Morton, linux-fsdevel, linux-mm, Jan Kara,
	KAMEZAWA Hiroyuki

On Aug 9, 2011 7:04 AM, "Vivek Goyal" <vgoyal@redhat.com> wrote:
>
> On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote:
> > On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote:
> > > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> > > > Hi all,
> > > >
> > > > The _core_ bits of the IO-less balance_dirty_pages().
> > > > Heavily simplified and re-commented to make it easier to review.
> > > >
> > > >   git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v8
> > > >
> > > > Only the bare minimal algorithms are presented, so you will find some rough
> > > > edges in the graphs below. But it's usable :)
> > > >
> > > >   http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/
> > > >
> > > > And an introduction to the (more complete) algorithms:
> > > >
> > > >   http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/slides/smooth-dirty-throttling.pdf
> > > >
> > > > Questions and reviews are highly appreciated!
> > >
> > > Hi Wu,
> > >
> > > I am going through the slide number 39 where you talk about it being
> > > future proof and it can be used for IO control purposes. You have listed
> > > following merits of this approach.
> > >
> > > * per-bdi nature, works on NFS and Software RAID
> > > * no delayed response (working at the right layer)
> > > * no page tracking, hence decoupled from memcg
> > > * no interactions with FS and CFQ
> > > * get proportional IO controller for free
> > > * reuse/inherit all the base facilities/functions
> > >
> > > I would say that it will also be a good idea to list the demerits of
> > > this approach in current form and that is that it only deals with
> > > controlling buffered write IO and nothing else.
> >
> > That's not a demerit - that is all it is designed to do.
>
> It is designed to improve the existing task throttling functionality and
> we are trying to extend the same to cgroups too. So if by design something
> does not gel well with existing pieces, it is a demerit to me. At least
> there should be a good explanation of the design intention and how it is
> going to be useful.
>
> For example, how is this going to gel with the existing IO controller?
> Are you going to create two separate mechanisms: one for control of
> writes while entering the cache and the other for controlling the writes
> at device level?
>
> The fact that this mechanism does not know about any other IO in the
> system/cgroup is a limiting factor. From a usability point of view, a
> user expects the policy to cover any kind of IO happening from a group.
>
> So are we planning to create a new controller? Or add additional files
> in the existing controller to control the per cgroup write throttling
> behavior? Even if we create additional files, a user is then
> forced to put separate write policies for buffered writes and direct
> writes. I was hoping a better interface would be one where the user puts a
> policy on writes and that takes effect, and the user does not have to
> worry whether the applications inside the cgroup are doing buffered
> writes or direct writes.
>
> >
> > > So on the same block device, other direct writes might be going on
> > > from same group and in this scheme a user will not have any
> > > control.
> >
> > But it is taken into account by the IO write throttling.
>
> You mean blkio controller?
>
> It does. But my complaint is that we are trying to control two separate
> knobs for two kinds of IO, and I am trying to come up with a single
> knob.
>
> The current interface for write control in the blkio controller looks like this:
>
> blkio.throttle.write_bps_device
>
> One can write to this file specifying the write limit of a cgroup
> on a particular device. I was hoping that buffered write limits
> would come out of the same limit, but with these patches it looks like we
> shall have to create a new interface altogether which just controls
> buffered writes and nothing else, and the user is supposed to know what
> his application is doing and try to configure the limits accordingly.
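
(For reference, that knob takes "major:minor bytes_per_second"; a sketch
with a made-up cgroup mount point and device number 8:16:)

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical path; limit WRITEs to 10MB/s at device level */
		const char *knob =
		    "/cgroup/blkio/grp1/blkio.throttle.write_bps_device";
		const char *val = "8:16 10485760\n";
		int fd = open(knob, O_WRONLY);

		if (fd < 0 || write(fd, val, strlen(val)) < 0)
			perror(knob);
		if (fd >= 0)
			close(fd);
		return 0;
	}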
>
> So my concern is how the overall interface would look, how well it
> will work with the existing controller, and how a user is
> supposed to use it.
>
> In fact the current IO controller does throttling at device level, so
> the interface is device specific. One is supposed to know the major
> and minor number of the device to specify. I am not sure in this
> case what one is supposed to do, as it is bdi specific and in the
> NFS case there is no device. So is one supposed to specify a bdi, or
> are limits going to be global (system wide, independent of bdi
> or block device)?
>
> >
> > > Another disadvantage is that throttling at page cache
> > > level does not take care of IO spikes at device level.
> >
> > And that is handled as well.
> >
> > How? By the indirect effect other IO and IO spikes have on the
> > writeback rate. That is, other IO reduces the writeback bandwidth,
> > which then changes the throttling parameters via feedback loops.
>
> Actually I was referring to the effect of buffered writes on other IO
> going on at the device. With control at device level, one can
> tightly control the WRITEs flowing out of a cgroup to a LUN, and that
> can help a bit in knowing how bad it will be for other reads going on
> the LUN.
>
> With this scheme, flusher threads can suddenly throw tons of writes
> at the LUN and then issue no IO for another few seconds. So basically IO is
> bursty at device level, and doing control at device level can make
> it smoother.
>
> So we have two ways to control buffered writes.
>
> - Throttle them while entering the page cache
> - Throttle them at the device, and the feedback loop in turn throttles them
>  at page cache level based on the dirty ratio.
>
> Andrea and I had implemented the first approach (the same as what Wu is
> suggesting now, with a different mechanism) and the following was your
> response.
>
> https://lkml.org/lkml/2011/6/28/494
>
> To me it looked like at that point in time you preferred precise
> throttling at device level, and now you seem to prefer precise throttling
> at page cache level?
>
> Again, I am not against cgroup parameter based throttling at page
> cache level. It simplifies the implementation and probably is good
> enough for lots of people. I am only worried about the interface
> and how it works with existing interfaces.
>
> In absolute throttling one does not have to care about feedback or
> what the underlying bdi bandwidth is. So to me these patches are
> good for work-conserving IO control, where we want to determine how
> fast we can write to the device and then throttle tasks accordingly. But
> in absolute throttling one specifies the upper limit, and there we
> don't need a mechanism to determine what the bdi bandwidth is or
> how many dirty pages there are in order to throttle tasks accordingly.
>
> >
> > The buffered write throttle is designed to reduce the page cache
> > dirtying rate to the current cleaning rate of the backing device.
> > Increase the cleaning rate (i.e. device is otherwise idle) and
> > it will throttle less. Decrease the cleaning rate (i.e. other IO
> > spikes or block IO throttle activates) and it will throttle more.
> >
> > We have to vary buffered write throttling like this to adapt to
> > changing IO workloads (e.g.  someone starting a read-heavy workload
> > will slow down writeback rate, so we need to throttle buffered
> > writes more aggressively), so it has to be independent of any sort
> > of block layer IO controller.
> >
> > Simply put: the block IO controller still has direct control over
> > the rate at which buffered writes drain out of the system. The
> > IO-less write throttle simply limits the rate at which buffered
> > writes come into the system to match whatever the IO path allows to
> > drain out....
>
> Ok, this makes sense. So it goes back to the previous design where
> absolute cgroup based control happens at device level and IO less
> throttle implements the feedback loop to slow down the writes into
> page cache. That makes sense. But Wu's slides suggest that one can
> directly implement cgroup based IO control in IO less throttling
> and that's where I have concerns.
>
> Anyway this stuff shall have to be made cgroup aware so that tasks
> of different groups can see different throttling depending on how
> much IO that group is able to do at device level.
>
> >
> > > Now I think one could probably come up with more sophisticated scheme
> > > where throttling is done at bdi level but is also accounted at device
> > > level at IO controller. (Something similar I had done in the past but
> > > Dave Chinner did not like it).
> >
> > I don't like it because it is a solution to a specific problem and
> > requires complex coupling across multiple layers of the system. We
> > are trying to move away from that throttling model. More
> > fundamentally, though, it is not a general solution to the
> > entire class of "IO writeback rate changed" problems that buffered
> > write throttling needs to solve.
> >
> > > Anyway, keeping track of per cgroup rate and throttling accordingly
> > > can definitely help implement an algorithm for per cgroup IO control.
> > > We probably just need to find a reasonable way to account all this
> > > IO to the end device so that we have control of all kinds of IO of a cgroup.
> > > How do you implement proportional control here? From the overall bdi
> > > bandwidth, vary per cgroup bandwidth regularly based on cgroup weight? Again the
> > > issue here is that it controls only buffered WRITES and nothing else and
> > > in this case co-ordinating with CFQ will probably be hard. So I guess
> > > usage of proportional IO just for buffered WRITES will have limited
> > > usage.
> >
> > The whole point of doing the throttling this way is that we don't
> > need any sort of special connection between block IO throttling and
> > page cache (buffered write) throttling. We significantly reduce the
> > coupling between the layers by relying on feedback-driven control
> > loops to determine the buffered write throttling thresholds
> > adaptively. IOWs, the IO-less write throttling at the page cache
> > will adjust automatically to whatever throughput the block IO
> > throttling allows async writes to achieve.
>
> This is good. But that's not the impression one gets from Wu's slides.
>
> >
> > However, before we have a "finished product", there is still another
> > piece of the puzzle to be put in place - memcg-aware buffered
> > writeback. That is, having a flusher thread do work on behalf of
> > memcg in the IO context of the memcg. Then the IO controller just
> > sees a stream of async writes in the context of the memcg the
> > buffered writes came from in the first place. The block layer
> > throttles them just like any other IO in the IO context of the
> > memcg...
>
> Yes, that is still a piece remaining. I was hoping that Greg Thelen would
> be able to extend his patches to submit writes in the context of
> per cgroup flusher/worker threads and solve this problem.
>
> Thanks
> Vivek

Are you suggesting multiple flushers per bdi (one per cgroup)?  I
thought the point of IO-less was to only issue buffered writes from a
single thread.

Note: I have rebased the memcg writeback code to the latest mmotm and am
testing it now.  These patches do not introduce additional threads;
the existing bdi flusher threads are used with an optional memcg
filter.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-10  3:40       ` Wu Fengguang
  (?)
@ 2011-08-10 10:25         ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-10 10:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-10 at 11:40 +0800, Wu Fengguang wrote:
> On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote:
> > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > > 
> > > Add two fields to task_struct.
> > > 
> > > 1) account dirtied pages in the individual tasks, for accuracy
> > > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > > 
> > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > > scale near-sqrt to the safety gap between dirty pages and threshold.
> > > 
> > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > > dirtying pages at exactly the same time, each task will be assigned a
> > > large initial nr_dirtied_pause, so that the dirty threshold will be
> > > exceeded long before each task reached its nr_dirtied_pause and hence
> > > call balance_dirty_pages(). 
> > 
> > Right, so why remove the per-cpu threshold? You can keep that as a bound
> > on the number of out-standing dirty pages.
> 
> Right, I also have the vague feeling that the per-cpu threshold can
> somehow backup the per-task threshold in case there are too many tasks.
> 
> > Losing that bound is actually a bad thing (TM), since you could have
> > configured a tight dirty limit and lock up your machine this way.
> 
> It seems good enough to only remove the 4MB upper limit for
> ratelimit_pages, so that the per-cpu limit won't kick in too
> frequently in typical machines.
> 
>   * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>   * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
>   * thresholds before writeback cuts in.
> - *
> - * But the limit should not be set too high.  Because it also controls the
> - * amount of memory which the balance_dirty_pages() caller has to write back.
> - * If this is too large then the caller will block on the IO queue all the
> - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> - * will write six megabyte chunks, max.
> - */
> -
>  void writeback_set_ratelimit(void)
>  {
>         ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
>         if (ratelimit_pages < 16)
>                 ratelimit_pages = 16;
> -       if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> -               ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
>  }

Uhm, so what's your bound then? 1/32 of the per-cpu memory seems rather
a lot.
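
For scale, a stand-alone sketch (not kernel code) of the uncapped
per-CPU slack on a made-up 64GB, 16-CPU machine with 4KB pages:

	#include <stdio.h>

	int main(void)
	{
		unsigned long vm_total_pages = 64UL << 18;	/* 64GB / 4KB */
		int cpus = 16;
		unsigned long per_cpu = vm_total_pages / (cpus * 32);

		/* dirty pages each CPU may accumulate between checks */
		printf("%lu pages = %lu MB per CPU (%lu MB machine-wide)\n",
		       per_cpu, per_cpu >> 8, (per_cpu * cpus) >> 8);
		return 0;
	}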

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
@ 2011-08-10 10:25         ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-10 10:25 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-10 at 11:40 +0800, Wu Fengguang wrote:
> On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote:
> > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > > 
> > > Add two fields to task_struct.
> > > 
> > > 1) account dirtied pages in the individual tasks, for accuracy
> > > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > > 
> > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > > scale near-sqrt to the safety gap between dirty pages and threshold.
> > > 
> > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > > dirtying pages at exactly the same time, each task will be assigned a
> > > large initial nr_dirtied_pause, so that the dirty threshold will be
> > > exceeded long before each task reached its nr_dirtied_pause and hence
> > > call balance_dirty_pages(). 
> > 
> > Right, so why remove the per-cpu threshold? you can keep that as a bound
> > on the number of out-standing dirty pages.
> 
> Right, I also have the vague feeling that the per-cpu threshold can
> somehow backup the per-task threshold in case there are too many tasks.
> 
> > Loosing that bound is actually a bad thing (TM), since you could have
> > configured a tight dirty limit and lock up your machine this way.
> 
> It seems good enough to only remove the 4MB upper limit for
> ratelimit_pages, so that the per-cpu limit won't kick in too
> frequently in typical machines.
> 
>   * Here we set ratelimit_pages to a level which ensures that when all CPUs are
>   * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
>   * thresholds before writeback cuts in.
> - *
> - * But the limit should not be set too high.  Because it also controls the
> - * amount of memory which the balance_dirty_pages() caller has to write back.
> - * If this is too large then the caller will block on the IO queue all the
> - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> - * will write six megabyte chunks, max.
> - */
> -
>  void writeback_set_ratelimit(void)
>  {
>         ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
>         if (ratelimit_pages < 16)
>                 ratelimit_pages = 16;
> -       if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> -               ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
>  }

Uhm, so what's your bound then? 1/32 of the per-cpu memory seems rather
a lot.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 14:57     ` Peter Zijlstra
@ 2011-08-10 11:07       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 10:57:32PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > 
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> > 
> > When started N dd, throttle each dd at
> > 
> >          task_ratelimit = pos_bw (any non-zero initial value is OK)
> 
> This is (0), since it makes (1). But it fails to explain what the
> difference is between task_ratelimit and pos_bw (and why positional
> bandwidth is a good name).

Yeah, it's (0), and it is another form of the formula used in
balance_dirty_pages():

        rate = bdi->dirty_ratelimit * pos_ratio

In fact the estimation of ref_bw can take a more general form, by
writing (0) as

        task_ratelimit = task_ratelimit_0

where task_ratelimit_0 is any non-zero value balance_dirty_pages()
uses to throttle the tasks during that 200ms.

> > After 200ms, we got
> > 
> >          dirty_bw = # of pages dirtied by app / 200ms
> >          write_bw = # of pages written to disk / 200ms
> 
> Right, so that I get. And our premise for the whole work is to delay
> applications so that we match the dirty_bw to the write_bw, right?

Right, the balance target is (dirty_bw == write_bw),
but let's rename dirty_bw to dirty_rate as you suggested.

> > For aggressive dirtiers, the equality holds
> > 
> >          dirty_bw == N * task_ratelimit
> >                   == N * pos_bw                         (1)
> 
> So dirty_bw is in pages/s, so task_ratelimit should also be in pages/s,
> since N is a unit-less number.

Right.

> What does task_ratelimit in pages/s mean? Since we make the tasks sleep,
> the only thing we can make from this is a measure of pages. So I expect
> (in a later patch) we compute the sleep time from the amount of pages we
> want written out, using this ratelimit measure, right?

Right. balance_dirty_pages() will use it this way (the variable name
used in the code is 'bw'; it will change to 'rate'):

        pause = (HZ * pages_dirtied) / task_ratelimit
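
A minimal sketch of that use (the clamping and the MAX_PAUSE bound are
assumptions here, not necessarily the exact patch code):

        /* task_ratelimit is in pages/s, so the pause comes out in jiffies */
        pause = HZ * pages_dirtied / task_ratelimit;
        pause = clamp_val(pause, 1, MAX_PAUSE);
        __set_current_state(TASK_UNINTERRUPTIBLE);
        io_schedule_timeout(pause);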

> > The balanced throttle bandwidth can be estimated by
> > 
> >          ref_bw = pos_bw * write_bw / dirty_bw          (2)
> 
> Here you introduce reference bandwidth; what does it mean and what is
> its relation to positional bandwidth? Going by the equation, we get
> (pages/s * pages/s) / (pages/s), so we indeed have a bandwidth unit.

Yeah. Or better do some renames:

          balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)    (2)

> write_bw/dirty_bw is the ratio between output and input of dirty pages,
> but what is pos_bw and what does that make ref_bw?

It's (bdi->dirty_ratelimit * pos_ratio), the effective dirty rate
balance_dirty_pages() used to limit each bdi task for the past 200ms.

For example, if (task_ratelimit_0 == write_bw), then the N dd tasks
will produce a bdi dirty rate of (dirty_rate = N * task_ratelimit_0),
and the balanced ratelimit will be

        balanced_rate
        = task_ratelimit_0 * (write_bw / (N * task_ratelimit_0))
        = write_bw / N

Thus within 200ms, we get the estimation of balanced_rate without
knowing N beforehand.
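
To make that concrete (numbers assumed): say write_bw = 80 MB/s, N = 4
and task_ratelimit_0 = 80 MB/s. The four dd's then dirty at
dirty_rate = 4 * 80 = 320 MB/s, and

        balanced_rate = 80 * (80 / 320) = 20 MB/s = write_bw / 4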

> > >From (1) and (2), we get equality
> > 
> >          ref_bw == write_bw / N                         (3)
> 
> Somehow this seems like the primary postulate, yet you present it like a
> derivation. The whole purpose of your control system is to provide this
> fairness between processes, therefore I would expect you to start out with
> this postulate and reason therefrom.

Good idea.

> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
> 
> Which does lead to the question why its not called that instead ;-)

Sure, changed to balanced_rate :-)

> > In practice, the ref_bw calculated by (2) may fluctuate and have
> > estimation errors. So the bdi->dirty_ratelimit update policy is to
> > follow it only when both pos_bw and ref_bw point to the same direction
> > (indicating not only the dirty position has deviated from the global/bdi
> > setpoints, but also it's still departing away).
> 
> Which is where you introduce the need for pos_bw, yet you have not yet
> explained its meaning. In this explanation you allude to it being the
> speed (first time derivative) of the deviation from the setpoint.

That's right.

> The set point's measure is in pages, so the measure of its first time
> derivative would indeed be pages/s, just like bandwidth, but calling it
> a bandwidth seems highly confusing indeed.

Yeah, I'll rename the relevant vars *bw to *rate.

> I would also like a few more words on your update condition: why did you
> pick those conditions, and what are their full ramifications?

OK.

> Also missing in this story is your pos_ratio thing, it is used in the
> code, but there is no explanation on how it ties in with the above
> things.

There are two control targets

(1) dirty setpoint
(2) dirty rate

pos_ratio does the position-based control for (1). It's not inherently
relevant to the computation of balanced_rate. I hope the rephrased text
below will make it easier to understand.

: When N dd tasks are started, we would like to throttle each dd at
: 
:          balanced_rate == write_bw / N                                  (1)
: 
: We don't know N beforehand, but can still estimate balanced_rate
: within 200ms.
: 
: Start by throttling each dd task at rate
: 
:         task_ratelimit = task_ratelimit_0                               (2)
:                          (any non-zero initial value is OK)
: 
: After 200ms, we got
: 
:         dirty_rate = # of pages dirtied by all dd's / 200ms
:         write_bw   = # of pages written to the disk / 200ms
: 
: For the aggressive dd dirtiers, the equality holds
: 
:         dirty_rate == N * task_rate
:                    == N * task_ratelimit
:                    == N * task_ratelimit_0                              (3)
: Or
:         task_ratelimit_0 = dirty_rate / N                               (4)
:                           
: So the balanced throttle bandwidth can be estimated by
:                           
:         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
:                           
: Because with (4) and (5) we can get the desired equality (1):
:                           
:         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
:                       == write_bw / N
:
: Since balance_dirty_pages() will be using
:        
:         task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()    (6)
: 
:        
: Taking (5) and (6), we get the real formula used in the code
:                                                                  
:         balanced_rate = bdi->dirty_ratelimit * bdi_position_ratio() * 
:                                 (write_bw / dirty_rate)                 (7)
: 
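
A minimal C sketch of (7), reusing the names from the posted patches
(BANDWIDTH_CALC_SHIFT, the "| 1" divide-by-zero guard); the rest is
illustrative only:

        /* the rate enforced for the past 200ms: dirty_ratelimit * pos_ratio */
        pos_rate = (u64)bdi->dirty_ratelimit * pos_ratio >> BANDWIDTH_CALC_SHIFT;
        /* scale by write_bw / dirty_rate per (7); "| 1" avoids div by zero */
        balanced_rate = div_u64((u64)pos_rate * write_bw, dirty_rate | 1);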

> You seem very skilled in control systems (your earlier read-ahead work
> was also a very complex system),

Thank you! In college I majored in "Pattern Recognition and Intelligent
Systems" and "Control theory and Control Engineering", which happen to be
the perfect preparation for read-ahead and dirty balancing :)

> but the explanations of your systems are highly confusing.

Sorry for that!

> Can you go back to the roots and explain how you constructed your
> model and why you did so? (without using graphs please)

As mentioned above, the root requirements are

(1) position target: to keep dirty pages around the bdi/global setpoints
(2) rate target:     to keep bdi dirty rate around bdi write bandwidth

In order to meet (2), we try to estimate (balanced_rate = write_bw / N)
and use it to throttle the N dd tasks.

However that's not enough. When the dirty rate perfectly matches the
write bandwidth, the dirty pages can stay stationary at any point.  We
want the dirty pages to stay around the setpoints as required by (1).

So if the dirty pages are ABOVE the setpoints, we throttle each task
a bit more HEAVILY than balanced_rate, so that the dirty pages are
created more slowly than they are cleaned and thus DROP back to the
setpoints (and the reverse). With that positional adjustment, the
formula is transformed from

        task_ratelimit = balanced_rate              => meets (2)

to

        task_ratelimit = balanced_rate * pos_ratio  => meets both (1),(2)
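
With assumed numbers: if the dirty pages are above the setpoint and
pos_ratio evaluates to 0.8, each task is limited to 80% of
balanced_rate, the aggregate dirty rate becomes 0.8 * write_bw, and
the disk cleans pages 25% faster than they are dirtied, draining the
excess back towards the setpoint.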

Finally, since the raw balanced_rate value may fluctuate a lot, we use
the more stable bdi->dirty_ratelimit, which tracks balanced_rate in a
conservative way, resulting in the final form

        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio()

> PS. I'm not criticizing your work, the results are impressive (as
> always), but I find it very hard to understand. 
> 
> PPS. If it would help, feel free to refer me to educational material on
> control system theory, either online or in books.

Fortunately no fancy control theory is used here ;) Only the simple
theory of negative feedback control is used, which states that there
will be overshoots and ringing if one tries to correct the errors way
too fast.

The overshooting concept is illustrated by the graph on the page below,
where the step input can be the sudden start of a dd reader that takes
away all the disk write bandwidth.

http://en.wikipedia.org/wiki/Step_response

In terms of negative feedback control theory, the
bdi_position_ratio() function (the control lines) can be expressed as

1) f(setpoint) = 1.0
2) df/dx < 0

3) optionally, abs(df/dx) should be large on large errors (= dirty -
   setpoint) in order to cancel the errors fast, and be smaller when
   dirty pages get closer to the setpoints in order to avoid overshooting.

The principle of (3) will be implemented in some follow up patches :)
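
As an illustration only (not the control line the follow-up patches
implement), the simplest curve satisfying 1) and 2) is linear between
the setpoint and the dirty limit:

        /* 1.0 at the setpoint, 0.0 at the dirty limit; 2^10 fixed point */
        pos_ratio = (u64)(limit - dirty) * (1 << 10) / (limit - setpoint);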

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-10 10:25         ` Peter Zijlstra
@ 2011-08-10 11:13           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 11:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 06:25:48PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-10 at 11:40 +0800, Wu Fengguang wrote:
> > On Wed, Aug 10, 2011 at 02:35:06AM +0800, Peter Zijlstra wrote:
> > > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > > > 
> > > > Add two fields to task_struct.
> > > > 
> > > > 1) account dirtied pages in the individual tasks, for accuracy
> > > > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > > > 
> > > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > > > scale near-sqrt to the safety gap between dirty pages and threshold.
> > > > 
> > > > XXX: The main problem of per-task nr_dirtied is that, if 10k tasks start
> > > > dirtying pages at exactly the same time, each task will be assigned a
> > > > large initial nr_dirtied_pause, so that the dirty threshold will be
> > > > exceeded long before each task has reached its nr_dirtied_pause and hence
> > > > calls balance_dirty_pages(). 
> > > 
> > > Right, so why remove the per-cpu threshold? You can keep that as a bound
> > > on the number of outstanding dirty pages.
> > 
> > Right, I also have the vague feeling that the per-cpu threshold can
> > somehow back up the per-task threshold in case there are too many tasks.
> > 
> > > Losing that bound is actually a bad thing (TM), since you could have
> > > configured a tight dirty limit and lock up your machine this way.
> > 
> > It seems good enough to only remove the 4MB upper limit for
> > ratelimit_pages, so that the per-cpu limit won't kick in too
> > frequently on typical machines.
> > 
> >   * Here we set ratelimit_pages to a level which ensures that when all CPUs are
> >   * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
> >   * thresholds before writeback cuts in.
> > - *
> > - * But the limit should not be set too high.  Because it also controls the
> > - * amount of memory which the balance_dirty_pages() caller has to write back.
> > - * If this is too large then the caller will block on the IO queue all the
> > - * time.  So limit it to four megabytes - the balance_dirty_pages() caller
> > - * will write six megabyte chunks, max.
> > - */
> > -
> >  void writeback_set_ratelimit(void)
> >  {
> >         ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
> >         if (ratelimit_pages < 16)
> >                 ratelimit_pages = 16;
> > -       if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
> > -               ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
> >  }
> 
> Uhm, so what's your bound then? 1/32 of the per-cpu memory seems rather
> a lot.

Ah yes, vm_total_pages is no longer suitable here; we may use

        ratelimit_pages = dirty_threshold / (num_online_cpus() * 32);

We just need to ensure the dirty_threshold won't be exceeded too much
in the rare case that tsk->nr_dirtied_pause cannot keep dirty pages
under control, such as when there are >10k dirtier tasks.
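
A sketch of how writeback_set_ratelimit() might then look (assuming
global_dirty_limits() is used to fetch the current threshold):

        void writeback_set_ratelimit(void)
        {
                unsigned long background_thresh;
                unsigned long dirty_thresh;

                global_dirty_limits(&background_thresh, &dirty_thresh);
                ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
                if (ratelimit_pages < 16)
                        ratelimit_pages = 16;
        }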

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09  9:31             ` Peter Zijlstra
@ 2011-08-10 12:28               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 12:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 05:31:44PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> > origin is where the control line crosses the X axis (in both the
> > global/bdi setpoint cases). 
> 
> Ah, that's normally called a zero, root or x-intercept:
> 
> http://en.wikipedia.org/wiki/X-intercept

Yes indeed! I'll change the name to x_intercept.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:16       ` Peter Zijlstra
@ 2011-08-10 14:00         ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 12:16:30AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > 
> > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate
> > limit (based on position ratio, dirty_bw and write_bw). But this seems
> > to be overall bdi limit and does not seem to take into account the
> > number of tasks doing IO to that bdi (as your comment suggests).
> > So it probably will track write_bw as opposed to write_bw/N. What
> > am I missing? 

In the normal situation (near the setpoints),

   task_ratelimit ~= bdi->dirty_ratelimit ~= write_bw / N

Yes, dirty_ratelimit is a per-bdi variable, because all tasks share
roughly the same dirty ratelimit for the obvious reason of fairness.
 
> I think the per task thing comes from him using the pages_dirtied
> argument to balance_dirty_pages() to compute the sleep time.

Yeah. Ultimately it will allow different tasks to be throttled at
different (user specified) rates.

> Although I'm not quite sure how he keeps fairness in light of the
> sleep time being bounded to MAX_PAUSE.

Firstly, MAX_PAUSE will only be applied when the dirty pages rush
high (dirty exceeded).  Secondly, the dirty exceeded state is global
to all tasks, in which case each task will sleep for MAX_PAUSE equally.
So fairness is still maintained in the dirty exceeded state.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:19         ` Peter Zijlstra
@ 2011-08-10 14:07           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 12:19:32AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 18:16 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 11:50 -0400, Vivek Goyal wrote:
> > > 
> > > So IIUC, bdi->dirty_ratelimit is the dynamically adjusted desired rate
> > > limit (based on position ratio, dirty_bw and write_bw). But this seems
> > > to be overall bdi limit and does not seem to take into account the
> > > number of tasks doing IO to that bdi (as your comment suggests). So
> > > it probably will track write_bw as opposed to write_bw/N. What am
> > > I missing? 
> > 
> > I think the per task thing comes from him using the pages_dirtied
> > argument to balance_dirty_pages() to compute the sleep time. Although
> > I'm not quite sure how he keeps fairness in light of the sleep time
> > bounding to MAX_PAUSE.
> 
> Furthermore, there's of course the issue that current->nr_dirtied is
> computed over all BDIs it dirtied pages from, and the sleep time is
> computed for the BDI it happened to do the overflowing write on.
> 
> Assuming a task (mostly) writes to a single bdi, or equally to all, it
> should all work out.

Right. That's one pitfall I forgot to mention, sorry.

If _really_ necessary, the above imperfection can be avoided by adding
tsk->last_dirty_bdi and tsk->to_pause, and doing the following when
switching to another bdi:

        /* pause owed so far, in jiffies (note the HZ factor) */
        to_pause += HZ * nr_dirtied / task_ratelimit;
        if (to_pause > reasonable_large_pause_time) {
                sleep(to_pause);
                to_pause = 0;
        }
        nr_dirtied = 0;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 16:56     ` Peter Zijlstra
  (?)
  (?)
@ 2011-08-10 14:10     ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 600 bytes --]

On Wed, Aug 10, 2011 at 12:56:56AM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> >              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
> 
> I can't actually find this low-pass filter in the code.. could be I'm
> blind from staring at it too long though..

Sorry, it's implemented in another patch (attached). I've also removed
it from _this_ changelog.

Here you can find all the other patches in addition to the core bits.

http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=shortlog;h=refs/heads/dirty-throttling-v8%2B

Thanks,
Fengguang

[-- Attachment #2: smooth-base-bw --]
[-- Type: text/plain, Size: 2488 bytes --]

Subject: writeback: make dirty_ratelimit stable/smooth
Date: Thu Aug 04 22:05:05 CST 2011

Halve the dirty_ratelimit update step size to avoid overshooting, and
further slow down the updates when the tracking error is smaller than
(base_rate / 8).

It's desirable to have a _constant_ dirty_ratelimit given a stable
workload, because each jolt of dirty_ratelimit will directly show up
in all the bdi tasks' dirty rates.

The cost will be slightly increased dirty position error, which is
pretty acceptable.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-10 21:35:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-10 21:35:31.000000000 +0800
@@ -741,6 +741,7 @@ static void bdi_update_dirty_ratelimit(s
 	unsigned long dirty_rate;
 	unsigned long pos_rate;
 	unsigned long balanced_rate;
+	unsigned long delta;
 	unsigned long long pos_ratio;
 
 	/*
@@ -755,7 +756,6 @@ static void bdi_update_dirty_ratelimit(s
 	 * pos_rate reflects each dd's dirty rate enforced for the past 200ms.
 	 */
 	pos_rate = base_rate * pos_ratio >> BANDWIDTH_CALC_SHIFT;
-	pos_rate++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
 
 	/*
 	 * balanced_rate = pos_rate * write_bw / dirty_rate
@@ -777,14 +777,32 @@ static void bdi_update_dirty_ratelimit(s
 	 * makes it more stable, but also is essential for preventing it being
 	 * driven away by possible systematic errors in balanced_rate.
 	 */
+	delta = 0;
 	if (base_rate > pos_rate) {
 		if (base_rate > balanced_rate)
-			base_rate = max(balanced_rate, pos_rate);
+			delta = base_rate - max(balanced_rate, pos_rate);
 	} else {
 		if (base_rate < balanced_rate)
-			base_rate = min(balanced_rate, pos_rate);
+			delta = min(balanced_rate, pos_rate) - base_rate;
 	}
 
+	/*
+	 * Don't pursue 100% rate matching. It's impossible since the balanced
+	 * rate itself is constantly fluctuating. So decrease the track speed
+	 * when it gets close to the target. Eliminates unnecessary jolting.
+	 */
+	delta >>= base_rate / (8 * delta + 1);
+	/*
+	 * Limit the step size to avoid overshooting. It also implicitly
+	 * prevents dirty_ratelimit from dropping to 0.
+	 */
+	delta >>= 2;
+
+	if (base_rate < pos_rate)
+		base_rate += delta;
+	else
+		base_rate -= delta;
+
 	bdi->dirty_ratelimit = base_rate;
 
 	trace_dirty_ratelimit(bdi, dirty_rate, pos_rate, balanced_rate);
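
To see the step-size logic with assumed numbers: at base_rate = 65536
pages/s, a large error delta = 16384 gives 65536 / (8 * 16384 + 1) == 0,
so only the final "delta >>= 2" applies and the step is 4096, a quarter
of the error; a small error delta = 1024 shifts by 7 and then by 2 more,
leaving a step of just 2 pages/s. Tracking thus slows down sharply once
the error falls below roughly base_rate / 8, as the changelog says.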

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 17:02     ` Peter Zijlstra
@ 2011-08-10 14:15       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-10 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 01:02:02AM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> 
> > +       pos_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT;
> > +       pos_bw++;  /* this avoids bdi->dirty_ratelimit get stuck in 0 */
> > +
> 
> > +       pos_ratio *= bdi->avg_write_bandwidth;
> > +       do_div(pos_ratio, dirty_bw | 1);
> > +       ref_bw = bw * pos_ratio >> BANDWIDTH_CALC_SHIFT; 
> 
> when written out that results in:
> 
>            bw * pos_ratio * bdi->avg_write_bandwidth
>   ref_bw = -----------------------------------------
>                          dirty_bw
> 
> which would suggest you write it like:
> 
>   ref_bw = div_u64((u64)pos_bw * bdi->avg_write_bandwidth, dirty_bw | 1);
> 
> since pos_bw is already bw * pos_ratio per the above.

Good point. Oops, I even wrote a comment for the over-complex calculation:

         * balanced_rate = pos_rate * write_bw / dirty_rate

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 11:07       ` Wu Fengguang
  (?)
@ 2011-08-10 16:17         ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-10 16:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

How about something like the below? It still needs some more work, but
it's more or less complete in that it now explains both controls in one
story. The actual update bit is still missing.

---

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

	ratelimit = writeout_bandwidth

The fairness requirement gives us:

	task_ratelimit = write_bandwidth / N

> : When N dd tasks are started, we would like to throttle each dd at
> : 
> :          balanced_rate == write_bw / N                                  (1)
> : 
> : We don't know N beforehand, but can still estimate balanced_rate
> : within 200ms.
> : 
> : Start by throttling each dd task at rate
> : 
> :         task_ratelimit = task_ratelimit_0                               (2)
> :                          (any non-zero initial value is OK)
> : 
> : After 200ms, we got
> : 
> :         dirty_rate = # of pages dirtied by all dd's / 200ms
> :         write_bw   = # of pages written to the disk / 200ms
> : 
> : For the aggressive dd dirtiers, the equality holds
> : 
> :         dirty_rate == N * task_rate
> :                    == N * task_ratelimit
> :                    == N * task_ratelimit_0                              (3)
> : Or
> :         task_ratelimit_0 = dirty_rate / N                               (4)
> :                           
> : So the balanced throttle bandwidth can be estimated by
> :                           
> :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> :                           
> : Because with (4) and (5) we can get the desired equality (1):
> :                           
> :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> :                       == write_bw / N

Then, using the balanced_rate, we can compute task pause times like:

	task_pause = task->nr_dirtied / task_ratelimit

[ however all that still misses the primary feedback of:

   task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)

  there's still some confusion in the above due to task_ratelimit and
  balanced_rate.
]

However, while the above gives us a means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

> So if the dirty pages are ABOVE the setpoints, we throttle each task
> a bit more HEAVILY than balanced_rate, so that the dirty pages are
> created more slowly than they are cleaned and thus DROP back to the
> setpoints (and the reverse). With that positional adjustment, the
> formula is transformed from
> 
>         task_ratelimit = balanced_rate
> 
> to
> 
>         task_ratelimit = balanced_rate * pos_ratio

> In terms of negative feedback control theory, the
> bdi_position_ratio() function (the control lines) can be expressed as
> 
> 1) f(setpoint) = 1.0
> 2) df/dx < 0
> 
> 3) optionally, abs(df/dx) should be large on large errors (= dirty -
>    setpoint) in order to cancel the errors fast, and be smaller when
>    dirty pages get closer to the setpoints in order to avoid overshooting.



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 14:00         ` Wu Fengguang
@ 2011-08-10 17:10           ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-10 17:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> 
> > Although I'm not quite sure how he keeps fairness in light of the
> > sleep time bounding to MAX_PAUSE.
> 
> Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> high (dirty exceeded).  Secondly, the dirty exceeded state is global
> to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> So the fairness is still maintained in dirty exceeded state. 

It's not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
but having everybody sleep MAX_PAUSE doesn't necessarily mean it's fair;
it's only fair if they dirty at the same rate.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-10  3:29       ` Wu Fengguang
@ 2011-08-10 18:18         ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-10 18:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Wed, Aug 10, 2011 at 11:29:54AM +0800, Wu Fengguang wrote:

[..]
> > > -	ratelimit = ratelimit_pages;
> > > -	if (mapping->backing_dev_info->dirty_exceeded)
> > > +	ratelimit = current->nr_dirtied_pause;
> > > +	if (bdi->dirty_exceeded)
> > >  		ratelimit = 8;
> > 
> > Should we make sure that ratelimit is more than 8? It could be that
> > ratelimit is 1 and we set it higher (just reverse of what we wanted?)
> 
> Good catch! I actually just fixed it in that direction :)
> 
>         if (bdi->dirty_exceeded)
> -               ratelimit = 8;
> +               ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

With page size 64K (PAGE_SHIFT == 16), the above becomes 32 >> 6 == 0 --
will that lead to ratelimit 0? Is that what you want? I wouldn't think so.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-10  7:41         ` Greg Thelen
@ 2011-08-10 18:40           ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-10 18:40 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Minchan Kim, Wu Fengguang, Dave Chinner, Christoph Hellwig, LKML,
	Andrea Righi, Andrew Morton, linux-fsdevel, linux-mm, Jan Kara,
	KAMEZAWA Hiroyuki

On Wed, Aug 10, 2011 at 12:41:00AM -0700, Greg Thelen wrote:

[..]
> > > However, before we have a "finished product", there is still another
> > > piece of the puzzle to be put in place - memcg-aware buffered
> > > writeback. That is, having a flusher thread do work on behalf of
> > > memcg in the IO context of the memcg. Then the IO controller just
> > > sees a stream of async writes in the context of the memcg the
> > > buffered writes came from in the first place. The block layer
> > > throttles them just like any other IO in the IO context of the
> > > memcg...
> >
> > Yes that is still a piece remaining. I was hoping that Greg Thelen will
> > be able to extend his patches to submit writes in the context of
> > per cgroup flusher/worker threads and solve this problem.
> >
> > Thanks
> > Vivek
> 
> Are you suggesting multiple flushers per bdi (one per cgroup)?  I
> thought the point of IO-less was to only issue buffered writes from a
> single thread.

I think in one of the mail threads Dave Chinner mentioned this idea
of using a per-cgroup worker/workqueue (sketched roughly below).

Agreed that it leads back to the issue of multiple writers (but only
if multiple cgroups are there). But at the same time it simplifies
at least two problems.

- The worker could be migrated to the cgroup we are writing for, and we
  don't need the IO tracking logic. The blkio controller will
  automatically account the IO to the right group.

- We don't have to worry about a single flusher thread sleeping
  on the request queue because either the queue or the group is
  congested, which can lead to other groups' IO not being submitted.
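
As a rough sketch of that shape (all names here are made up, not from
any posted patch; only the structure matters):

	struct memcg_flusher {
		struct mem_cgroup	*memcg;
		struct work_struct	work;
	};

	static void memcg_writeback_fn(struct work_struct *work)
	{
		struct memcg_flusher *f =
			container_of(work, struct memcg_flusher, work);

		/*
		 * Issue writeback for pages owned by f->memcg, so the
		 * blkio controller sees the IO in that cgroup's context.
		 */
	}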

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
@ 2011-08-10 21:40             ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-10 21:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 07:05:35AM +0800, Wu Fengguang wrote:
> On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> > On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> > >         goal = thresh - thresh / DIRTY_SCOPE;
> > >         origin = 4 * thresh;
> > >  
> > > -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > > -               origin = limit;                 /* auxiliary control line */
> > > -               goal = (goal + origin) / 2;
> > > -               pos_ratio >>= 1;
> > > -       }
> > >         pos_ratio = origin - dirty;
> > >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> > >         do_div(pos_ratio, origin - goal + 1); 
> 
> FYI I've updated the fix to the below one, so that @limit will be used
> as the origin in the rare case of (4*thresh < dirty).
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
> @@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
>  	 * global setpoint
>  	 */
>  	goal = thresh - thresh / DIRTY_SCOPE;
> -	origin = 4 * thresh;
> +	origin = max(4 * thresh, limit);

Hi Fengguang,

Ok, so I'm just trying to understand this pos_ratio a little better.

You have the following basic formula.

                     origin - dirty
         pos_ratio = --------------
                     origin - goal

The terminology is very confusing; the following is my understanding.

- setpoint == goal

  setpoint is the point where we would like our number of dirty pages to
  be, and at this point pos_ratio = 1. For global dirty this number seems
  to be (thresh - thresh / DIRTY_SCOPE).

- thresh
  dirty page threshold calculated from dirty_ratio (a certain percentage
  of total memory).

- Origin (seems to be the equivalent of limit)

  This seems to be the reference point/limit we don't want to cross, and
  the distance from this limit basically decides the pos_ratio. The
  closer we are to the limit, the lower the pos_ratio; the further away
  we are, the higher the pos_ratio.

So threshold is just a number which helps us determine goal and limit.

goal = thresh - thresh / DIRTY_SCOPE
limit = 4*thresh

So goal is where we want to be, and we start throttling the task more as
we move away from goal and approach limit. We keep the limit high enough
that (origin - dirty) does not become negative.

So we do expect to cross "thresh", otherwise thresh itself could have
served as the limit?

If my understanding is right, can we get rid of the terms "setpoint" and
"origin"? Would it be easier to understand things if we just talk in
terms of "goal" and "limit" and how these are derived from "thresh"?

	thresh == soft limit
	limit == 4*thresh (hard limit)
	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
						be in steady state).
                     limit - dirty
         pos_ratio = --------------
                     limit - goal
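
For a concrete (made-up) numeric example, taking DIRTY_SCOPE == 8 for
illustration: with thresh = 1000 pages we get goal = 1000 - 1000/8 = 875
and limit = 4 * 1000 = 4000, so

	dirty =  875:  pos_ratio = (4000 -  875) / (4000 - 875) = 1.0
	dirty = 2000:  pos_ratio = (4000 - 2000) / 3125        ~= 0.64
	dirty = 4000:  pos_ratio = 0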

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20             ` Peter Zijlstra
@ 2011-08-10 22:34               ` Jan Kara
  -1 siblings, 0 replies; 305+ messages in thread
From: Jan Kara @ 2011-08-10 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                     origin - dirty
> > >         pos_ratio = --------------
> > >                     origin - goal 
> > 
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> > 
> > OK, so basically you want a linear function for which:
> > 
> > f(goal) = 1 and has a root somewhere > goal.
> > 
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> > 
> > That does indeed get you the above function, now what does it mean? 
> 
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw

  Actually, thinking about these formulas, why do we even bother with
computing all these factors like write_bw, dirty_bw, pos_ratio, ...
Couldn't we just have a feedback loop (probably similar to the one
computing pos_ratio) which maintains a single value - the ratelimit? When
we are getting close to the dirty limit, we scale the ratelimit down;
when we get significantly below the dirty limit, we scale the ratelimit
up.  Because looking at the formulas it seems to me that the net effect
is the same - pos_ratio basically overrules everything...
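
Sketched in (made-up) code, such a loop might be as simple as:

	/*
	 * Illustrative only -- a single-value feedback loop with
	 * invented step sizes, run once per estimation interval.
	 */
	static unsigned long ratelimit = 1024;	/* pages per second */

	static void update_ratelimit(unsigned long dirty,
				     unsigned long setpoint)
	{
		if (dirty > setpoint)
			ratelimit -= ratelimit / 8;	/* too dirty: slow writers down */
		else
			ratelimit += ratelimit / 8;	/* below setpoint: speed them up */
	}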

> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
> 
> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1
> 
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh), something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.
> 
> 
> Now all of the above would seem to suggest:
> 
>   dirty_ratelimit := ref_bw
> 
> However for that you use:
> 
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> 	dirty_ratelimit = max(ref_bw, pos_bw);
> 
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> 	dirty_ratelimit = min(ref_bw, pos_bw);
> 
> You have:
> 
>   pos_bw = dirty_ratelimit * pos_ratio
> 
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-10 18:18         ` Vivek Goyal
@ 2011-08-11  0:55           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-11  0:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Thu, Aug 11, 2011 at 02:18:54AM +0800, Vivek Goyal wrote:
> On Wed, Aug 10, 2011 at 11:29:54AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > > -	ratelimit = ratelimit_pages;
> > > > -	if (mapping->backing_dev_info->dirty_exceeded)
> > > > +	ratelimit = current->nr_dirtied_pause;
> > > > +	if (bdi->dirty_exceeded)
> > > >  		ratelimit = 8;
> > > 
> > > Should we make sure that ratelimit is more than 8? It could be that
> > > ratelimit is 1 and we set it higher (just reverse of what we wanted?)
> > 
> > Good catch! I actually just fixed it in that direction :)
> > 
> >         if (bdi->dirty_exceeded)
> > -               ratelimit = 8;
> > +               ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> 
> With page size 64K (PAGE_SHIFT == 16), the above becomes 32 >> 6 == 0 --
> will that lead to ratelimit 0? Is that what you want? I wouldn't think so.

Yeah, it looks a bit weird.. however ratelimit=0 would behave the
same as ratelimit=1 because balance_dirty_pages_ratelimited_nr()
is always called with (nr_pages_dirtied >= 1).
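
To illustrate (not the exact patch code):

	/*
	 * current->nr_dirtied advances by at least 1 on every call, so
	 * ratelimit 0 and ratelimit 1 both trip the check every time.
	 */
	current->nr_dirtied += nr_pages_dirtied;	/* >= 1 per call */
	if (current->nr_dirtied >= ratelimit)
		balance_dirty_pages(mapping, current->nr_dirtied);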

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 22:34               ` Jan Kara
@ 2011-08-11  2:29                 ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-11  2:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > >                     origin - dirty
> > > >         pos_ratio = --------------
> > > >                     origin - goal 
> > > 
> > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > pos_ratio == 1.0:
> > > 
> > > OK, so basically you want a linear function for which:
> > > 
> > > f(goal) = 1 and has a root somewhere > goal.
> > > 
> > > (that one line is much more informative than all your graphs put
> > > together, one can start from there and derive your function)
> > > 
> > > That does indeed get you the above function, now what does it mean? 
> > 
> > So going by:
> > 
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
> 
>   Actually, thinking about these formulas, why do we even bother with
> computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> Couldn't we just have a feedback loop (probably similar to the one
> computing pos_ratio) which will maintain single value - ratelimit? When we
> are getting close to dirty limit, we will scale ratelimit down, when we
> will be getting significantly below dirty limit, we will scale the
> ratelimit up.  Because looking at the formulas it seems to me that the net
> effect is the same - pos_ratio basically overrules everything... 

Good question. That is actually one of the early approaches I tried.
It somehow worked, however the resulting ratelimit is not only slow to
respond, but also oscillates all the time.

This is due to the following imperfections:

1) pos_ratio at best only provides a "direction" for adjusting the
   ratelimit. There are only vague clues that if pos_ratio is small,
   the errors in the ratelimit should be small.

2) Due to time lag, the assumptions in (1) about "direction" and
   "error size" can be wrong. The ratelimit may already be
   over-adjusted by the time the dirty pages approach the
   setpoint. The larger the memory, the more time lag, and the easier
   it is to overshoot and oscillate.

3) dirty pages are constantly fluctuating around the setpoint,
   and so is pos_ratio.

With (1) and (2), it's a control system very susceptible to disturbances.
With (3) we get constant disturbances. Well, I had a very hard time and
played dirty tricks (which you may never want to know ;-) trying to
trade off between response time and stability..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-09  2:01   ` Vivek Goyal
@ 2011-08-11  3:21     ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-11  3:21 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

> [...] it only deals with controlling buffered write IO and nothing
> else. So on the same block device, other direct writes might be
> going on from same group and in this scheme a user will not have any
> control.

The IO-less balance_dirty_pages() will be able to throttle DIRECT
writes. There is nothing fundamental in the way.

The basic approach will be to add a balance_dirty_pages_ratelimited_nr()
call in the DIRECT write path, and to call into balance_dirty_pages()
regardless of the various dirty thresholds.

Then the IO-less balance_dirty_pages() has all the facilities to
throttle a task at any auto-estimated or user-specified ratelimit.
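
Roughly (a hypothetical sketch; do_direct_write() is a made-up
placeholder, and the real change would also have to make
balance_dirty_pages() throttle below the usual dirty thresholds):

	ssize_t written = do_direct_write(file, buf, count);

	if (written > 0)
		balance_dirty_pages_ratelimited_nr(file->f_mapping,
						   written >> PAGE_SHIFT);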

> Another disadvantage is that throttling at page cache level does not
> take care of IO spikes at device level.

Yes, this is a problem. But it's a problem best fixed in the IO
scheduler.. (I cannot go into details at this time, however it does
_sound_ possible to me..)

> How do you implement proportional control here? From overall bdi bandwidth
> vary per cgroup bandwidth regularly based on cgroup weight? Again the
> issue here is that it controls only buffered WRITES and nothing else and
> in this case co-ordinating with CFQ will probably be hard. So I guess
> usage of proportional IO just for buffered WRITES will have limited
> usage.

"priority" may be a more suitable phrase. It will be implemented like
this (without the user interface):

@@ -1007,6 +1001,13 @@ static void balance_dirty_pages(struct a
                max_pause = bdi_max_pause(bdi, bdi_dirty);
               
                base_rate = bdi->dirty_ratelimit;
+               /*
+                * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+                * real-time tasks.
+                */
+               if (current->flags & PF_LESS_THROTTLE || rt_task(current))
+                       base_rate *= 2;
+              
                pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
                                               background_thresh, nr_dirty,
                                               bdi_thresh, bdi_dirty);

That is, if we start 2 dd tasks A and B with priority_B=2, the
resulting rate_B will be equal to 2*rate_A. The ->dirty_ratelimit will
auto-adapt to rate_A, or equivalently write_bw/3, since rate_A +
rate_B = 3 * rate_A must match write_bw.

The same can be applied to cgroups. One may specify that a whole cgroup's
dirty rate be throttled at N times that of a normal dd in the root cgroup,
or at some absolute rate like 10MB/s. The corresponding
cgroup->dirty_ratelimit will be set to (N * bdi->dirty_ratelimit) for
the former and 10MB/s for the latter.

The user can specify any combination of "priority" and "absolute
ratelimit" for any task and/or cgroup, tasks inside a cgroup, and so on.
We have a very powerful (bdi or cgroup)->dirty_ratelimit adaptation
mechanism to support the combinations :)
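
Sketched with made-up field names (not from any posted patch), the
combination could look like:

	base_rate = bdi->dirty_ratelimit;
	if (memcg->dirty_rate_abs)		/* absolute cap, e.g. 10MB/s */
		base_rate = memcg->dirty_rate_abs;
	else if (memcg->dirty_priority)		/* N times a normal dd */
		base_rate *= memcg->dirty_priority;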

The "priority" can even be applied to DIRECT dirtiers, _as long as_
there are other buffered dirtiers to generate enough dirty pages. It's
not as easy to apply priorities when there are only DIRECT dirtiers.
In contrast, the absolute ratelimit is always applicable to all kinds
of tasks and cgroups.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-09 14:54     ` Vivek Goyal
@ 2011-08-11  3:42       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-11  3:42 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Tue, Aug 09, 2011 at 10:54:38PM +0800, Vivek Goyal wrote:
> On Sat, Aug 06, 2011 at 04:44:50PM +0800, Wu Fengguang wrote:
> > It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
> > when there are N dd tasks.
> > 
> > On write() syscall, use bdi->dirty_ratelimit
> > ============================================
> > 
> >     balance_dirty_pages(pages_dirtied)
> >     {
> >         pos_bw = bdi->dirty_ratelimit * bdi_position_ratio();
> >         pause = pages_dirtied / pos_bw;
> >         sleep(pause);
> >     }
> > 
> > On every 200ms, update bdi->dirty_ratelimit
> > ===========================================
> > 
> >     bdi_update_dirty_ratelimit()
> >     {
> >         bw = bdi->dirty_ratelimit;
> >         ref_bw = bw * bdi_position_ratio() * write_bw / dirty_bw;
> >         if (dirty pages unbalanced)
> >              bdi->dirty_ratelimit = (bw * 3 + ref_bw) / 4;
> >     }
> > 
> > Estimation of balanced bdi->dirty_ratelimit
> > ===========================================
> > 
> > When started N dd, throttle each dd at
> > 
> >          task_ratelimit = pos_bw (any non-zero initial value is OK)
> > 
> > After 200ms, we got
> > 
> >          dirty_bw = # of pages dirtied by app / 200ms
> >          write_bw = # of pages written to disk / 200ms
> > 
> > For aggressive dirtiers, the equality holds
> > 
> >          dirty_bw == N * task_ratelimit
> >                   == N * pos_bw                      	(1)
> > 
> > The balanced throttle bandwidth can be estimated by
> > 
> >          ref_bw = pos_bw * write_bw / dirty_bw       	(2)
> > 
> > From (1) and (2), we get equality
> > 
> >          ref_bw == write_bw / N                      	(3)
> > 
> > If the N dd's are all throttled at ref_bw, the dirty/writeback rates
> > will match. So ref_bw is the balanced dirty rate.
> 
> Hi Fengguang,

Hi Vivek,

> So how much work is it to extend all this to handle the case of cgroups?

Here is the simplest form.

writeback: async write IO controllers
http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=blobdiff;f=mm/page-writeback.c;h=0b579e7fd338fd1f59cc36bf15fda06ff6260634;hp=34dff9f0d28d0f4f0794eb41187f71b4ade6b8a2;hb=1a58ad99ce1f6a9df6618a4b92fa4859cc3e7e90;hpb=5b6fcb3125ea52ff04a2fad27a51307842deb1a0

And an old email on this topic:

https://lkml.org/lkml/2011/4/28/229

> IOW, I would imagine that you shall have to keep track of per cgroup/per
> bdi state of many of the variables. For example, write_bw will become
> per cgroup/per bdi entity instead of per bdi entity only. Same should
> be true for position ratio, dirty_bw etc?
 
The dirty_bw, write_bw and dirty_ratelimit should be replicated,
but not necessarily dirty pages and position ratio.

The cgroup can just rely on the root cgroup's dirty pages position
control if it does not care about its own dirty page consumption.

> I am assuming that if some cgroup is low weight on end device, then
> WRITE bandwidth of that cgroup should go down and that should be
> accounted for at per bdi state and task throttling should happen
> accordingly so that a lower weight cgroup tasks get throttled more
> as compared to higher weight cgroup tasks?

Sorry, I don't quite catch your meaning, but shouldn't the current
->dirty_ratelimit adaptation scheme (detailed in another email)
handle all such rate/bw allocation issues automatically?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11  2:29                 ` Wu Fengguang
@ 2011-08-11 11:14                   ` Jan Kara
  -1 siblings, 0 replies; 305+ messages in thread
From: Jan Kara @ 2011-08-11 11:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, linux-fsdevel, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > >                     origin - dirty
> > > > >         pos_ratio = --------------
> > > > >                     origin - goal 
> > > > 
> > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > pos_ratio == 1.0:
> > > > 
> > > > OK, so basically you want a linear function for which:
> > > > 
> > > > f(goal) = 1 and has a root somewhere > goal.
> > > > 
> > > > (that one line is much more informative than all your graphs put
> > > > together, one can start from there and derive your function)
> > > > 
> > > > That does indeed get you the above function, now what does it mean? 
> > > 
> > > So going by:
> > > 
> > >                                          write_bw
> > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > >                                          dirty_bw
> > 
> >   Actually, thinking about these formulas, why do we even bother with
> > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > Couldn't we just have a feedback loop (probably similar to the one
> > computing pos_ratio) which maintains a single value - the ratelimit? When
> > we are getting close to the dirty limit, we will scale the ratelimit down;
> > when we are significantly below the dirty limit, we will scale the
> > ratelimit up. Because looking at the formulas, it seems to me that the
> > net effect is the same - pos_ratio basically overrules everything...
> 
> Good question. That is actually one of the early approaches I tried.
> It somehow worked, however the resulting ratelimit was not only slow
> to respond, but also oscillated all the time.
  Yes, I think I vaguely remember that.

> This is due to the following imperfections:
> 
> 1) pos_ratio at best only provides a "direction" for adjusting the
>    ratelimit. There are only vague clues that if pos_ratio is small,
>    the errors in ratelimit should be small.
> 
> 2) Due to time lag, the assumptions in (1) about "direction" and
>    "error size" can be wrong. The ratelimit may already be
>    over-adjusted by the time the dirty pages approach the
>    setpoint. The larger the memory, the longer the time lag, and the
>    easier it is to overshoot and oscillate.
> 
> 3) dirty pages are constantly fluctuating around the setpoint, and
>    so is pos_ratio.
> 
> With (1) and (2), it's a control system very susceptible to
> disturbances. With (3) we get constant disturbances. Well, I had a
> very hard time and played dirty tricks (which you may never want to
> know ;-) trying to trade off between response time and stability..
  Yes, I can see that especially 2) is a problem. But I don't understand
why your current formula would be that much different. As Peter decoded
from your code, your current formula is:
                                        write_bw
 ref_bw = dirty_ratelimit * pos_ratio * --------
                                        dirty_bw

while previously it was essentially:
 ref_bw = dirty_ratelimit * pos_ratio

So what is so magical about computing write_bw and dirty_bw separately?
Is it because previously you did not use the derivative of the distance
from the goal for updating pos_ratio? Because in your current formula
write_bw/dirty_bw is a derivative of the position...
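
(Spelling the derivative out: by the definitions above,
d(dirty)/dt = dirty_bw - write_bw, so write_bw/dirty_bw == 1 exactly
when the number of dirty pages is stationary; the ratio is in effect
measuring how fast the dirty position is drifting.)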

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-11  3:21     ` Wu Fengguang
@ 2011-08-11 20:42       ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-11 20:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Thu, Aug 11, 2011 at 11:21:43AM +0800, Wu Fengguang wrote:
> > [...] it only deals with controlling buffered write IO and nothing
> > else. So on the same block device, other direct writes might be
> > going on from same group and in this scheme a user will not have any
> > control.
> 
> The IO-less balance_dirty_pages() will be able to throttle DIRECT
> writes. There is nothing fundamental in the way.
> 
> The basic approach will be to add a balance_dirty_pages_ratelimited_nr()
> call in the DIRECT write path, and to call into balance_dirty_pages()
> regardless of the various dirty thresholds.
> 
> Then the IO-less balance_dirty_pages() has all the facilities to
> throttle a task at any auto-estimated or user-specified ratelimit.

A direct IO being routed through balance_dirty_pages() when it is really
not dirtying anything sounds really odd to me.

What about direct AIO? Throttling direct IO at balance_dirty_pages() is
a little different from throttling at the device level, where we build a
buffer of requests and submit them asynchronously (even when the
submitter has crossed the threshold/rate). The submitter does not have
to block and can go back to user space and do other things while
waiting for completion of the submitted IO.

You know what, since the beginning you have been talking about how
this mechanism can be extended to do some IO control. That's fine.
I think a more fruitful discussion can happen if we approach the
problem in a different way, and that is: let's figure out what the
requirements are, what the problems are, what we need to control,
what the best place to control something is, and how the interface
is going to look.

Once we figure out the interfaces and what we are trying to achieve,
the rest of it is just mechanism; your method is one possible way of
implementing things, and then we can discuss the advantages and
disadvantages of the various mechanisms.

What do we want
---------------

To me the basic problem is as follows. We primarily want to provide
two controls, at least at the cgroup level. If the same can be extended
to the task level, that's a bonus.

- Notion of iopriority (work conserving control, proportional IO)
- Absolute limits (non work conserving control, throttling)

What do we currently have
-------------------------
- Proportional IO is implemented at the device level in the CFQ IO
  scheduler.
	- It works both at the task level (ioprio) and the group level
	  (blkio.weight). The only problem is that it works only for
	  synchronous IO and does not cover buffered WRITEs.

- Throttling
	- Implemented at the block layer (per device). Works for groups.
	  There is no per-task interface. Again, it works for synchronous
	  IO and does not cover buffered writes.

So to me, in the current scheme of things, there is only one big problem
to be solved:

- How to control buffered writes.
	- proportional IO
	- absolute throttling

Proportional IO
---------------
- Because we lose all the context information of the submitter by the
  time IO reaches CFQ, for task ioprio it is probably best to do
  something about it when writing to the bdi. So your scheme sounds
  like a good candidate for that.

- At the cgroup level, things get a little more complicated, as priority
  belongs to the whole group and a group could be doing some READs, some
  direct WRITEs and some buffered WRITEs. If we implement a group's
  proportional write control at the page cache level, we have the
  following issue.

	- bdi based control does not know about READs and direct WRITEs.
	  Now assume that a high prio group is doing just buffered writes
	  and a low prio group is doing READs. CFQ will choke the WRITEs
	  behind the READs, and effectively the higher prio group does
	  not get its share.

  So I think doing proportional IO control at the device level provides
  better control overall and better integration with cgroups.

Throttling
----------
-  Throttling of buffered WRITEs can be done at the page cache level and
   it makes sense to me in general. There seem to be two primary issues
   we need to think about.

	- It should gel well with the current IO controller interfaces.
	  Either we provide a separate control file in the blkio
	  controller which only controls the buffered write rate, or we
	  come up with a way so that a common control knows about both
	  direct and buffered writes and the control can come out of a
	  common quota. For example, if somebody says that 10MB/s is the
	  write limit for this cgroup on device 8:32, then that limit is
	  effective both for direct writes as well as buffered writes.

	  Alternatively, we could implement a separate control file, say
	  blkio.throttle.buffered_write_bps_device, where one specifies
	  the buffered write rate of a cgroup on a device, and your logic
	  parses it and controls it. The direct IO control limit then
	  comes from the separate existing file,
	  blkio.throttle.write_bps_device. In my opinion it is a less
	  integrated approach and users will find it less friendly to
	  configure. (See the example at the end of this section.)

	- IO spikes at the device when the flusher cleans up dirty
	  memory. I know you have been saying that the IO schedulers
	  should somehow take care of it, but IO schedulers provide only
	  so much protection against WRITEs. On top of that, the
	  protection is not predictable. CFQ still provides good
	  protection against WRITEs, but what about deadline and noop?
	  There the spikes will surely lead to less predictable IO
	  latencies for READs.

  If we implement throttling for buffered writes at the device level,
  and a feedback mechanism reduces the dirty rate for the cgroup
  automatically, that will take care of both the above issues. The only
  issue we will have to worry about is how to make sure that a high prio
  IO does not get throttled behind low prio IO. For that, file systems
  will have to be more parallel.

  Throttling at the page cache level has the advantage that it has to
  worry less about this serialization.
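
  To make the two interface styles concrete (the mount point
  /cgroup/blkio/grp1 below is assumed, and the buffered knob is only the
  proposal above, not an existing file):

	# existing knob: cap this cgroup's direct/sync writes on 8:32 to 10MB/s
	echo "8:32 10485760" > /cgroup/blkio/grp1/blkio.throttle.write_bps_device

	# proposed (hypothetical) knob: cap its buffered write rate separately
	echo "8:32 10485760" > /cgroup/blkio/grp1/blkio.throttle.buffered_write_bps_device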

So I see the following immediate extensions of your scheme as possible:

- Inherit the ioprio from the iocontext and provide buffered write
  service differentiation for writers.

- Create a per-task buffered write throttling interface and do
  absolute throttling of the task.

- We can possibly implement the idea of a group-wide buffered-write-only
  throttling control at this layer using this mechanism.

Thoughts?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 0/5] IO-less dirty throttling v8
  2011-08-11 20:42       ` Vivek Goyal
@ 2011-08-11 21:00         ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-11 21:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Thu, Aug 11, 2011 at 04:42:55PM -0400, Vivek Goyal wrote:

[..]
> So I see the following immediate extensions of your scheme as possible:
> 
> - Inherit the ioprio from the iocontext and provide buffered write
>   service differentiation for writers.
> 
> - Create a per-task buffered write throttling interface and do
>   absolute throttling of the task.
> 
> - We can possibly implement the idea of a group-wide buffered-write-only
>   throttling control at this layer using this mechanism.

Though personally I like the idea of absolute throttling at the page
cache level, as it can help a bit with the problem of buffered WRITEs
impacting the latency of everything else in the system. CFQ helps a
lot, but it idles enough that the cost of this isolation is very high
on faster storage.

Deadline and noop really do not do much about protection from WRITEs.

So it is not perfect, but it might prove to be good enough for some
use cases.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-11 22:56             ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-11 22:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
> 
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
> 
> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1
> 
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace. 

Ok, so I think that pos_ratio(limit) := 0 is a stronger condition than
your negative slope (df/dx < 0), simply because it implies your
condition and because it expresses our hard stop at the limit.

Also, I know this is totally over the top, but..

I saw you added a ramp and brake area in future patches, so have you
considered using a third order polynomial instead?

The simple:

 f(x) = -x^3 

has the 'right' shape; all we need is to move it so that:

 f(s) = 1

and stretch it to put the single root at our limit. You'd get something
like:

               s - x 3
 f(x) :=  1 + (-----)
                 d

Which, as required, is 1 at our setpoint, and the factor d stretches the
middle bit. It has a single (real) root at:

  x = s + d, 

by setting that to our limit, we get:

  d = l - s

Making our final function look like:

               s - x 3
 f(x) :=  1 + (-----)
               l - s
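
Check: f(s) = 1 + 0 = 1 and f(l) = 1 + ((s - l)/(l - s))^3 = 1 - 1 = 0,
so both required conditions hold by construction.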

You can clamp it at [0,2] or so. The implementation wouldn't be too
horrid either, something like:

unsigned long bdi_pos_ratio(..)
{
	if (dirty > limit)
		return 0;

	if (dirty < 2*setpoint - limit)
		return 2 * SCALE;

	x = SCALE * (setpoint - dirty) / (limit - setpoint);
	xx = (x * x) / SCALE;
	xxx = (xx * x) / SCALE;

	return xxx;
}
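
For reference, here is a self-contained userspace version of that
sketch (a sketch only: SCALE, the argument list and the test numbers
are all made up here). It switches to signed 64-bit math, since with
unsigned types (setpoint - dirty) underflows once dirty > setpoint, and
it adds the leading "1 +" term of f() that the snippet above leaves
implicit:

#include <stdio.h>

#define SCALE (1LL << 10)		/* fixed point: SCALE means 1.0 */

static long long pos_ratio(long long setpoint, long long dirty,
			   long long limit)
{
	long long x, xx, xxx;

	if (dirty >= limit)		/* hard stop: f(limit) = 0 */
		return 0;
	if (dirty <= 2 * setpoint - limit)
		return 2 * SCALE;	/* clamp: f() never exceeds 2.0 */

	/* x = (s - dirty) / (l - s) in fixed point; |x| <= SCALE here */
	x = SCALE * (setpoint - dirty) / (limit - setpoint);
	xx = x * x / SCALE;
	xxx = xx * x / SCALE;		/* ((s - dirty) / (l - s))^3 */

	return SCALE + xxx;		/* f(dirty) = 1 + (...)^3 */
}

int main(void)
{
	long long dirty;

	/* illustrative numbers: setpoint 70000, limit 90000 pages */
	for (dirty = 50000; dirty <= 90000; dirty += 5000)
		printf("%lld -> %.3f\n", dirty,
		       (double)pos_ratio(70000, dirty, 90000) / SCALE);
	return 0;
}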


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 22:56             ` Peter Zijlstra
@ 2011-08-12  2:43               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12  2:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 06:56:06AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > So going by:
> > 
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
> > 
> > pos_ratio seems to be the feedback on the deviation of the dirty pages
> > around its setpoint. So we adjust the reference bw (or rather ratelimit)
> > to take account of the shift in output vs input capacity as well as the
> > shift in dirty pages around its setpoint.
> > 
> > From that we derive the condition that: 
> > 
> >   pos_ratio(setpoint) := 1
> > 
> > Now in order to create a linear function we need one more condition. We
> > get one from the fact that once we hit the limit we should hard throttle
> > our writers. We get that by setting the ratelimit to 0, because, after
> > all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> > 
> >   pos_ratio(limit) := 0
> > 
> > Using these two conditions we can solve the equations and get your:
> > 
> >                         limit - dirty
> >   pos_ratio(dirty) =  ----------------
> >                       limit - setpoint
> > 
> > Now, for some reason you chose not to use limit, but something like
> > min(limit, 4*thresh) something to do with the slope affecting the rate
> > of adjustment. This wants a comment someplace. 
> 
> Ok, so I think that pos_ratio(limit) := 0, is a stronger condition than
> your negative slope (df/dx < 0), simply because it implies your
> condition and because it expresses our hard stop at limit.

Right. That's a good point.

> Also, while I know this is totally over the top, but..
> 
> I saw you added a ramp and brake area in future patches, so have you
> considered using a third order polynomial instead?

No I have not ;)

The 3 lines/curves should be a bit more flexible/configurable than the
single 3rd order polynomial.  However, the 3rd order polynomial is much
simpler and more consistent, as it removes the explicit rampup/brake
areas and curves.

> The simple:
> 
>  f(x) = -x^3 
> 
> has the 'right' shape, all we need is move it so that:
> 
>  f(s) = 1
> 
> and stretch it to put the single root at our limit. You'd get something
> like:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                  d
> 
> Which, as required, is 1 at our setpoint and the factor d stretches the
> middle bit. Which has a single (real) root at: 
> 
>   x = s + d, 
> 
> by setting that to our limit, we get:
> 
>   d = l - s
> 
> Making our final function look like:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                l - s

Very intuitive reasoning, thanks!

I substituted real numbers into the function, assuming a mem=2GB system.

with limit=thresh:

        gnuplot> set xrange [60000:80000]
        gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3

with limit=thresh+thresh/DIRTY_SCOPE:

        gnuplot> set xrange [60000:90000]
        gnuplot> plot 1 +  (70000.0 - x)**3/(90000-70000.0)**3

Figures attached.  The latter produces a reasonably flat slope, and I'll
give it a spin in the dd tests :)
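
(Plugging the endpoints of the second plot into the formula:
f(60000) = 1 + (10000/20000)^3 = 1.125 and f(90000) = 1 + (-1)^3 = 0,
so over the plotted range the curve falls gently from just above 1.0
down to 0.)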
 
> You can clamp it at [0,2] or so.

Looking at the figures, we may even do without the clamp, because the
curve already stays inside the range [0, 2].

> The implementation wouldn't be too horrid either, something like:
> 
> unsigned long bdi_pos_ratio(..)
> {
> 	if (dirty > limit)
> 		return 0;
> 
> 	if (dirty < 2*setpoint - limit)
> 		return 2 * SCALE;
> 
> 	x = SCALE * (setpoint - dirty) / (limit - setpoint);
> 	xx = (x * x) / SCALE;
> 	xxx = (xx * x) / SCALE;
> 
> 	return xxx;
> }

Looks very neat, much simpler than the three curves solution!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
  (?)
@ 2011-08-12  3:18               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12  3:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1306 bytes --]

Sorry, I forgot the 2 gnuplot figures; attached now.

> > Making our final function look like:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 
> Very intuitive reasoning, thanks!
> 
> I substituted real numbers to the function assuming a mem=2GB system.
> 
> with limit=thresh:
> 
>         gnuplot> set xrange [60000:80000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3
> 
> with limit=thresh+thresh/DIRTY_SCOPE
> 
>         gnuplot> set xrange [60000:90000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(90000-70000.0)**3
> 
> Figures attached.  The latter produces reasonably flat slope and I'll
> give it a spin in the dd tests :)
>  
> > You can clamp it at [0,2] or so.
> 
> Looking at the figures, we may even do without the clamp because it's
> already inside the range [0, 2].
> 
> > The implementation wouldn't be too horrid either, something like:
> > 
> > unsigned long bdi_pos_ratio(..)
> > {
> > 	if (dirty > limit)
> > 		return 0;
> > 
> > 	if (dirty < 2*setpoint - limit)
> > 		return 2 * SCALE;
> > 
> > 	x = SCALE * (setpoint - dirty) / (limit - setpoint);
> > 	xx = (x * x) / SCALE;
> > 	xxx = (xx * x) / SCALE;
> > 
> > 	return xxx;
> > }
> 
> Looks very neat, much simpler than the three curves solution!
> 
> Thanks,
> Fengguang

[-- Attachment #2: 3rd-order-limit=thresh+halfscope.png --]
[-- Type: image/png, Size: 30247 bytes --]

[-- Attachment #3: 3rd-order-limit=thresh.png --]
[-- Type: image/png, Size: 28785 bytes --]

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
@ 2011-08-12  5:45                 ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12  5:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > Making our final function look like:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 
> Very intuitive reasoning, thanks!
> 
> I substituted real numbers to the function assuming a mem=2GB system.
> 
> with limit=thresh:
> 
>         gnuplot> set xrange [60000:80000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3

I'll use the above one, which is simpler and more elegant:

        f(freerun)  = 2.0
        f(setpoint) = 1.0
        f(limit)    = 0

Code is

        unsigned long freerun = (thresh + bg_thresh) / 2;

        setpoint = (limit + freerun) / 2;
        pos_ratio = abs(dirty - setpoint);
        pos_ratio <<= BANDWIDTH_CALC_SHIFT;
        do_div(pos_ratio, limit - setpoint + 1);
        x = pos_ratio;
        pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
        pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
        if (dirty > setpoint)
                pos_ratio = -pos_ratio;
        pos_ratio += 1 << BANDWIDTH_CALC_SHIFT;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  5:45                 ` Wu Fengguang
  (?)
@ 2011-08-12  9:45                   ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote:
> Code is
> 
>         unsigned long freerun = (thresh + bg_thresh) / 2;
> 
>         setpoint = (limit + freerun) / 2;
>         pos_ratio = abs(dirty - setpoint);
>         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>         do_div(pos_ratio, limit - setpoint + 1);

Why do you use do_div()? From the code those things are unsigned long,
and you can divide them just fine.

Also, there's div64_s64 that can do signed divides for s64 types.
That'll lose the extra conditionals you used for taking abs and putting
the sign back.

>         x = pos_ratio;
>         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
>         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;

So on 32bit with unsigned long, the x*x product needs 32 = 2*(10+b)
bits, which solves to b = 6 integer bits for x. That isn't going to be
enough, I figure, since (dirty - setpoint) is not necessarily smaller
than 64 * (limit - setpoint).

So you really need to use u64/s64 types here, unsigned long just won't
do; with u64 you have 64 = 2*(10+b), i.e. 22 bits for x, which should
fit.


>         if (dirty > setpoint)
>                 pos_ratio = -pos_ratio;
>         pos_ratio += 1 << BANDWIDTH_CALC_SHIFT; 



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
  (?)
@ 2011-08-12  9:47                 ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 

> Looks very neat, much simpler than the three curves solution!

Glad you like it, there is of course the small matter of real-world
behaviour to consider; let's hope that works as well :-)

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  9:45                   ` Peter Zijlstra
@ 2011-08-12 11:07                     ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 05:45:33PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote:
> > Code is
> > 
> >         unsigned long freerun = (thresh + bg_thresh) / 2;
> > 
> >         setpoint = (limit + freerun) / 2;
> >         pos_ratio = abs(dirty - setpoint);
> >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >         do_div(pos_ratio, limit - setpoint + 1);
> 
> Why do you use do_div()? From the code those things are unsigned long,
> and you can divide them just fine.

Because pos_ratio was "unsigned long long"..

> Also, there's div64_s64 that can do signed divides for s64 types.
> That'll lose the extra conditionals you used for abs and putting the
> sign back.

Ah ok, good to know that :)

> >         x = pos_ratio;
> >         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
> >         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
> 
> So on 32bit with unsigned long that gives 32 = 2*(10+b) bits, which
> solves to b = 6 bits for x; that isn't going to be enough I figure,
> since (dirty - setpoint) !< 64.
> 
> So you really need to use u64/s64 types here, unsigned long just won't
> do; with u64 you have 64 = 2*(10+b), i.e. b = 22 bits for x, which
> should fit.

Sure, here is the updated code:

        long long pos_ratio;            /* for scaling up/down the rate limit */
        long x;
       
        if (unlikely(dirty >= limit))
                return 0;

        /*
         * global setpoint
         *
         *                  setpoint - dirty 3
         * f(dirty) := 1 + (----------------)
         *                  limit - setpoint
         *
         * it's a 3rd order polynomial subject to
         *
         * (1) f(freerun)  = 2.0 => ramp up base_rate reasonably fast
         * (2) f(setpoint) = 1.0 => the balance point
         * (3) f(limit)    = 0   => the hard limit
         * (4) df/dx < 0         => negative feedback control
         * (5) the closer to setpoint, the smaller |df/dx| (and the reverse),
         *     => fast response on large errors; small oscillation near setpoint
         */
        setpoint = (limit + freerun) / 2;
        pos_ratio = (setpoint - dirty) << RATELIMIT_CALC_SHIFT;
        pos_ratio = div_s64(pos_ratio, limit - setpoint + 1);
        x = pos_ratio;
        pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
        pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
        pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
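
(As a sanity check, here is a minimal userspace sketch of the same
fixed-point math -- with RATELIMIT_CALC_SHIFT assumed to be 10 and the
freerun/limit page counts made up for illustration; it only mirrors the
arithmetic above and is not the kernel code:)

#include <stdio.h>
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 * evaluated with 10 fractional bits; like the kernel expression, this
 * relies on arithmetic right shift of negative signed values.
 */
static int64_t pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint = (limit + freerun) / 2;
	int64_t p, x;

	if (dirty >= limit)
		return 0;

	p = (setpoint - dirty) << RATELIMIT_CALC_SHIFT;
	p /= limit - setpoint + 1;
	x = p;
	p = p * x >> RATELIMIT_CALC_SHIFT;
	p = p * x >> RATELIMIT_CALC_SHIFT;	/* p is now x^3 */
	return p + (1 << RATELIMIT_CALC_SHIFT);
}

int main(void)
{
	int64_t freerun = 1000, limit = 3000, dirty;

	/* expect ~2.0 at freerun, 1.0 at the setpoint (2000), ~0 at limit */
	for (dirty = freerun; dirty <= limit; dirty += 500)
		printf("dirty=%4lld pos_ratio=%.3f\n", (long long)dirty,
		       pos_ratio(freerun, limit, dirty) /
		       (double)(1 << RATELIMIT_CALC_SHIFT));
	return 0;
}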

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  9:47                 ` Peter Zijlstra
@ 2011-08-12 11:11                   ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 05:47:54PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> > >                s - x 3
> > >  f(x) :=  1 + (-----)
> > >                l - s
> > 
> 
> > Looks very neat, much simpler than the three curves solution!
> 
> Glad you like it, there is of course the small matter of real-world
> behaviour to consider, lets hope that works as well :-)

It magically meets all the criteria in my mind, not to mention it can
eliminate 2 extra patches. As for the tests, so far, so good :)

Your arithmetic is awesome!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 11:07                     ` Wu Fengguang
  (?)
@ 2011-08-12 12:17                       ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-12 12:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 19:07 +0800, Wu Fengguang wrote:
> Because pos_ratio was "unsigned long long"..

Ah! totally missed that ;-)

Yes looks good.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-12 12:54             ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-12 12:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                  d
> 
btw, if you want steeper slopes for ramp-up and braking you can add
another factor like:

                 s - x 3
  f(x) :=  1 + a(-----)
                   d
 
And solve the whole f(l)=0 thing again to determine d in l and a.

For 0 < a < 1 the slopes increase.
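
(For reference, a quick sketch of that solve, treating a as given:

	f(l) = 1 + a((s - l)/d)^3 = 0
	  =>  d^3 = a(l - s)^3
	  =>  d   = cbrt(a) * (l - s)

so d is expressed in l, s and a as suggested.)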

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:54             ` Peter Zijlstra
@ 2011-08-12 12:59               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                  d
> > 
> btw, if you want steeper slopes for rampup and brake you can add another
> factor like:
> 
>                  s - x 3
>   f(x) :=  1 + a(-----)
>                    d
>  
> And solve the whole f(l)=0 thing again to determine d in l and a.
> 
> For 0 < a < 1 the slopes increase.

Yes, we can leave it as a future tuning option. For now I'm pretty
satisfied with the current function's shape :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-12 13:04             ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> Now all of the above would seem to suggest:
> 
>   dirty_ratelimit := ref_bw
> 
> However for that you use:
> 
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
>         dirty_ratelimit = max(ref_bw, pos_bw);
> 
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
>         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> You have:
> 
>   pos_bw = dirty_ratelimit * pos_ratio
> 
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there? 

Could you elaborate on this primary feedback loop? It's the one part I
don't feel I actually understand well.



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:59               ` Wu Fengguang
  (?)
@ 2011-08-12 13:08                 ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 20:59 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > > 
> > >                s - x 3
> > >  f(x) :=  1 + (-----)
> > >                  d
> > > 
> > btw, if you want steeper slopes for rampup and brake you can add another
> > factor like:
> > 
> >                  s - x 3
> >   f(x) :=  1 + a(-----)
> >                    d
> >  
> > And solve the whole f(l)=0 thing again to determine d in l and a.
> > 
> > For 0 < a < 1 the slopes increase.
> 
> Yes, we can leave it as a future tuning option. For now I'm pretty
> satisfied with the current function's shape :)

Oh for sure, it just occurred to me when looking at your plots and
thought I'd at least mention it.. You know something to poke at on a
rainy afternoon ;-)

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20             ` Peter Zijlstra
@ 2011-08-12 13:19               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12 13:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 01:20:27AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                     origin - dirty
> > >         pos_ratio = --------------
> > >                     origin - goal 
> > 
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> > 
> > OK, so basically you want a linear function for which:
> > 
> > f(goal) = 1 and has a root somewhere > goal.
> > 
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> > 
> > That does indeed get you the above function, now what does it mean? 
> 
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
> 
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint.

Yes.

> So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.

However the above function is better interpreted as

                                            write_bw
    ref_bw = task_ratelimit_in_past_200ms * --------
                                            dirty_bw

where
        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

It would be highly confusing to try to find direct "logical"
relationships between ref_bw and pos_ratio in the above equation.
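
For example, with made-up numbers: if tasks were throttled at
task_ratelimit_in_past_200ms = 20MB/s while the measured dirty_bw was
40MB/s and write_bw was 20MB/s, then

	ref_bw = 20MB/s * (20 / 40) = 10MB/s

i.e. the throttle rate is halved to match what the disk actually
completed.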

> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1

Right.

> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.

Thanks to your reasoning, which led to the more elegant

                            setpoint - dirty 3
   pos_ratio(dirty) := 1 + (----------------)
                            limit - setpoint

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 13:04             ` Peter Zijlstra
@ 2011-08-12 14:20               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-12 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Peter,

Sorry for the delay..

On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:

To start with,

                                                write_bw
        ref_bw = task_ratelimit_in_past_200ms * --------
                                                dirty_bw

where
        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

> > Now all of the above would seem to suggest:
> > 
> >   dirty_ratelimit := ref_bw

Right, ideally ref_bw is the balanced dirty ratelimit. I actually
started with exactly the above equation when I got choked by pure
pos_bw based feedback control (as mentioned in the reply to Jan's
email) and introduced the ref_bw estimation as the way out.

But there are some imperfections in ref_bw, too, which make it
unsuitable for direct use:

1) large fluctuations

The dirty_bw used for computing ref_bw is averaged over merely the
past 200ms (very short compared to the 3s estimation period for
write_bw), which makes for a rather dispersed distribution of ref_bw.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/ext4-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:48/balance_dirty_pages-pages.png

Take a look at the blue [*] points in the above graph. I find it pretty
hard to average out the singular points by increasing the estimation
period. Considering that the averaging technique would introduce very
undesirable time lags, I gave up on it entirely. (btw, the write_bw
averaging time lag is much more acceptable because its impact is
one-way and therefore won't lead to oscillations.)

The one practical way is filtering -- the largest singular ref_bw
points can be filtered out effectively by remembering some prev_ref_bw
and prev_prev_ref_bw. However, that cannot do away with all of them,
and the remaining majority of ref_bw points still dance randomly
around the ideal balanced rate.

2) due to truncates and fs redirties, (write_bw <=> dirty_bw) becomes
an unbalanced match, which leads to large systematic errors in ref_bw.
The truncates, due to their possibly bumpy nature, can hardly be
compensated for smoothly. So let's face it: when some over-estimated
ref_bw pushes ->dirty_ratelimit too high, the dirty pages will rise
above the setpoint and pos_bw will in turn drop below
->dirty_ratelimit. So if we consider both ref_bw and pos_bw and update
->dirty_ratelimit only when they are on the same side of
->dirty_ratelimit, the systematic errors in ref_bw won't be able to
drag ->dirty_ratelimit too far away.

The ref_bw estimation is also inaccurate near the max pause and
free run areas.

3) since we ultimately want to

- keep the dirty pages around the setpoint for as long as possible
- keep the fluctuations of task ratelimit as small as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
point bringing up dirty_ratelimit in a hurry, which would only hurt
both of the above goals.

> > However for that you use:
> > 
> >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> >         dirty_ratelimit = max(ref_bw, pos_bw);
> > 
> >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> >         dirty_ratelimit = min(ref_bw, pos_bw);

The above are merely constraints on the dirty_ratelimit update.
They serve to

1) stop adjusting the rate when it's against the position control
   target (the adjusted rate would slow down the progress of dirty
   pages going back to the setpoint).

2) limit the step size. pos_bw changes value step by step, leaving a
   consistent trace compared to the randomly jumping ref_bw. pos_bw
   also has smaller errors in the stable state and normally larger
   errors when there are big errors in rate. So it's a pretty good
   limiting factor for the step size of dirty_ratelimit; a toy sketch
   follows below.
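
(A toy C sketch of that update policy -- plain userspace code with
made-up rates, not the kernel implementation; it only demonstrates the
"move toward the balanced rate only when both estimates agree" idea:)

#include <stdio.h>

static long lmin(long a, long b) { return a < b ? a : b; }
static long lmax(long a, long b) { return a > b ? a : b; }

/*
 * Step the rate toward the balanced value only when pos_bw and ref_bw
 * agree on the direction, and never step past either of them.
 */
static long update_ratelimit(long rate, long pos_bw, long ref_bw)
{
	if (pos_bw < rate && ref_bw < rate)
		rate = lmax(ref_bw, pos_bw);
	if (pos_bw > rate && ref_bw > rate)
		rate = lmin(ref_bw, pos_bw);
	return rate;
}

int main(void)
{
	/* rate too high: both estimates below it => step down, but only
	 * as far as the nearer (larger) one */
	printf("%ld\n", update_ratelimit(100, 80, 60));		/* 80 */
	/* estimates disagree: the update is blocked entirely */
	printf("%ld\n", update_ratelimit(100, 120, 60));	/* 100 */
	return 0;
}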

> > You have:
> > 
> >   pos_bw = dirty_ratelimit * pos_ratio
> > 
> > Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> > why are you ignoring the shift in output vs input rate there? 

Again, you need to understand pos_bw the other way.  Only (pos_bw -
dirty_ratelimit) matters here, which is exactly the position error.

> Could you elaborate on this primary feedback loop? It's the one part I
> don't feel I actually understand well.

Hope the above elaboration helps :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-08 14:23       ` Wu Fengguang
@ 2011-08-13 16:28         ` Andrea Righi
  -1 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-13 16:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:23:18PM +0800, Wu Fengguang wrote:
> On Mon, Aug 08, 2011 at 09:47:14PM +0800, Peter Zijlstra wrote:
> > On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > > Add two fields to task_struct.
> > > 
> > > 1) account dirtied pages in the individual tasks, for accuracy
> > > 2) per-task balance_dirty_pages() call intervals, for flexibility
> > > 
> > > The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
> > > scale near-sqrt to the safety gap between dirty pages and threshold.
> > > 
> > > XXX: The main problem of per-task nr_dirtied is, if 10k tasks start
> > > dirtying pages at exactly the same time, each task will be assigned a
> > > large initial nr_dirtied_pause, so that the dirty threshold will be
> > > exceeded long before each task reached its nr_dirtied_pause and hence
> > > call balance_dirty_pages().
> > > 
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  include/linux/sched.h |    7 ++
> > >  mm/memory_hotplug.c   |    3 -
> > >  mm/page-writeback.c   |  106 +++++++++-------------------------------
> > >  3 files changed, 32 insertions(+), 84 deletions(-) 
> > 
> > No fork() hooks? This way tasks inherit their parent's dirty count on
> > clone().
> 
> btw, I do have another patch queued for improving the "leaked dirties
> on exit" case :)
> 
> Thanks,
> Fengguang
> ---
> Subject: writeback: charge leaked page dirties to active tasks
> Date: Tue Apr 05 13:21:19 CST 2011
> 
> It's a years-long problem that a large number of short-lived dirtiers
> (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> (eg. dd) as well as push the dirty pages up to the global hard limit.
> 
> The solution is to charge the pages dirtied by the exited gcc to the
> other random gcc/dd instances. It's not perfect, but it should behave
> well enough in practice.
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/writeback.h |    2 ++
>  kernel/exit.c             |    2 ++
>  mm/page-writeback.c       |   11 +++++++++++
>  3 files changed, 15 insertions(+)
> 
> --- linux-next.orig/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-08 21:45:58.000000000 +0800
> @@ -7,6 +7,8 @@
>  #include <linux/sched.h>
>  #include <linux/fs.h>
>  
> +DECLARE_PER_CPU(int, dirty_leaks);
> +
>  /*
>   * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
>   *
> --- linux-next.orig/mm/page-writeback.c	2011-08-08 21:45:58.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-08 22:21:50.000000000 +0800
> @@ -190,6 +190,7 @@ int dirty_ratio_handler(struct ctl_table
>  	return ret;
>  }
>  
> +DEFINE_PER_CPU(int, dirty_leaks) = 0;
>  
>  int dirty_bytes_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *lenp,
> @@ -1150,6 +1151,7 @@ void balance_dirty_pages_ratelimited_nr(
>  {
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	int ratelimit;
> +	int *p;
>  
>  	if (!bdi_cap_account_dirty(bdi))
>  		return;
> @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
>  	if (bdi->dirty_exceeded)
>  		ratelimit = 8;
>  
> +	preempt_disable();
> +	p = &__get_cpu_var(dirty_leaks);
> +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +		*p -= nr_pages_dirtied;
> +		current->nr_dirtied += nr_pages_dirtied;
> +	}
> +	preempt_enable();
> +

I think we are still leaking some dirty pages: when the condition is
false, nr_pages_dirtied is just ignored.

Why not doing something like this?

	current->nr_dirtied += nr_pages_dirtied;
	if (current->nr_dirtied < ratelimit) {
		p = &get_cpu_var(dirty_leaks);
		if (*p > 0) {
			nr_pages_dirtied = min(*p, ratelimit -
							current->nr_dirtied);
			*p -= nr_pages_dirtied;
		} else
			nr_pages_dirtied = 0;
		put_cpu_var(dirty_leaks);

		current->nr_dirtied += nr_pages_dirtied;
	}
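
(To see the intended flow end to end, a toy userspace C model of the
leak pool -- a single global counter instead of the per-CPU one, a
made-up ratelimit, and none of the preemption handling; just the
accounting idea, not the kernel code:)

#include <stdio.h>

static int dirty_leaks;		/* stand-in for the per-CPU counter */

/* an exiting task donates its not-yet-balanced dirty count */
static void task_exit(int nr_dirtied)
{
	dirty_leaks += nr_dirtied;
}

/* a live dirtier absorbs leaked pages, up to its ratelimit */
static int absorb(int nr_dirtied, int ratelimit)
{
	if (dirty_leaks > 0 && nr_dirtied < ratelimit) {
		int take = ratelimit - nr_dirtied;

		if (take > dirty_leaks)
			take = dirty_leaks;
		dirty_leaks -= take;
		nr_dirtied += take;
	}
	return nr_dirtied;
}

int main(void)
{
	task_exit(30);		/* e.g. a gcc exits with 30 uncharged pages */
	/* a dd that has dirtied 100 pages absorbs up to the 128 limit */
	printf("dd nr_dirtied: %d\n", absorb(100, 128));	/* 128 */
	printf("pool left:     %d\n", dirty_leaks);		/* 2 */
	return 0;
}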

Thanks,
-Andrea

>  	if (unlikely(current->nr_dirtied >= ratelimit))
>  		balance_dirty_pages(mapping, current->nr_dirtied);
>  }
> --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
>  	validate_creds_for_do_exit(tsk);
>  
>  	preempt_disable();
> +	if (tsk->nr_dirtied)
> +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
>  	exit_rcu();
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 16:17         ` Peter Zijlstra
@ 2011-08-15 14:08           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 12:17:55AM +0800, Peter Zijlstra wrote:
> How about something like the below? It still needs some more work, but
> it's more or less complete in that it now explains both controls in
> one story.

Looks pretty good, thanks!  I'll post the completed version at the
bottom.

> ---
> 
> balance_dirty_pages() needs to throttle tasks dirtying pages such that
> the total amount of dirty pages stays below the specified dirty limit in
> order to avoid memory deadlocks. Furthermore we desire fairness in that
> tasks get throttled proportionally to the amount of pages they dirty.
> 
> IOW we want to throttle tasks such that we match the dirty rate to the
> writeout bandwidth, this yields a stable amount of dirty pages:
> 
> 	ratelimit = writeout_bandwidth
> 
> The fairness requirements gives us:
> 
> 	task_ratelimit = write_bandwidth / N
> 
> > : When started N dd, we would like to throttle each dd at
> > : 
> > :          balanced_rate == write_bw / N                                  (1)
> > : 
> > : We don't know N beforehand, but still can estimate balanced_rate
> > : within 200ms.
> > : 
> > : Start by throttling each dd task at rate
> > : 
> > :         task_ratelimit = task_ratelimit_0                               (2)
> > :                          (any non-zero initial value is OK)
> > : 
> > : After 200ms, we got
> > : 
> > :         dirty_rate = # of pages dirtied by all dd's / 200ms
> > :         write_bw   = # of pages written to the disk / 200ms
> > : 
> > : For the aggressive dd dirtiers, the equality holds
> > : 
> > :         dirty_rate == N * task_rate
> > :                    == N * task_ratelimit
> > :                    == N * task_ratelimit_0                              (3)
> > : Or
> > :         task_ratelimit_0 = dirty_rate / N                               (4)
> > :                           
> > : So the balanced throttle bandwidth can be estimated by
> > :                           
> > :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> > :                           
> > : Because with (4) and (5) we can get the desired equality (1):
> > :                           
> > :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > :                       == write_bw / N
> 
> Then using the balance_rate we can compute task pause times like:
> 
> 	task_pause = task->nr_dirtied / task_ratelimit
> 
> [ however all that still misses the primary feedback of:
> 
>    task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)
> 
>   there's still some confusion in the above due to task_ratelimit and
>   balanced_rate.
> ]
> 
> However, while the above gives us means of matching the dirty rate to
> the writeout bandwidth, it at best provides us with a stable dirty page
> count (assuming a static system). In order to control the dirty page
> count such that it is high enough to provide performance, but does not
> exceed the specified limit we need another control.
> 
> > So if the dirty pages are ABOVE the setpoint, we throttle each task
> > a bit more HEAVILY than balanced_rate, so that the dirty pages are
> > created more slowly than they are cleaned, and thus DROP back to the
> > setpoint (and the reverse). With that positional adjustment, the
> > formula is transformed from
> > 
> >         task_ratelimit = balanced_rate
> > 
> > to
> > 
> >         task_ratelimit = balanced_rate * pos_ratio
> 
> > In terms of the negative feedback control theory, the
> > bdi_position_ratio() function (control lines) can be expressed as
> > 
> > 1) f(setpoint) = 1.0
> > 2) df/dt < 0
> > 
> > 3) optionally, abs(df/dt) should be large on large errors (= dirty -
> >    setpoint) in order to cancel the errors fast, and be smaller when
> >    dirty pages get closer to the setpoints in order to avoid overshooting.
> 
> 

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
                         (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:
        
        task_pause = task->nr_dirtied / task_ratelimit
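
As a minimal sketch of the estimation above (illustrative userspace C,
not the kernel code; all names here are made up for the example):

	/* One 200ms sampling period of the estimator in (3)-(6). */
	static double next_task_ratelimit(double task_ratelimit,
					  double pages_dirtied,
					  double pages_written,
					  double period)
	{
		double dirty_rate = pages_dirtied / period;
		double write_bw   = pages_written / period;

		/* (6): scale the previous limit by write_bw / dirty_rate */
		return task_ratelimit * (write_bw / dirty_rate);
	}

	/* sleep long enough to pay for the pages just dirtied */
	static double task_pause(double nr_dirtied, double task_ratelimit)
	{
		return nr_dirtied / task_ratelimit;
	}

With N aggressive dirtiers, dirty_rate == N * task_ratelimit, so the
returned rate is write_bw / N whatever the initial value was.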

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) into

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extending (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that is subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_rate, so that the dirty pages are
created more slowly than they are cleaned and thus DROP back to the
setpoint (and the reverse).
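
For illustration, the simplest pos_ratio satisfying 1) and 2) is the
linear control line discussed elsewhere in this thread (a sketch in C;
the variable names are mine, not the kernel's):

	/*
	 * Linear negative feedback: f(goal) = 1.0 and df/ddirty < 0,
	 * reaching 0 when dirty == origin (with origin > goal).
	 */
	static double pos_ratio(double dirty, double goal, double origin)
	{
		return (origin - dirty) / (origin - goal);
	}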

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged over
the past 200ms (very short compared to the 3s estimation period for
write_bw), which makes for a rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I gave it up entirely. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematic
errors in balanced_rate. Truncates, due to their possibly bumpy
nature, can hardly be compensated for smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematic errors in balanced_rate won't be able
to drag dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate near the max
pause and free run areas, however that is less of an issue.

3) since we ultimately want to

- keep the fluctuations of the task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit)
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point in bringing up dirty_ratelimit in a hurry, which would only
hurt both of the above goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing the dirty rate when it's against the position control
   target (the adjusted rate would slow down the progress of dirty pages
   going back to the setpoint).

2) limit the step size. pos_rate changes value step by step,
   leaving a consistent trace compared to the randomly jumping
   balanced_rate. pos_rate also has the nice property of smaller errors
   in the stable state and typically larger errors when there are big
   errors in rate. So it's a pretty good limiting factor for the step
   size of dirty_ratelimit.
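
A sketch of the resulting update step (modelled on the two constraints
above and on pseudocode quoted later in this thread; not the final
kernel code):

	#include <math.h>

	/*
	 * Only move dirty_ratelimit when pos_rate and balanced_rate
	 * agree on the direction, and let the nearer of the two bound
	 * the step size.
	 */
	static double update_dirty_ratelimit(double dirty_ratelimit,
					     double pos_rate,
					     double balanced_rate)
	{
		if (pos_rate < dirty_ratelimit && balanced_rate < dirty_ratelimit)
			return fmax(balanced_rate, pos_rate);
		if (pos_rate > dirty_ratelimit && balanced_rate > dirty_ratelimit)
			return fmin(balanced_rate, pos_rate);
		return dirty_ratelimit;	/* disagreement: hold steady */
	}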

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-15 14:08           ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 12:17:55AM +0800, Peter Zijlstra wrote:
> How about something like the below, it still needs some more work, but
> its more or less complete in that is now explains both controls in one
> story. The actual update bit is still missing.

Looks pretty good, thanks!  I'll post the completed version at the
bottom.

> ---
> 
> balance_dirty_pages() needs to throttle tasks dirtying pages such that
> the total amount of dirty pages stays below the specified dirty limit in
> order to avoid memory deadlocks. Furthermore we desire fairness in that
> tasks get throttled proportionally to the amount of pages they dirty.
> 
> IOW we want to throttle tasks such that we match the dirty rate to the
> writeout bandwidth, this yields a stable amount of dirty pages:
> 
> 	ratelimit = writeout_bandwidth
> 
> The fairness requirement gives us:
> 
> 	task_ratelimit = write_bandwidth / N
> 
> > : When started N dd, we would like to throttle each dd at
> > : 
> > :          balanced_rate == write_bw / N                                  (1)
> > : 
> > : We don't know N beforehand, but still can estimate balanced_rate
> > : within 200ms.
> > : 
> > : Start by throttling each dd task at rate
> > : 
> > :         task_ratelimit = task_ratelimit_0                               (2)
> > :                          (any non-zero initial value is OK)
> > : 
> > : After 200ms, we got
> > : 
> > :         dirty_rate = # of pages dirtied by all dd's / 200ms
> > :         write_bw   = # of pages written to the disk / 200ms
> > : 
> > : For the aggressive dd dirtiers, the equality holds
> > : 
> > :         dirty_rate == N * task_rate
> > :                    == N * task_ratelimit
> > :                    == N * task_ratelimit_0                              (3)
> > : Or
> > :         task_ratelimit_0 = dirty_rate / N                               (4)
> > :                           
> > : So the balanced throttle bandwidth can be estimated by
> > :                           
> > :         balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)      (5)
> > :                           
> > : Because with (4) and (5) we can get the desired equality (1):
> > :                           
> > :         balanced_rate == (dirty_rate / N) * (write_bw / dirty_rate)
> > :                       == write_bw / N
> 
> Then using the balance_rate we can compute task pause times like:
> 
> 	task_pause = task->nr_dirtied / task_ratelimit
> 
> [ however all that still misses the primary feedback of:
> 
>    task_ratelimit_(i+1) = task_ratelimit_i * (write_bw / dirty_rate)
> 
>   there's still some confusion in the above due to task_ratelimit and
>   balanced_rate.
> ]
> 
> However, while the above gives us means of matching the dirty rate to
> the writeout bandwidth, it at best provides us with a stable dirty page
> count (assuming a static system). In order to control the dirty page
> count such that it is high enough to provide performance, but does not
> exceed the specified limit we need another control.
> 
> > So if the dirty pages are ABOVE the setpoints, we throttle each task
> > a bit more HEAVILY than balanced_rate, so that the dirty pages are
> > created more slowly than they are cleaned, thus DROPPING to the setpoints
> > (and the reverse). With that positional adjustment, the formula is
> > transformed from
> > 
> >         task_ratelimit = balanced_rate
> > 
> > to
> > 
> >         task_ratelimit = balanced_rate * pos_ratio
> 
> > In terms of the negative feedback control theory, the
> > bdi_position_ratio() function (control lines) can be expressed as
> > 
> > 1) f(setpoint) = 1.0
> > 2) df/dt < 0
> > 
> > 3) optionally, abs(df/dt) should be large on large errors (= dirty -
> >    setpoint) in order to cancel the errors fast, and be smaller when
> >    dirty pages get closer to the setpoints in order to avoid overshooting.
> 
> 

Estimation of balanced bdi->dirty_ratelimit
===========================================

balanced task_ratelimit
-----------------------

balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

	ratelimit = write_bw						(1)

The fairness requirement gives us:

        task_ratelimit = write_bw / N					(2)

where N is the number of dd tasks.  We don't know N beforehand, but
still can estimate the balanced task_ratelimit within 200ms.

Start by throttling each dd task at rate

        task_ratelimit = task_ratelimit_0				(3)
                         (any non-zero initial value is OK)

After 200ms, we measured

        dirty_rate = # of pages dirtied by all dd's / 200ms
        write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

	dirty_rate == N * task_rate
                   == N * task_ratelimit
                   == N * task_ratelimit_0            			(4)
Or
	task_ratelimit_0 = dirty_rate / N            			(5)

Now we conclude that the balanced task ratelimit can be estimated by

        task_ratelimit = task_ratelimit_0 * (write_bw / dirty_rate)	(6)

Because with (4) and (5) we can get the desired equality (1):

	task_ratelimit == (dirty_rate / N) * (write_bw / dirty_rate)
	       	       == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:
        
        task_pause = task->nr_dirtied / task_ratelimit

task_ratelimit with position control
------------------------------------

However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.

The dirty position control works by splitting (6) into

        task_ratelimit = balanced_rate					(7)
        balanced_rate = task_ratelimit_0 * (write_bw / dirty_rate)	(8)

and extending (7) to

        task_ratelimit = balanced_rate * pos_ratio			(9)

where pos_ratio is a negative feedback function that is subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_rate, so that the dirty pages are
created more slowly than they are cleaned and thus DROP back to the
setpoint (and the reverse).

bdi->dirty_ratelimit update policy
----------------------------------

The balanced_rate calculated by (8) is not suitable for direct use (*).
For the reasons listed below, (9) is further transformed into

	task_ratelimit = dirty_ratelimit * pos_ratio			(10)

where dirty_ratelimit will be tracking balanced_rate _conservatively_.

---
(*) There are some imperfections in balanced_rate, which make it not
suitable for direct use:

1) large fluctuations

The dirty_rate used for computing balanced_rate is merely averaged over
the past 200ms (very short compared to the 3s estimation period for
write_bw), which makes for a rather dispersed distribution of balanced_rate.

It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I gave it up entirely. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)

The more practical way is filtering -- most singular balanced_rate
points can be filtered out by remembering some prev_balanced_rate and
prev_prev_balanced_rate. However the more reliable way is to guard
balanced_rate with pos_rate.

2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematic
errors in balanced_rate. Truncates, due to their possibly bumpy
nature, can hardly be compensated for smoothly. So let's face it. When some
over-estimated balanced_rate brings dirty_ratelimit high, dirty pages
will go higher than the setpoint. pos_rate will in turn become lower
than dirty_ratelimit.  So if we consider both balanced_rate and pos_rate
and update dirty_ratelimit only when they are on the same side of
dirty_ratelimit, the systematic errors in balanced_rate won't be able
to drag dirty_ratelimit far away.

The balanced_rate estimation may also be inaccurate near the max
pause and free run areas, however that is less of an issue.

3) since we ultimately want to

- keep the fluctuations of the task ratelimit as small as possible
- keep the dirty pages around the setpoint for as long as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_rate < dirty_ratelimit)
and dirty_ratelimit is low (dirty_ratelimit < balanced_rate), there is
no point in bringing up dirty_ratelimit in a hurry, which would only
hurt both of the above goals.

In summary, the dirty_ratelimit update policy consists of two constraints:

1) avoid changing the dirty rate when it's against the position control
   target (the adjusted rate would slow down the progress of dirty pages
   going back to the setpoint).

2) limit the step size. pos_rate changes value step by step,
   leaving a consistent trace compared to the randomly jumping
   balanced_rate. pos_rate also has the nice property of smaller errors
   in the stable state and typically larger errors when there are big
   errors in rate. So it's a pretty good limiting factor for the step
   size of dirty_ratelimit.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
  2011-08-10 17:10           ` Peter Zijlstra
@ 2011-08-15 14:11             ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 01:10:26AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> > 
> > > Although I'm not quite sure how he keeps fairness in light of the
> > > sleep time bounding to MAX_PAUSE.
> > 
> > Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> > high (dirty exceeded).  Secondly, the dirty exceeded state is global
> > to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> > So the fairness is still maintained in dirty exceeded state. 
> 
> Its not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
> but having everybody sleep MAX_PAUSE doesn't necessarily mean its fair,
> its only fair if they dirty at the same rate.

Yeah, I forgot to mention that: when dirty_exceeded, the tasks will
typically sleep for MAX_PAUSE on every 8 pages, resulting in the
same dirty rate :)
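
For example (numbers purely illustrative): if MAX_PAUSE were 200ms,
every dirty-exceeded task would be capped at 8 pages / 200ms = 40
pages/s, whatever its uncapped dirtying speed, so all tasks converge
to the same rate.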

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 3/5] writeback: dirty rate control
@ 2011-08-15 14:11             ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 01:10:26AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-10 at 22:00 +0800, Wu Fengguang wrote:
> > 
> > > Although I'm not quite sure how he keeps fairness in light of the
> > > sleep time bounding to MAX_PAUSE.
> > 
> > Firstly, MAX_PAUSE will only be applied when the dirty pages rush
> > high (dirty exceeded).  Secondly, the dirty exceeded state is global
> > to all tasks, in which case each task will sleep for MAX_PAUSE equally.
> > So the fairness is still maintained in dirty exceeded state. 
> 
> Its not immediately apparent how dirty_exceeded and MAX_PAUSE interact,
> but having everybody sleep MAX_PAUSE doesn't necessarily mean its fair,
> its only fair if they dirty at the same rate.

Yeah, I forgot to mention that: when dirty_exceeded, the tasks will
typically sleep for MAX_PAUSE on every 8 pages, resulting in the
same dirty rate :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-13 16:28         ` Andrea Righi
  (?)
@ 2011-08-15 14:21         ` Wu Fengguang
  2011-08-15 14:26             ` Andrea Righi
  -1 siblings, 1 reply; 305+ messages in thread
From: Wu Fengguang @ 2011-08-15 14:21 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1759 bytes --]

Andrea,

> > @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
> >  	if (bdi->dirty_exceeded)
> >  		ratelimit = 8;
> >  
> > +	preempt_disable();
> > +	p = &__get_cpu_var(dirty_leaks);
> > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > +		*p -= nr_pages_dirtied;
> > +		current->nr_dirtied += nr_pages_dirtied;
> > +	}
> > +	preempt_enable();
> > +
> 
> I think we are still leaking some dirty pages, when the condition is
> false nr_pages_dirtied is just ignored.
> 
> Why not doing something like this?
> 
> 	current->nr_dirtied += nr_pages_dirtied;

You must mean the above line. Sorry, I failed to post another patch
before this one (attached this time). With that preparation patch, it
effectively becomes equivalent to the logic below :)

> 	if (current->nr_dirtied < ratelimit) {
> 		p = &get_cpu_var(dirty_leaks);
> 		if (*p > 0) {
> 			nr_pages_dirtied = min(*p, ratelimit -
> 							current->nr_dirtied);
> 			*p -= nr_pages_dirtied;
> 		} else
> 			nr_pages_dirtied = 0;
> 		put_cpu_var(dirty_leaks);
> 
> 		current->nr_dirtied += nr_pages_dirtied;
> 	}

Thanks,
Fengguang

> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> >  		balance_dirty_pages(mapping, current->nr_dirtied);
> >  }
> > --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> > +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> > @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
> >  	validate_creds_for_do_exit(tsk);
> >  
> >  	preempt_disable();
> > +	if (tsk->nr_dirtied)
> > +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> >  	exit_rcu();
> >  	/* causes final put_task_struct in finish_task_switch(). */
> >  	tsk->state = TASK_DEAD;

[-- Attachment #2: writeback-accurate-task-dirtied.patch --]
[-- Type: text/x-diff, Size: 1226 bytes --]

Subject: writeback: fix dirtied pages accounting on sub-page writes
Date: Thu Apr 14 07:52:37 CST 2011

When dd'ing in 512 byte chunks, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same page, but
obviously the page is only dirtied once.

Fix it by accounting nr_dirtied at page dirty time.

This will allow further simplification of the
balance_dirty_pages_ratelimited_nr() calls.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-15 22:12:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-15 22:12:27.000000000 +0800
@@ -1211,8 +1211,6 @@ void balance_dirty_pages_ratelimited_nr(
 	else
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1711,6 +1709,7 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
  2011-08-15 14:21         ` Wu Fengguang
@ 2011-08-15 14:26             ` Andrea Righi
  0 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-15 14:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

On Mon, Aug 15, 2011 at 10:21:41PM +0800, Wu Fengguang wrote:
> Andrea,
> 
> > > @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
> > >  	if (bdi->dirty_exceeded)
> > >  		ratelimit = 8;
> > >  
> > > +	preempt_disable();
> > > +	p = &__get_cpu_var(dirty_leaks);
> > > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > > +		*p -= nr_pages_dirtied;
> > > +		current->nr_dirtied += nr_pages_dirtied;
> > > +	}
> > > +	preempt_enable();
> > > +
> > 
> > I think we are still leaking some dirty pages, when the condition is
> > false nr_pages_dirtied is just ignored.
> > 
> > Why not doing something like this?
> > 
> > 	current->nr_dirtied += nr_pages_dirtied;
> 
> You must mean the above line. Sorry, I failed to post another patch
> before this one (attached this time). With that preparation patch, it
> effectively becomes equivalent to the logic below :)

OK. This is even better than my proposal, because it doesn't charge
pages that are dirtied multiple times. Sounds good.

Thanks,
-Andrea

> 
> > 	if (current->nr_dirtied < ratelimit) {
> > 		p = &get_cpu_var(dirty_leaks);
> > 		if (*p > 0) {
> > 			nr_pages_dirtied = min(*p, ratelimit -
> > 							current->nr_dirtied);
> > 			*p -= nr_pages_dirtied;
> > 		} else
> > 			nr_pages_dirtied = 0;
> > 		put_cpu_var(dirty_leaks);
> > 
> > 		current->nr_dirtied += nr_pages_dirtied;
> > 	}
> 
> Thanks,
> Fengguang
> 
> > >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > >  		balance_dirty_pages(mapping, current->nr_dirtied);
> > >  }
> > > --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> > > +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> > > @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
> > >  	validate_creds_for_do_exit(tsk);
> > >  
> > >  	preempt_disable();
> > > +	if (tsk->nr_dirtied)
> > > +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> > >  	exit_rcu();
> > >  	/* causes final put_task_struct in finish_task_switch(). */
> > >  	tsk->state = TASK_DEAD;

> Subject: writeback: fix dirtied pages accounting on sub-page writes
> Date: Thu Apr 14 07:52:37 CST 2011
> 
> When dd'ing in 512 byte chunks, generic_perform_write() calls
> balance_dirty_pages_ratelimited() 8 times for the same page, but
> obviously the page is only dirtied once.
> 
> Fix it by accounting nr_dirtied at page dirty time.
> 
> This will allow further simplification of the
> balance_dirty_pages_ratelimited_nr() calls.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-15 22:12:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-15 22:12:27.000000000 +0800
> @@ -1211,8 +1211,6 @@ void balance_dirty_pages_ratelimited_nr(
>  	else
>  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>  
> -	current->nr_dirtied += nr_pages_dirtied;
> -
>  	preempt_disable();
>  	/*
>  	 * This prevents one CPU to accumulate too many dirtied pages without
> @@ -1711,6 +1709,7 @@ void account_page_dirtied(struct page *p
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
>  		task_dirty_inc(current);
>  		task_io_account_write(PAGE_CACHE_SIZE);
> +		current->nr_dirtied++;
>  	}
>  }
>  EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 4/5] writeback: per task dirty rate limit
@ 2011-08-15 14:26             ` Andrea Righi
  0 siblings, 0 replies; 305+ messages in thread
From: Andrea Righi @ 2011-08-15 14:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, linux-mm, LKML

On Mon, Aug 15, 2011 at 10:21:41PM +0800, Wu Fengguang wrote:
> Andrea,
> 
> > > @@ -1158,6 +1160,15 @@ void balance_dirty_pages_ratelimited_nr(
> > >  	if (bdi->dirty_exceeded)
> > >  		ratelimit = 8;
> > >  
> > > +	preempt_disable();
> > > +	p = &__get_cpu_var(dirty_leaks);
> > > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > > +		*p -= nr_pages_dirtied;
> > > +		current->nr_dirtied += nr_pages_dirtied;
> > > +	}
> > > +	preempt_enable();
> > > +
> > 
> > I think we are still leaking some dirty pages, when the condition is
> > false nr_pages_dirtied is just ignored.
> > 
> > Why not doing something like this?
> > 
> > 	current->nr_dirtied += nr_pages_dirtied;
> 
> You must mean the above line. Sorry, I failed to post another patch
> before this one (attached this time). With that preparation patch, it
> effectively becomes equivalent to the logic below :)

OK. This is even better than my proposal, because it doesn't charge
pages that are dirtied multiple times. Sounds good.

Thanks,
-Andrea

> 
> > 	if (current->nr_dirtied < ratelimit) {
> > 		p = &get_cpu_var(dirty_leaks);
> > 		if (*p > 0) {
> > 			nr_pages_dirtied = min(*p, ratelimit -
> > 							current->nr_dirtied);
> > 			*p -= nr_pages_dirtied;
> > 		} else
> > 			nr_pages_dirtied = 0;
> > 		put_cpu_var(dirty_leaks);
> > 
> > 		current->nr_dirtied += nr_pages_dirtied;
> > 	}
> 
> Thanks,
> Fengguang
> 
> > >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > >  		balance_dirty_pages(mapping, current->nr_dirtied);
> > >  }
> > > --- linux-next.orig/kernel/exit.c	2011-08-08 21:43:37.000000000 +0800
> > > +++ linux-next/kernel/exit.c	2011-08-08 21:45:58.000000000 +0800
> > > @@ -1039,6 +1039,8 @@ NORET_TYPE void do_exit(long code)
> > >  	validate_creds_for_do_exit(tsk);
> > >  
> > >  	preempt_disable();
> > > +	if (tsk->nr_dirtied)
> > > +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> > >  	exit_rcu();
> > >  	/* causes final put_task_struct in finish_task_switch(). */
> > >  	tsk->state = TASK_DEAD;

> Subject: writeback: fix dirtied pages accounting on sub-page writes
> Date: Thu Apr 14 07:52:37 CST 2011
> 
> When dd'ing in 512 byte chunks, generic_perform_write() calls
> balance_dirty_pages_ratelimited() 8 times for the same page, but
> obviously the page is only dirtied once.
> 
> Fix it by accounting nr_dirtied at page dirty time.
> 
> This will allow further simplification of the
> balance_dirty_pages_ratelimited_nr() calls.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-15 22:12:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-15 22:12:27.000000000 +0800
> @@ -1211,8 +1211,6 @@ void balance_dirty_pages_ratelimited_nr(
>  	else
>  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>  
> -	current->nr_dirtied += nr_pages_dirtied;
> -
>  	preempt_disable();
>  	/*
>  	 * This prevents one CPU to accumulate too many dirtied pages without
> @@ -1711,6 +1709,7 @@ void account_page_dirtied(struct page *p
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
>  		task_dirty_inc(current);
>  		task_io_account_write(PAGE_CACHE_SIZE);
> +		current->nr_dirtied++;
>  	}
>  }
>  EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 11:14                   ` Jan Kara
@ 2011-08-16  8:35                     ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 07:14:23PM +0800, Jan Kara wrote:
> On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> > On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > > >                     origin - dirty
> > > > > >         pos_ratio = --------------
> > > > > >                     origin - goal 
> > > > > 
> > > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > > pos_ratio == 1.0:
> > > > > 
> > > > > OK, so basically you want a linear function for which:
> > > > > 
> > > > > f(goal) = 1 and has a root somewhere > goal.
> > > > > 
> > > > > (that one line is much more informative than all your graphs put
> > > > > together, one can start from there and derive your function)
> > > > > 
> > > > > That does indeed get you the above function, now what does it mean? 
> > > > 
> > > > So going by:
> > > > 
> > > >                                          write_bw
> > > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > > >                                          dirty_bw
> > > 
> > >   Actually, thinking about these formulas, why do we even bother with
> > > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > > Couldn't we just have a feedback loop (probably similar to the one
> > > computing pos_ratio) which will maintain single value - ratelimit? When we
> > > are getting close to dirty limit, we will scale ratelimit down, when we
> > > will be getting significantly below dirty limit, we will scale the
> > > ratelimit up.  Because looking at the formulas it seems to me that the net
> > > effect is the same - pos_ratio basically overrules everything... 
> > 
> > Good question. That is actually one of the early approaches I tried.
> > It somehow worked, however the resulting ratelimit was not only slow
> > to respond, but also oscillated all the time.
>   Yes, I think I vaguely remember that.
> 
> > This is due to the imperfections
> > 
> > 1) pos_ratio at best only provides a "direction" for adjusting the
> >    ratelimit. There are only vague clues that if pos_ratio is small,
> >    the errors in ratelimit should be small.
> > 
> > 2) Due to time-lag, the assumptions in (1) about "direction" and
> >    "error size" can be wrong. The ratelimit may already be
> >    over-adjusted when the dirty pages take time to approach the
> >    setpoint. The larger memory, the more time lag, the easier to
> >    overshoot and oscillate.
> > 
> > 3) dirty pages are constantly fluctuating around the setpoint,
> >    so is pos_ratio.
> > 
> > With (1) and (2), it's a control system very susceptible to disturbs.
> > With (3) we get constant disturbs. Well I had very hard time and
> > played dirty tricks (which you may never want to know ;-) trying to
> > tradeoff between response time and stableness..
>   Yes, I can see especially 2) is a problem. But I don't understand why
> your current formula would be that much different. As Peter decoded from
> your code, your current formula is:
>                                         write_bw
>  ref_bw = dirty_ratelimit * pos_ratio * --------
>                                         dirty_bw
> 
> while previously it was essentially:
>  ref_bw = dirty_ratelimit * pos_ratio

Sorry, what's the code you are referring to? Does the changelog in the
newly posted patchset make the ref_bw calculation and dirty_ratelimit
updating any clearer?

> So what is so magical about computing write_bw and dirty_bw separately? Is
> it because previously you did not use derivation of distance from the goal
> for updating pos_ratio? Because in your current formula write_bw/dirty_bw
> is a derivation of position...

dirty_bw is the main feedback. If we are throttling too much, the
resulting dirty_bw will be lower than write_bw. Thus

                                      write_bw
   ref_bw = ratelimit_in_past_200ms * --------
                                      dirty_bw

will give us a higher ref_bw than ratelimit_in_past_200ms. For a pure
dd workload, the ref_bw computed by the above formula is exactly the
balanced rate (ignoring trivial errors).
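
To make that concrete (illustrative numbers): with N = 2 dd tasks each
throttled at ratelimit_in_past_200ms = 10 MB/s, dirty_bw = 20 MB/s; if
the disk actually writes write_bw = 40 MB/s, then
ref_bw = 10 * 40/20 = 20 MB/s = write_bw / N, the balanced rate.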

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  8:35                     ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 07:14:23PM +0800, Jan Kara wrote:
> On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> > On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > > >                     origin - dirty
> > > > > >         pos_ratio = --------------
> > > > > >                     origin - goal 
> > > > > 
> > > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > > pos_ratio == 1.0:
> > > > > 
> > > > > OK, so basically you want a linear function for which:
> > > > > 
> > > > > f(goal) = 1 and has a root somewhere > goal.
> > > > > 
> > > > > (that one line is much more informative than all your graphs put
> > > > > together, one can start from there and derive your function)
> > > > > 
> > > > > That does indeed get you the above function, now what does it mean? 
> > > > 
> > > > So going by:
> > > > 
> > > >                                          write_bw
> > > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > > >                                          dirty_bw
> > > 
> > >   Actually, thinking about these formulas, why do we even bother with
> > > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > > Couldn't we just have a feedback loop (probably similar to the one
> > > computing pos_ratio) which will maintain single value - ratelimit? When we
> > > are getting close to dirty limit, we will scale ratelimit down, when we
> > > will be getting significantly below dirty limit, we will scale the
> > > ratelimit up.  Because looking at the formulas it seems to me that the net
> > > effect is the same - pos_ratio basically overrules everything... 
> > 
> > Good question. That is actually one of the early approaches I tried.
> > It somehow worked, however the resulting ratelimit was not only slow
> > to respond, but also oscillated all the time.
>   Yes, I think I vaguely remember that.
> 
> > This is due to the imperfections
> > 
> > 1) pos_ratio at best only provides a "direction" for adjusting the
> >    ratelimit. There are only vague clues that if pos_ratio is small,
> >    the errors in ratelimit should be small.
> > 
> > 2) Due to time-lag, the assumptions in (1) about "direction" and
> >    "error size" can be wrong. The ratelimit may already be
> >    over-adjusted when the dirty pages take time to approach the
> >    setpoint. The larger memory, the more time lag, the easier to
> >    overshoot and oscillate.
> > 
> > 3) dirty pages are constantly fluctuating around the setpoint,
> >    so is pos_ratio.
> > 
> > With (1) and (2), it's a control system very susceptible to disturbs.
> > With (3) we get constant disturbs. Well I had very hard time and
> > played dirty tricks (which you may never want to know ;-) trying to
> > tradeoff between response time and stableness..
>   Yes, I can see especially 2) is a problem. But I don't understand why
> your current formula would be that much different. As Peter decoded from
> your code, your current formula is:
>                                         write_bw
>  ref_bw = dirty_ratelimit * pos_ratio * --------
>                                         dirty_bw
> 
> while previously it was essentially:
>  ref_bw = dirty_ratelimit * pos_ratio

Sorry, what's the code you are referring to? Does the changelog in the
newly posted patchset make the ref_bw calculation and dirty_ratelimit
updating any clearer?

> So what is so magical about computing write_bw and dirty_bw separately? Is
> it because previously you did not use derivation of distance from the goal
> for updating pos_ratio? Because in your current formula write_bw/dirty_bw
> is a derivation of position...

dirty_bw is the main feedback. If we are throttling too much, the
resulting dirty_bw will be lower than write_bw. Thus

                                      write_bw
   ref_bw = ratelimit_in_past_200ms * --------
                                      dirty_bw

will give us a higher ref_bw than ratelimit_in_past_200ms. For a pure
dd workload, the ref_bw computed by the above formula is exactly the
balanced rate (ignoring trivial errors).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 21:40             ` Vivek Goyal
@ 2011-08-16  8:55               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

Sorry it caused you so much confusion. I hope Peter's 3rd order
polynomial abstraction in v9 clarifies the concepts a lot.

As for the old global control line

                       origin - dirty
           pos_ratio = --------------           (1)
                       origin - goal

where

        origin = 4 * thresh                     (2)

effectively decides the slope of the line. The use of @limit in code

        origin = max(4 * thresh, limit)         (3)

is merely to safeguard against the rare case where (2) would result in a
negative pos_ratio in (1).
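
A quick numeric check (purely illustrative; DIRTY_SCOPE is assumed to
be 8 here just for the arithmetic): with thresh = 1000 pages we get
goal = 1000 - 1000/8 = 875 and origin = 4000, so pos_ratio = 1.0 at
dirty = 875, about (4000 - 2000) / (4000 - 875) ~= 0.64 at dirty = 2000,
and 0 at dirty = 4000.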

I have another patch to add a "brake" area immediately below @limit
that will scale pos_ratio down to 0. However that's no longer
necessary with the 3rd order polynomial solution. 

Note that @limit will normally be equal to @thresh except in the rare
case that @thresh is suddenly knocked down and @limit is taking time
to follow it.

Thanks,
Fengguang

> Hi Fengguang,
> 
> Ok, so just trying to understand this pos_ratio little better.
> 
> You have following basic formula.
> 
>                      origin - dirty
>          pos_ratio = --------------
>                      origin - goal
> 
> Terminology is very confusing; the following is my understanding.
> 
> - setpoint == goal
> 
>   setpoint is the point where we would like our number of dirty pages to
>   be and at this point pos_ratio = 1. For global dirty this number seems
>   to be (thresh - thresh / DIRTY_SCOPE) 
> 
> - thresh
>   dirty page threshold calculated from dirty_ratio (Certain percentage of
>   total memory).
> 
> - Origin (seems to be equivalent of limit)
> 
>   This seems to be the reference point/limit we don't want to cross and
>   distance from this limit basically decides the pos_ratio. Closer we
>   are to limit, lower the pos_ratio and further we are higher the
>   pos_ratio.
> 
> So threshold is just a number which helps us determine goal and limit.
> 
> goal = thresh - thresh / DIRTY_SCOPE
> limit = 4*thresh
> 
> So goal is where we want to be, and we start throttling the task more as
> we move away from the goal and approach the limit. We keep the limit high
> enough so that (origin - dirty) does not become negative.
> 
> So we do expect to cross "thresh" otherwise thresh itself could have
> served as limit?
> 
> If my understanding is right, then can we get rid of terms "setpoint" and
> "origin". Would it be easier to understand the things if we just talk
> in terms of "goal" and "limit" and how these are derived from "thresh".
> 
> 	thresh == soft limit
> 	limit == 4*thresh (hard limit)
> 	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
> 						be in steady state).
>                      limit - dirty
>          pos_ratio = --------------
>                      limit - goal
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  8:55               ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

Sorry it caused you so much confusion. I hope Peter's 3rd order
polynomial abstraction in v9 clarifies the concepts a lot.

As for the old global control line

                       origin - dirty
           pos_ratio = --------------           (1)
                       origin - goal

where

        origin = 4 * thresh                     (2)

effectively decides the slope of the line. The use of @limit in code

        origin = max(4 * thresh, limit)         (3)

is merely to safeguard against the rare case where (2) would result in a
negative pos_ratio in (1).

I have another patch to add a "brake" area immediately below @limit
that will scale pos_ratio down to 0. However that's no longer
necessary with the 3rd order polynomial solution. 

Note that @limit will normally be equal to @thresh except in the rare
case that @thresh is suddenly knocked down and @limit is taking time
to follow it.

Thanks,
Fengguang

> Hi Fengguang,
> 
> Ok, so just trying to understand this pos_ratio little better.
> 
> You have following basic formula.
> 
>                      origin - dirty
>          pos_ratio = --------------
>                      origin - goal
> 
> Terminology is very confusing; the following is my understanding.
> 
> - setpoint == goal
> 
>   setpoint is the point where we would like our number of dirty pages to
>   be and at this point pos_ratio = 1. For global dirty this number seems
>   to be (thresh - thresh / DIRTY_SCOPE) 
> 
> - thresh
>   dirty page threshold calculated from dirty_ratio (Certain percentage of
>   total memory).
> 
> - Origin (seems to be equivalent of limit)
> 
>   This seems to be the reference point/limit we don't want to cross and
>   distance from this limit basically decides the pos_ratio. Closer we
>   are to limit, lower the pos_ratio and further we are higher the
>   pos_ratio.
> 
> So threshold is just a number which helps us determine goal and limit.
> 
> goal = thresh - thresh / DIRTY_SCOPE
> limit = 4*thresh
> 
> So goal is where we want to be, and we start throttling the task more as
> we move away from the goal and approach the limit. We keep the limit high
> enough so that (origin - dirty) does not become negative.
> 
> So we do expect to cross "thresh" otherwise thresh itself could have
> served as limit?
> 
> If my understanding is right, then can we get rid of terms "setpoint" and
> "origin". Would it be easier to understand the things if we just talk
> in terms of "goal" and "limit" and how these are derived from "thresh".
> 
> 	thresh == soft limit
> 	limit == 4*thresh (hard limit)
> 	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
> 						be in steady state).
>                      limit - dirty
>          pos_ratio = --------------
>                      limit - goal
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09  2:08     ` Vivek Goyal
@ 2011-08-16  8:59       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

> > bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> > that the resulted task rate limit can drive the dirty pages back to the
> > global/bdi setpoints.
> > 
> 
> IMHO, "position_ratio" is not necessarily very intutive. Can there be
> a better name? Based on your slides, it is scaling factor applied to
> task rate limit depending on how well we are doing in terms of meeting
> our goal of dirty limit. Will "dirty_rate_scale_factor" or something like
> that make sense and be little more intutive? 

Yeah, position_ratio is a scale factor applied to the dirty rate, and I
added a comment saying so. On the other hand, position_ratio does
reflect the underlying "position control of dirty pages" logic, so over
time it should be reasonably understandable the other way around :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  8:59       ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

> > bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> > that the resulted task rate limit can drive the dirty pages back to the
> > global/bdi setpoints.
> > 
> 
> IMHO, "position_ratio" is not necessarily very intutive. Can there be
> a better name? Based on your slides, it is scaling factor applied to
> task rate limit depending on how well we are doing in terms of meeting
> our goal of dirty limit. Will "dirty_rate_scale_factor" or something like
> that make sense and be little more intutive? 

Yeah, position_ratio is a scale factor applied to the dirty rate, and I
added a comment saying so. On the other hand, position_ratio does
reflect the underlying "position control of dirty pages" logic, so over
time it should be reasonably understandable the other way around :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 14:20               ` Wu Fengguang
  (?)
@ 2011-08-22 15:38                 ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> To start with,
> 
>                                                 write_bw
>         ref_bw = task_ratelimit_in_past_200ms * --------
>                                                 dirty_bw
> 
> where
>         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> 
> > > Now all of the above would seem to suggest:
> > > 
> > >   dirty_ratelimit := ref_bw
> 
> Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> started with exactly the above equation when I got choked by pure
> pos_bw based feedback control (as mentioned in the reply to Jan's
> email) and introduced the ref_bw estimation as the way out.
> 
> But there are some imperfections in ref_bw, too. Which makes it not
> suitable for direct use:
> 
> 1) large fluctuations

OK, understood.

> 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> becomes unbalanced match, which leads to large systematical errors
> in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> be compensated smoothly.

OK.

> 3) since we ultimately want to
> 
> - keep the dirty pages around the setpoint as long time as possible
> - keep the fluctuations of task ratelimit as small as possible

Fair enough ;-)

> the update policy used for (2) also serves the above goals nicely:
> if for some reason the dirty pages are high (pos_bw < dirty_ratelimit)
> and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> point in bringing up dirty_ratelimit in a hurry, which would only hurt
> both of the above goals.

Right, so still I feel somewhat befuddled, so we have:

	dirty_ratelimit - rate at which we throttle dirtiers as
			  estimated upto 200ms ago.

	pos_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in dirty pages around its target

	bw_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in input/output bandwidth

and we need to basically do:

	dirty_ratelimit *= pos_ratio * bw_ratio

to update the dirty_ratelimit to reflect the current state. However per
1) and 2) bw_ratio is crappy and hard to fix.

So you propose to update dirty_ratelimit only if both pos_ratio and
bw_ratio point in the same direction, however that would result in:

  if (pos_ratio < UNIT && bw_ratio < UNIT ||
      pos_ratio > UNIT && bw_ratio > UNIT) {
	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
  }

> > > However for that you use:
> > > 
> > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > 
> > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > >         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> The above are merely constraints to the dirty_ratelimit update.
> It serves to
> 
> 1) stop adjusting the rate when it's against the position control
>    target (the adjusted rate will slow down the progress of dirty
>    pages going back to setpoint).

Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
they point in different directions however:

 0.5 < 1 &&  0.5 * 1.1 < 1

so your code will in fact update the dirty_ratelimit, even though the
two factors point in opposite directions.
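
Concretely (same illustrative numbers): with dirty_ratelimit = 100,
those factors give pos_bw = 50 and ref_bw = 55; both are below 100, so
the update fires and dirty_ratelimit becomes max(50, 55) = 55 -- a
bounded step, even though bw_ratio alone pointed upward.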

> 2) limit the step size. pos_bw changes value step by step,
>    leaving a consistent trace compared to the randomly jumping
>    ref_bw. pos_bw also has smaller errors in the stable state and
>    normally has larger errors when there are big errors in rate. So
>    it's a pretty good limiting factor for the step size of dirty_ratelimit.

OK, so that's the min/max stuff, however it only works because you use
pos_bw and ref_bw instead of the fully separated factors.

> Hope the above elaboration helps :)

A little.. 

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-22 15:38                 ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> To start with,
> 
>                                                 write_bw
>         ref_bw = task_ratelimit_in_past_200ms * --------
>                                                 dirty_bw
> 
> where
>         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> 
> > > Now all of the above would seem to suggest:
> > > 
> > >   dirty_ratelimit := ref_bw
> 
> Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> started with exactly the above equation when I got choked by pure
> pos_bw based feedback control (as mentioned in the reply to Jan's
> email) and introduced the ref_bw estimation as the way out.
> 
> But there are some imperfections in ref_bw, too. Which makes it not
> suitable for direct use:
> 
> 1) large fluctuations

OK, understood.

> 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> match becomes unbalanced, which leads to large systematic errors
> in ref_bw. Truncates, due to their possibly bumpy nature, can hardly
> be compensated for smoothly.

OK.

> 3) since we ultimately want to
> 
> - keep the dirty pages around the setpoint as long time as possible
> - keep the fluctuations of task ratelimit as small as possible

Fair enough ;-)

> the update policy used for (2) also serves the above goals nicely:
> if for some reason the dirty pages are high (pos_bw < dirty_ratelimit)
> and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> point in bringing up dirty_ratelimit in a hurry, which would only hurt
> both of the above goals.

Right, so still I feel somewhat befuddled, so we have:

	dirty_ratelimit - rate at which we throttle dirtiers as
			  estimated upto 200ms ago.

	pos_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in dirty pages around its target

	bw_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in input/output bandwidth

and we need to basically do:

	dirty_ratelimit *= pos_ratio * bw_ratio

to update the dirty_ratelimit to reflect the current state. However per
1) and 2) bw_ratio is crappy and hard to fix.

So you propose to update dirty_ratelimit only if both pos_ratio and
bw_ratio point in the same direction, however that would result in:

  if (pos_ratio < UNIT && bw_ratio < UNIT ||
      pos_ratio > UNIT && bw_ratio > UNIT) {
	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
  }

> > > However for that you use:
> > > 
> > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > 
> > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > >         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> The above are merely constraints to the dirty_ratelimit update.
> It serves to
> 
> 1) stop adjusting the rate when it's against the position control
>    target (the adjusted rate will slow down the progress of dirty
>    pages going back to setpoint).

Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
they point in different directions however:

 0.5 < 1 &&  0.5 * 1.1 < 1

so your code will in fact update the dirty_ratelimit, even though the
two factors point in opposite directions.

> 2) limit the step size. pos_bw is changing values step by step,
>    leaving a consistent trace comparing to the randomly jumping
>    ref_bw. pos_bw also has smaller errors in stable state and normally
>    have larger errors when there are big errors in rate. So it's a
>    pretty good limiting factor for the step size of dirty_ratelimit.

OK, so that's the min/max stuff, however it only works because you use
pos_bw and ref_bw instead of the fully separated factors.

> Hope the above elaboration helps :)

A little.. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-22 15:38                 ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> To start with,
> 
>                                                 write_bw
>         ref_bw = task_ratelimit_in_past_200ms * --------
>                                                 dirty_bw
> 
> where
>         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> 
> > > Now all of the above would seem to suggest:
> > > 
> > >   dirty_ratelimit := ref_bw
> 
> Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> started with exactly the above equation when I got choked by pure
> pos_bw based feedback control (as mentioned in the reply to Jan's
> email) and introduced the ref_bw estimation as the way out.
> 
> But there are some imperfections in ref_bw, too. Which makes it not
> suitable for direct use:
> 
> 1) large fluctuations

OK, understood.

> 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> becomes unbalanced match, which leads to large systematical errors
> in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> be compensated smoothly.

OK.

> 3) since we ultimately want to
> 
> - keep the dirty pages around the setpoint as long time as possible
> - keep the fluctuations of task ratelimit as small as possible

Fair enough ;-)

> the update policy used for (2) also serves the above goals nicely:
> if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> point to bring up dirty_ratelimit in a hurry and to hurt both the
> above two goals.

Right, so still I feel somewhat befuddled, so we have:

	dirty_ratelimit - rate at which we throttle dirtiers as
			  estimated upto 200ms ago.

	pos_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in dirty pages around its target

	bw_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in input/output bandwidth

and we need to basically do:

	dirty_ratelimit *= pos_ratio * bw_ratio

to update the dirty_ratelimit to reflect the current state. However per
1) and 2) bw_ratio is crappy and hard to fix.

So you propose to update dirty_ratelimit only if both pos_ratio and
bw_ratio point in the same direction, however that would result in:

  if (pos_ratio < UNIT && bw_ratio < UNIT ||
      pos_ratio > UNIT && bw_ratio > UNIT) {
	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
  }

> > > However for that you use:
> > > 
> > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > 
> > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > >         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> The above are merely constraints to the dirty_ratelimit update.
> It serves to
> 
> 1) stop adjusting the rate when it's against the position control
>    target (the adjusted rate will slow down the progress of dirty
>    pages going back to setpoint).

Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
they point in different directions however:

 0.5 < 1 &&  0.5 * 1.1 < 1

so your code will in fact update the dirty_ratelimit, even though the
two factors point in opposite directions.

> 2) limit the step size. pos_bw is changing values step by step,
>    leaving a consistent trace comparing to the randomly jumping
>    ref_bw. pos_bw also has smaller errors in stable state and normally
>    have larger errors when there are big errors in rate. So it's a
>    pretty good limiting factor for the step size of dirty_ratelimit.

OK, so that's the min/max stuff, however it only works because you use
pos_bw and ref_bw instead of the fully separated factors.

> Hope the above elaboration helps :)

A little.. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-22 15:38                 ` Peter Zijlstra
@ 2011-08-23  3:40                   ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 22, 2011 at 11:38:07PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > 
> > To start with,
> > 
> >                                                 write_bw
> >         ref_bw = task_ratelimit_in_past_200ms * --------
> >                                                 dirty_bw
> > 
> > where
> >         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> > 
> > > > Now all of the above would seem to suggest:
> > > > 
> > > >   dirty_ratelimit := ref_bw
> > 
> > Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> > started with exactly the above equation when I got choked by pure
> > pos_bw based feedback control (as mentioned in the reply to Jan's
> > email) and introduced the ref_bw estimation as the way out.
> > 
> > But there are some imperfections in ref_bw, too. Which makes it not
> > suitable for direct use:
> > 
> > 1) large fluctuations
> 
> OK, understood.
> 
> > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> > becomes an unbalanced match, which leads to large systematic errors
> > in ref_bw. The truncates, due to their possibly bumpy nature, can hardly
> > be compensated smoothly.
> 
> OK.
> 
> > 3) since we ultimately want to
> > 
> > - keep the dirty pages around the setpoint for as long as possible
> > - keep the fluctuations of task ratelimit as small as possible
> 
> Fair enough ;-)
> 
> > the update policy used for (2) also serves the above goals nicely:
> > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> > point to bring up dirty_ratelimit in a hurry and to hurt both the
> > above two goals.
> 
> Right, so still I feel somewhat befuddled, so we have:
> 
> 	dirty_ratelimit - rate at which we throttle dirtiers as
> 			  estimated up to 200ms ago.

Note that bdi->dirty_ratelimit is supposed to be the balanced
ratelimit, i.e. (write_bw / N), regardless of whether dirty pages meet
the setpoint.

In _concept_, the bdi balanced ratelimit is updated _independently_ of
the position control embodied in the task ratelimit calculation.

A lot of confusion seems to come from the seemingly intertwined rate
and position controls; however, in my mind there are two levels of
relationship:

1) they work fundamentally independently of each other; each tries to
   fulfill a single target (either the balanced rate or the balanced
   position)

2) _based_ on (1), and completely optional: try to constrain the rate
   update to get a more stable ->dirty_ratelimit and a more balanced
   dirty position

Note that (2) is not a must even if there are systematic errors in
the balanced_rate calculation. For example, the v8 patchset only does
(1) and hence does the simple

        bdi->dirty_ratelimit = balanced_rate;

And it can still balance at some point (though not exactly around the setpoint):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/balance_dirty_pages-pages.png

Even though ext4 has a mismatched (dirty_rate:write_bw ~= 3:2), which
introduces systematic errors into balanced_rate:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/global_dirtied_written.png

> 	pos_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in dirty pages around its target

So pos_ratio

- is a _limiting_ factor rather than an _adjusting_ factor for
  updating ->dirty_ratelimit (when we do (2))

- not a factor at all for updating balanced_rate (whether or not we do (2))
  well, in this concept: the balanced_rate formula inherently does not
  derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
  based on the ratelimit executed for the past 200ms:

          balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

  and task_ratelimit_200ms can conveniently be estimated from

          task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

  There is fundamentally no dependency between balanced_rate_(i+1) and
  balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
  only asks for _whatever_ CONSTANT task ratelimit to be executed for
  200ms, then it gets the balanced rate from the dirty_rate feedback.

  We may alternatively record every task_ratelimit executed in the
  past 200ms and average them all to get task_ratelimit_200ms. In this
  way we take the "superfluous" pos_ratio out of sight :)
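
As a minimal sketch of one such estimation step (illustrative
pseudo-C; the function name and the UNIT-scaled fixed-point convention
are assumptions for the example, not the actual kernel code):

	/* one balanced_rate estimate, redone from scratch every 200ms */
	static unsigned long estimate_balanced_rate(unsigned long prev_rate,
						    unsigned long pos_ratio,
						    unsigned long write_bw,
						    unsigned long dirty_rate)
	{
		/* the (roughly constant) ratelimit executed by tasks in
		 * the past 200ms: prev_rate scaled by pos_ratio */
		unsigned long task_ratelimit_200ms =
					prev_rate * pos_ratio / UNIT;

		/* rescale by the observed write_bw:dirty_rate feedback,
		 * ignoring overflow for clarity */
		return task_ratelimit_200ms * write_bw / dirty_rate;
	}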

> 	bw_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in input/output bandwidth
> 
> and we need to basically do:
> 
> 	dirty_ratelimit *= pos_ratio * bw_ratio

So there is no such recursion at all:

        balanced_rate *= bw_ratio

Each balanced_rate is estimated afresh, based on its own 200ms period.

> to update the dirty_ratelimit to reflect the current state. However per
> 1) and 2) bw_ratio is crappy and hard to fix.
> 
> So you propose to update dirty_ratelimit only if both pos_ratio and
> bw_ratio point in the same direction, however that would result in:
> 
>   if ((pos_ratio < UNIT && bw_ratio < UNIT) ||
>       (pos_ratio > UNIT && bw_ratio > UNIT)) {
> 	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
> 	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
>   }

We start by doing this for (1):

        dirty_ratelimit = balanced_rate

and then try to refine it for (1)+(2):

        dirty_ratelimit => balanced_rate, but limit the progress by pos_ratio

> > > > However for that you use:
> > > > 
> > > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > > 
> > > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > > >         dirty_ratelimit = min(ref_bw, pos_bw);
> > 
> > The above are merely constraints to the dirty_ratelimit update.
> > It serves to
> > 
> > 1) stop adjusting the rate when it's against the position control
> >    target (the adjusted rate will slow down the progress of dirty
> >    pages going back to setpoint).
> 
> Not strictly speaking: suppose pos_ratio = 0.5 and bw_ratio = 1.1; then
> they point in different directions, and yet:
> 
>  0.5 < 1 &&  0.5 * 1.1 < 1
> 
> so your code will in fact update the dirty_ratelimit, even though the
> two factors point in opposite directions.

It does not work that way since pos_ratio does not take part in the
multiplication. However I admit that the tests

        (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
        (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)

don't aim to avoid all unnecessary updates, and may even stop some
rightful updates. It's not possible to act perfectly. It's merely
a rule that sounds "reasonable" in theory and works reasonably well in
practice :) I'd be happy to try more if there are better ones.

> > 2) limit the step size. pos_bw is changing values step by step,
> >    leaving a consistent trace compared to the randomly jumping
> >    ref_bw. pos_bw also has smaller errors in stable state and normally
> >    has larger errors when there are big errors in rate. So it's a
> >    pretty good limiting factor for the step size of dirty_ratelimit.
> 
> OK, so that's the min/max stuff; however, it only works because you use
> pos_bw and ref_bw instead of the fully separated factors.

Yes, the min/max stuff is for limiting the step size. The "limiting"
intention can be made clearer if written as

        delta = balanced_rate - base_rate;

        if (delta > pos_rate - base_rate)
            delta = pos_rate - base_rate;

        delta /= 8;
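
Filling that fragment out a little (a sketch only, assuming the rate
needs to be raised; the downward direction would be limited
symmetrically, and the names are illustrative):

        long delta = balanced_rate - base_rate;
        long limit = pos_rate - base_rate;

        if (delta > limit)      /* don't move past the position target */
                delta = limit;

        base_rate += delta / 8; /* take only 1/8 of the allowed step */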

> > Hope the above elaboration helps :)
> 
> A little.. 

And now? ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-23  3:40                   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 22, 2011 at 11:38:07PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > 
> > To start with,
> > 
> >                                                 write_bw
> >         ref_bw = task_ratelimit_in_past_200ms * --------
> >                                                 dirty_bw
> > 
> > where
> >         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> > 
> > > > Now all of the above would seem to suggest:
> > > > 
> > > >   dirty_ratelimit := ref_bw
> > 
> > Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> > started with exactly the above equation when I got choked by pure
> > pos_bw based feedback control (as mentioned in the reply to Jan's
> > email) and introduced the ref_bw estimation as the way out.
> > 
> > But there are some imperfections in ref_bw, too. Which makes it not
> > suitable for direct use:
> > 
> > 1) large fluctuations
> 
> OK, understood.
> 
> > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> > becomes unbalanced match, which leads to large systematical errors
> > in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> > be compensated smoothly.
> 
> OK.
> 
> > 3) since we ultimately want to
> > 
> > - keep the dirty pages around the setpoint as long time as possible
> > - keep the fluctuations of task ratelimit as small as possible
> 
> Fair enough ;-)
> 
> > the update policy used for (2) also serves the above goals nicely:
> > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> > point to bring up dirty_ratelimit in a hurry and to hurt both the
> > above two goals.
> 
> Right, so still I feel somewhat befuddled, so we have:
> 
> 	dirty_ratelimit - rate at which we throttle dirtiers as
> 			  estimated upto 200ms ago.

Note that bdi->dirty_ratelimit is supposed to be the balanced
ratelimit, ie. (write_bw / N), regardless whether dirty pages meets
the setpoint.

In _concept_, the bdi balanced ratelimit is updated _independent_ of
the position control embodied in the task ratelimit calculation.

A lot of confusions seem to come from the seemingly inter-twisted rate
and position controls, however in my mind, there are two levels of
relationship:

1) work fundamentally independent of each other, each tries to fulfill
   one single target (either balanced rate or balanced position)

2) _based_ on (1), completely optional, try to constraint the rate update 
   to get more stable ->dirty_ratelimit and more balanced dirty position

Note that (2) is not a must even if there are systematic errors in
balanced_rate calculation. For example, the v8 patchset only does (1)
and hence do simple

        bdi->dirty_ratelimit = balanced_rate;

And it can still balance at some point (though not exactly around the setpoint):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/balance_dirty_pages-pages.png

Even if ext4 has mis-matched (dirty_rate:write_bw ~= 3:2) hence
introduced systematic errors in balanced_rate:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/global_dirtied_written.png

> 	pos_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in dirty pages around its target

So pos_ratio is

- is a _limiting_ factor rather than an _adjusting_ factor for
  updating ->dirty_ratelimit (when do (2))

- not a factor at all for updating balanced_rate (whether or not we do (2))
  well, in this concept: the balanced_rate formula inherently does not
  derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
  based on the ratelimit executed for the past 200ms:

          balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

  and task_ratelimit_200ms happen to can be estimated from

          task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

  There is fundamentally no dependency between balanced_rate_(i+1) and
  balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
  only asks for _whatever_ CONSTANT task ratelimit to be executed for
  200ms, then it get the balanced rate from the dirty_rate feedback.

  We may alternatively record every task_ratelimit executed in the
  past 200ms and average them all to get task_ratelimit_200ms. In this
  way we take the "superfluous" pos_ratio out of sight :)

> 	bw_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in input/output bandwidth
> 
> and we need to basically do:
> 
> 	dirty_ratelimit *= pos_ratio * bw_ratio

So there is even no such recursing at all:

        balanced_rate *= bw_ratio

Each balanced_rate is estimated from the start, based on each 200ms period.

> to update the dirty_ratelimit to reflect the current state. However per
> 1) and 2) bw_ratio is crappy and hard to fix.
> 
> So you propose to update dirty_ratelimit only if both pos_ratio and
> bw_ratio point in the same direction, however that would result in:
> 
>   if (pos_ratio < UNIT && bw_ratio < UNIT ||
>       pos_ratio > UNIT && bw_ratio > UNIT) {
> 	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
> 	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
>   }

We start by doing this for (1):

        dirty_ratelimit = balanced_rate

and then try to refine it for (1)+(2):

        dirty_ratelimit => balanced_rate, but limit the progress by pos_ratio

> > > > However for that you use:
> > > > 
> > > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > > 
> > > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > > >         dirty_ratelimit = min(ref_bw, pos_bw);
> > 
> > The above are merely constraints to the dirty_ratelimit update.
> > It serves to
> > 
> > 1) stop adjusting the rate when it's against the position control
> >    target (the adjusted rate will slow down the progress of dirty
> >    pages going back to setpoint).
> 
> Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
> they point in different directions however:
> 
>  0.5 < 1 &&  0.5 * 1.1 < 1
> 
> so your code will in fact update the dirty_ratelimit, even though the
> two factors point in opposite directions.

It does not work that way since pos_ratio does not take part in the
multiplication. However I admit that the tests

        (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
        (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)

don't aim to avoid all unnecessary updates, and it may even stop some
rightful updates. It's not possible at all to act perfect. It's merely
a rule that sounds "reasonable" in theory and works reasonably good in
practice :) I'd be happy to try more if there are better ones.

> > 2) limit the step size. pos_bw is changing values step by step,
> >    leaving a consistent trace comparing to the randomly jumping
> >    ref_bw. pos_bw also has smaller errors in stable state and normally
> >    have larger errors when there are big errors in rate. So it's a
> >    pretty good limiting factor for the step size of dirty_ratelimit.
> 
> OK, so that's the min/max stuff, however it only works because you use
> pos_bw and ref_bw instead of the fully separated factors.

Yes, the min/max stuff is for limiting the step size. The "limiting"
intention can be made more clear if written as

        delta = balanced_rate - base_rate;

        if (delta > pos_rate - base_rate)
            delta = pos_rate - base_rate;

        delta /= 8;

> > Hope the above elaboration helps :)
> 
> A little.. 

And now? ;)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23  3:40                   ` Wu Fengguang
@ 2011-08-23 10:01                     ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-23 10:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> - not a factor at all for updating balanced_rate (whether or not we do (2))
>   well, in this concept: the balanced_rate formula inherently does not
>   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
>   based on the ratelimit executed for the past 200ms:
> 
>           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

Ok, this is where it all goes funny..

So if you want completely separated feedback loops I would expect
something like:

	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms

The former is a complete feedback loop, expressing the new value in the
old value (*) with bw_ratio as feedback parameter; if we throttled too
much, the dirty_rate will have dropped and the bw_ratio will be <1
causing the balance_rate to drop increasing the dirty_rate, and vice
versa.

(*) which is the form I expected and why I thought your primary feedback
loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio

With the above, balance_rate is an independent variable that tracks the
write bandwidth. Now possibly you'd want a low-pass filter on that since
your bw_ratio is a bit funny in the head, but that's another story.

Then when you use the balance_rate to actually throttle tasks you apply
your secondary control steering the dirty page count, yielding:

	task_rate = balance_rate * pos_ratio
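
In sketch form (illustrative pseudo-C with UNIT-scaled ratios, merely
restating the proposed separation):

	/* primary loop, every 200ms: track the write bandwidth */
	balance_rate = balance_rate * bw_ratio / UNIT;

	/* secondary control, at each throttle point: steer the
	 * dirty page count around its setpoint */
	task_rate = balance_rate * pos_ratio / UNIT;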

>   and task_ratelimit_200ms can conveniently be estimated from
> 
>           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

>   We may alternatively record every task_ratelimit executed in the
>   past 200ms and average them all to get task_ratelimit_200ms. In this
>   way we take the "superfluous" pos_ratio out of sight :) 

Right, so I'm not at all sure that makes sense; it's not immediately
evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
clear to me why your primary feedback loop uses task_ratelimit_200ms at
all. 

>   There is fundamentally no dependency between balanced_rate_(i+1) and
>   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
>   only asks for _whatever_ CONSTANT task ratelimit to be executed for
>   200ms, then it gets the balanced rate from the dirty_rate feedback.

How can there not be a relation between balance_rate_(i+1) and
balance_rate_(i) ? 

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 10:01                     ` Peter Zijlstra
@ 2011-08-23 14:15                       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-23 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > - not a factor at all for updating balanced_rate (whether or not we do (2))
> >   well, in this concept: the balanced_rate formula inherently does not
> >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> >   based on the ratelimit executed for the past 200ms:
> > 
> >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> 
> Ok, this is where it all goes funny..
> 
> So if you want completely separated feedback loops I would expect

If we call it a feedback loop, then it's a series of independent feedback
loops of depth 1.  Because each balanced_rate is a fresh estimation
dependent solely on

- writeout bandwidth
- N, the number of dd tasks

in the past 200ms.

As long as a CONSTANT ratelimit (whatever value it is) is executed in
the past 200ms, we can get the same balanced_rate.

        balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate

The resulting balanced_rate is independent of how large the CONSTANT
ratelimit is, because if we start with a doubled CONSTANT ratelimit,
we'll see a doubled dirty_rate and end up with the same balanced_rate.

In that manner, balance_rate_(i+1) does not really depend on the
value of balance_rate_(i): whatever balance_rate_(i) is, we are going
to get the same balance_rate_(i+1) if not considering estimation
errors. Note that the estimation errors mainly come from the
fluctuations in dirty_rate.
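
For example (illustrative numbers), with write_bw = 100 MB/s and N = 2
dd tasks:

        ratelimit = 80 MB/s  =>  dirty_rate = 2 * 80 = 160 MB/s
                     balanced_rate = 80 * 100/160 = 50 MB/s

        ratelimit = 40 MB/s  =>  dirty_rate = 2 * 40 =  80 MB/s
                     balanced_rate = 40 * 100/80  = 50 MB/s

Either starting point lands on write_bw / N = 50 MB/s in a single step.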

That may well be what's already in your mind, just that we disagree
about the terms ;)

> something like:
> 
> 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> 
> The former is a complete feedback loop, expressing the new value in the
> old value (*) with bw_ratio as feedback parameter; if we throttled too
> much, the dirty_rate will have dropped and the bw_ratio will be <1
> causing the balance_rate to drop increasing the dirty_rate, and vice
> versa.

In principle, the bw_ratio works that way. However,
balance_rate_(i) is not the exact _executed_ ratelimit in
balance_dirty_pages().

> (*) which is the form I expected and why I thought your primary feedback
> loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
 
Because the executed ratelimit was rate_(i) * pos_ratio.

> With the above, balance_rate is an independent variable that tracks the
> write bandwidth. Now possibly you'd want a low-pass filter on that since
> your bw_ratio is a bit funny in the head, but that's another story.

Yeah.

> Then when you use the balance_rate to actually throttle tasks you apply
> your secondary control steering the dirty page count, yielding:
> 
> 	task_rate = balance_rate * pos_ratio

Right. Note the above formula is not a derived one, but an original
one that later leads to pos_ratio showing up in the calculation of
balanced_rate.

> >   and task_ratelimit_200ms can conveniently be estimated from
> > 
> >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> 
> >   We may alternatively record every task_ratelimit executed in the
> >   past 200ms and average them all to get task_ratelimit_200ms. In this
> >   way we take the "superfluous" pos_ratio out of sight :) 
> 
> Right, so I'm not at all sure that makes sense; it's not immediately
> evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> clear to me why your primary feedback loop uses task_ratelimit_200ms at
> all. 

task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
by balance_dirty_pages(). So this is an original formula:

        task_ratelimit = balance_rate * pos_ratio

task_ratelimit_200ms is also used as an original data source in

        balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

Then we try to estimate task_ratelimit_200ms by assuming all tasks
have been executing the same CONSTANT ratelimit in
balance_dirty_pages(). Hence we get

        task_ratelimit_200ms ~= prev_balance_rate * pos_ratio

> >   There is fundamentally no dependency between balanced_rate_(i+1) and
> >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> >   200ms, then it gets the balanced rate from the dirty_rate feedback.
> 
> How can there not be a relation between balance_rate_(i+1) and
> balance_rate_(i) ? 

In this manner: even though balance_rate_(i) is somehow used for
calculating balance_rate_(i+1), the latter will evaluate to the same
value whatever balance_rate_(i) may be.

That is, there are two kinds of dependency: the apparent dependency in
the formula, and the effective dependency in the data values.
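
In the idealized case (assuming every task executes the estimated
ratelimit exactly, so that dirty_rate = N * balance_rate_(i) *
pos_ratio) the cancellation is explicit:

        balance_rate_(i+1) = task_ratelimit_200ms * write_bw / dirty_rate
                           = (balance_rate_(i) * pos_ratio) * write_bw /
                             (N * balance_rate_(i) * pos_ratio)
                           = write_bw / N

whatever value balance_rate_(i) had.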

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 10:01                     ` Peter Zijlstra
@ 2011-08-23 14:36                       ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-23 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 12:01:00PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > - not a factor at all for updating balanced_rate (whether or not we do (2))
> >   well, in this concept: the balanced_rate formula inherently does not
> >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> >   based on the ratelimit executed for the past 200ms:
> > 
> >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> 
> Ok, this is where it all goes funny..

Exactly. This is where it gets confusing and is the bone of contention.

> 
> So if you want completely separated feedback loops I would expect
> something like:
> 
> 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> 

I agree. This makes sense. IOW:

                                                      write_bw
bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_(n-1) * ----------
                                                      dirty_rate

> The former is a complete feedback loop, expressing the new value in the
> old value (*) with bw_ratio as feedback parameter; if we throttled too
> much, the dirty_rate will have dropped and the bw_ratio will be <1
> causing the balance_rate to drop increasing the dirty_rate, and vice
> versa.

I think you meant:

"if we throttled too much, the dirty_rate will have dropped and the bw_ratio
 will be >1, causing the balance_rate to increase, hence increasing the
 dirty_rate, and vice versa."

> 
> (*) which is the form I expected and why I thought your primary feedback
> loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
> 
> With the above, balance_rate is an independent variable that tracks the
> write bandwidth. Now possibly you'd want a low-pass filter on that since
> your bw_ratio is a bit funny in the head, but that's another story.
> 
> Then when you use the balance_rate to actually throttle tasks you apply
> your secondary control steering the dirty page count, yielding:
> 
> 	task_rate = balance_rate * pos_ratio
> 
> >   and task_ratelimit_200ms can conveniently be estimated from
> > 
> >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> 
> >   We may alternatively record every task_ratelimit executed in the
> >   past 200ms and average them all to get task_ratelimit_200ms. In this
> >   way we take the "superfluous" pos_ratio out of sight :) 
> 
> Right, so I'm not at all sure that makes sense; it's not immediately
> evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> clear to me why your primary feedback loop uses task_ratelimit_200ms at
> all. 
> 

Well, I thought that this much is evident:

task_ratelimit = balanced_rate * pos_ratio

What is not evident to me is the following:

balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

Instead, like you, I also thought that the following is more obvious:

balanced_rate_(i+1) = balanced_rate_(i) * bw_ratio

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 14:15                       ` Wu Fengguang
@ 2011-08-23 17:47                         ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-23 17:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 10:15:04PM +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If we call it a feedback loop, then it's a series of independent feedback
> loops of depth 1.  Because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulting balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see a doubled dirty_rate and end up with the same balanced_rate.
> 
> In that manner, balance_rate_(i+1) does not really depend on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) if not considering estimation
> errors. Note that the estimation errors mainly come from the
> fluctuations in dirty_rate.
> 
> That may well be what's already in your mind, just that we disagree
> about the terms ;)
> 
> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However,
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().
> 
> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.
> 
> > With the above, balance_rate is an independent variable that tracks the
> > write bandwidth. Now possibly you'd want a low-pass filter on that since
> > your bw_ratio is a bit funny in the head, but that's another story.
> 
> Yeah.
> 
> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.
> 
> > >   and task_ratelimit_200ms can conveniently be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense; it's not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 

I think the above calculates to:

 task_ratelimit = balanced_rate * pos_ratio
or
 task_ratelimit = task_ratelimit_200ms * write_bw / dirty_rate * pos_ratio
or
 task_ratelimit = prev-balance_rate * pos_ratio * write_bw / dirty_rate * pos_ratio
or
 task_ratelimit = prev-balance_rate * write_bw / dirty_rate * (pos_ratio)^2

And the question is why not:

 task_ratelimit = prev-balance_rate * write_bw / dirty_rate * pos_ratio

Which sounds intuitive as compared to the former one.

You somehow directly jump to

	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

without explaining why the following will not work:

	balanced_rate_(i+1) = balanced_rate_(i) * write_bw / dirty_rate

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 17:47                         ` Vivek Goyal
@ 2011-08-24  0:12                           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-24  0:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> You somehow directly jump to  
> 
> 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 
> without explaining why the following will not work:
> 
> 	balanced_rate_(i+1) = balance_rate_(i) * write_bw / dirty_rate

Thanks for asking that, it's probably the root of the confusion, so let
me answer it standalone.

It's actually pretty simple to explain this equation:

                                               write_bw
        balanced_rate = task_ratelimit_200ms * ----------       (1)
                                               dirty_rate

If there are N dd tasks, each task is throttled at task_ratelimit_200ms
for the past 200ms, we are going to measure the overall bdi dirty rate

        dirty_rate = N * task_ratelimit_200ms                   (2)

put (2) into (1) we get

        balanced_rate = write_bw / N                            (3)

So equation (1) is the right estimation to get the desired target (3).
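
(A quick sanity check with made-up numbers: N = 4 tasks each throttled
at task_ratelimit_200ms = 50MB/s yields a measured dirty_rate = 200MB/s;
with write_bw = 100MB/s, equation (1) gives balanced_rate =
50 * 100/200 = 25MB/s = write_bw/4, exactly the target (3).)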


As for

                                                  write_bw
        balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
                                                  dirty_rate

Let's compare it with the "expanded" form of (1):

                                                              write_bw
        balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
                                                              dirty_rate

So the difference lies in pos_ratio.

Believe it or not, it's exactly this seemingly superfluous use of
pos_ratio that makes (5) independent(*) of the position control.

Why? Look at (4), assume the system is in a state

- dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
- dirty position is not balanced, for example pos_ratio = 0.5

balance_dirty_pages() will be rate limiting each task at half the
balanced dirty rate, yielding a measured

        dirty_rate = write_bw / 2                               (6)

Put (6) into (4), we get

        balanced_rate_(i+1) = balanced_rate_(i) * 2
                            = (write_bw / N) * 2

That means, any position imbalance will lead to balanced_rate
estimation errors if we follow (4). Whereas if (1)/(5) is used, we
always get the right balanced dirty ratelimit value whether or not
(pos_ratio == 1.0), hence make the rate estimation independent(*) of
dirty position control.

(*) independent as in real values, despite the apparent dependence in the equations

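To make the difference concrete, here is a toy user space simulation
(illustrative only, not kernel code; the numbers and the idealized model
dirty_rate = N * executed_ratelimit are made up for the demo, and
pos_ratio is artificially pinned at 0.5, where in reality it would drift
as the dirty count moves):

	#include <stdio.h>

	int main(void)
	{
		double write_bw  = 100.0;		/* MB/s */
		double N         = 2.0;			/* number of dd tasks */
		double pos_ratio = 0.5;			/* held fixed for this demo */
		double rate4     = write_bw / N;	/* rule (4), starts balanced */
		double rate5     = write_bw / N;	/* rule (1)/(5) */
		int i;

		for (i = 0; i < 3; i++) {
			/* tasks actually execute rate * pos_ratio */
			double dirty_rate4 = N * rate4 * pos_ratio;
			double dirty_rate5 = N * rate5 * pos_ratio;

			rate4 = rate4 * write_bw / dirty_rate4;
			rate5 = rate5 * pos_ratio * write_bw / dirty_rate5;
			printf("step %d: rule(4) = %5.1f  rule(5) = %5.1f MB/s\n",
			       i, rate4, rate5);
		}
		return 0;
	}

Rule (4) settles at write_bw / (N * pos_ratio) = 100MB/s, twice the
balanced value, while rule (5) stays at write_bw / N = 50MB/s regardless
of the position error.
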
Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 14:15                       ` Wu Fengguang
  (?)
@ 2011-08-24 15:57                         ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-24 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If we call them feedback loops, then it's a series of independent feedback
> loops of depth 1.  Because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulting balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see a doubled dirty_rate and end up with the same balanced_rate.
> 
> In that manner, balance_rate_(i+1) does not really depend on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) 

At best this argument says it doesn't matter what we use, making
balance_rate_i an equally valid choice. However, I don't buy this; your
argument is broken: your CONSTANT_ratelimit breaks feedback, but then you
rely on the iterative form of feedback to finish your argument.

Consider:

	r_(i+1) = r_i * ratio_i

you say, r_i := C for all i, then by definition ratio_i must be 1 and
you've got nothing. The only way your conclusion can be right is by
allowing the proper iteration, otherwise we'll never reach the
equilibrium.

Now it is true you can introduce random perturbations in r_i at any
given point and still end up in equilibrium, such is the power of
iterative feedback, but that doesn't say you can do away with r_i. 
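
(A made-up illustration, under the idealized model dirty_rate = N * r_i:
with write_bw = 100 MB/s and N = 2, starting from r_0 = 300 MB/s the
iteration gives r_1 = r_0 * 100/(2*300) = 50 MB/s and stays there; if
r_1 is then perturbed to 80 MB/s, r_2 = 80 * 100/(2*80) = 50 MB/s again.)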

> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However,
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().

This seems to be where your argument goes bad: the actually executed
ratelimit is not important; the variance introduced by pos_ratio is
purely for the benefit of the dirty page count.

It doesn't matter for the balance_rate. Without pos_ratio, the dirty
page count would stay stable (ignoring all these oscillations and other
fun things), and therefore it is the balance_rate we should be using for
the iterative feedback.

> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.

No, because iterative feedback has the form: 

	new = old $op $feedback-term


> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, 

Agreed, it's not a derived expression but the originator of the dirty
page count control.

> but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.

That's where I disagree :-)

> > >   and task_ratelimit_200ms can be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense, it's not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

But that's exactly where you conflate the positional feedback with the
throughput feedback: the effective ratelimit includes the positional
feedback so that the dirty page count can move around, but that is
completely orthogonal to the throughput feedback, since the throughput
thing would leave the dirty count constant (ideal case again).

That is, yes the iterative feedback still works because you still got
your primary feedback in place, but the addition of pos_ratio in the
feedback loop is a pure perturbation and doesn't matter one whit.

> Then we try to estimate task_ratelimit_200ms by assuming all tasks
> have been executing the same CONSTANT ratelimit in
> balance_dirty_pages(). Hence we get
> 
>         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio

But this just cannot be true (and, as argued above, is completely
unnecessary). 

Consider the case where the dirty count is way below the setpoint but
the base ratelimit is pretty accurate. In that case we would start out
by creating very high task ratelimits such that the dirty count can
increase. Once we match the setpoint we go back to the base ratelimit.
The average pos_ratio over those 200ms would be >1, but since we're
right at the setpoint when we do the base ratelimit feedback we pick
exactly 1.

Anyway, its completely irrelevant.. :-)

> > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > >   200ms, then it gets the balanced rate from the dirty_rate feedback.
> > 
> > How can there not be a relation between balance_rate_(i+1) and
> > balance_rate_(i) ? 
> 
> In this manner: even though balance_rate_(i) is somehow used for
> calculating balance_rate_(i+1), the latter will evaluate to the same
> value given whatever balance_rate_(i).

But only if you allow for the iterative feedback to work; you absolutely
need that balance_rate_(i), and without that it's completely broken.


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24  0:12                           ` Wu Fengguang
@ 2011-08-24 16:12                             ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-24 16:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> > You somehow directly jump to  
> > 
> > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > 
> > without explaining why the following will not work:
> > 
> > 	balanced_rate_(i+1) = balance_rate_(i) * write_bw / dirty_rate
> 
> Thanks for asking that, it's probably the root of the confusion, so let
> me answer it standalone.
> 
> It's actually pretty simple to explain this equation:
> 
>                                                write_bw
>         balanced_rate = task_ratelimit_200ms * ----------       (1)
>                                                dirty_rate
> 
> If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> for the past 200ms, we are going to measure the overall bdi dirty rate
> 
>         dirty_rate = N * task_ratelimit_200ms                   (2)
> 
> put (2) into (1) we get
> 
>         balanced_rate = write_bw / N                            (3)
> 
> So equation (1) is the right estimation to get the desired target (3).
> 
> 
> As for
> 
>                                                   write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
>                                                   dirty_rate
> 
> Let's compare it with the "expanded" form of (1):
> 
>                                                               write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
>                                                               dirty_rate
> 
> So the difference lies in pos_ratio.
> 
> Believe it or not, it's exactly this seemingly superfluous use of
> pos_ratio that makes (5) independent(*) of the position control.
> 
> Why? Look at (4), assume the system is in a state
> 
> - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> - dirty position is not balanced, for example pos_ratio = 0.5
> 
> balance_dirty_pages() will be rate limiting each task at half the
> balanced dirty rate, yielding a measured
> 
>         dirty_rate = write_bw / 2                               (6)
> 
> Put (6) into (4), we get
> 
>         balanced_rate_(i+1) = balanced_rate_(i) * 2
>                             = (write_bw / N) * 2
> 
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> dirty position control.
> 
> (*) independent as in real values, despite the apparent dependence in the equations


The assumption here is that N is a constant. In the above case
pos_ratio would eventually end up at 1 and things would be good again. I
see your argument about oscillations, but I think you can introduce
similar effects by varying N.
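
(A made-up illustration of that: if N jumps from 1 to 2 halfway through
a 200ms window while tasks execute ratelimit r, the measured dirty_rate
is 1.5 * r, so (1) estimates write_bw/1.5, which is neither the old
balanced value write_bw nor the new one write_bw/2; a transient error
much like a pos_ratio != 1.)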

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24  0:12                           ` Wu Fengguang
@ 2011-08-24 18:00                             ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-24 18:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > You somehow directly jump to  
> > 
> > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > 
> > without explaining why the following will not work:
> > 
> > 	balanced_rate_(i+1) = balance_rate_(i) * write_bw / dirty_rate
> 
> Thanks for asking that, it's probably the root of the confusion, so let
> me answer it standalone.
> 
> It's actually pretty simple to explain this equation:
> 
>                                                write_bw
>         balanced_rate = task_ratelimit_200ms * ----------       (1)
>                                                dirty_rate
> 
> If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> for the past 200ms, we are going to measure the overall bdi dirty rate
> 
>         dirty_rate = N * task_ratelimit_200ms                   (2)
> 
> put (2) into (1) we get
> 
>         balanced_rate = write_bw / N                            (3)
> 
> So equation (1) is the right estimation to get the desired target (3).
> 
> 
> As for
> 
>                                                   write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
>                                                   dirty_rate
> 
> Let's compare it with the "expanded" form of (1):
> 
>                                                               write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
>                                                               dirty_rate
> 
> So the difference lies in pos_ratio.
> 
> Believe it or not, it's exactly this seemingly superfluous use of
> pos_ratio that makes (5) independent(*) of the position control.
> 
> Why? Look at (4), assume the system is in a state
> 
> - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> - dirty position is not balanced, for example pos_ratio = 0.5
> 
> balance_dirty_pages() will be rate limiting each task at half the
> balanced dirty rate, yielding a measured
> 
>         dirty_rate = write_bw / 2                               (6)
> 
> Put (6) into (4), we get
> 
>         balanced_rate_(i+1) = balanced_rate_(i) * 2
>                             = (write_bw / N) * 2
> 
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> dirty position control.
> 
> (*) independent as in real values, despite the apparent dependence in the equations

Ok, I think I am beginning to see your point. Let me just elaborate on
the example you gave.

Assume a system is completely balanced and a task is writing at 100MB/s
rate.

write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1

bdi->dirty_ratelimit = 100MB/s

Now another task starts dirtying the page cache on the same bdi. The
number of dirty pages should go up pretty fast, and likely the position
ratio feedback will kick in to reduce the dirtying rate (rate based
feedback does not kick in till the next 200ms, while pos_ratio feedback
is instantaneous). Assume the new pos_ratio is .5.

So new throttle rate for both the tasks is 50MB/s.

bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s

Now let's say 200ms have passed and the rate based feedback is reevaluated.

						      write_bw	
bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
						      dirty_bw

bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s

Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2, but
that did not happen. The reason is that there are two feedback control
loops, and the pos_ratio loop reacts to imbalances much more quickly.
Because that loop has already reacted to the imbalance and reduced the
dirtying rate of the tasks, the rate based loop does not try to adjust
anything and thinks everything is just fine.

Things are fine in the sense that dirty_rate == write_bw still holds,
but the system is not balanced in terms of the number of dirty pages,
and pos_ratio=.5

So you are trying to make one feedback loop aware of the second loop, so
that if the second loop is unbalanced, the first loop reacts to that as
well and does not just look at dirty_rate and write_bw. So refining the
new balanced rate by pos_ratio helps.
						      write_bw	
bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
						      dirty_bw

Now if global dirty pages are imbalanced, the balanced rate will still go
down despite the fact that dirty_bw == write_bw. This will lead to a
further reduction in the task dirty rate, which in turn will lead to a
reduced number of dirty pages and should eventually lead to pos_ratio=1.
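
(Finishing the made-up numbers: the first refined update gives
100 * (100/100) * 0.5 = 50MB/s. Once the tasks run near 50MB/s each,
total dirtying falls below write_bw, the dirty count drops and pos_ratio
drifts back toward 1; at that point 50 * (100/100) * 1 = 50MB/s holds
steady at write_bw/N.)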

A related question, though I should have asked you this long back: how
does throttling based on rate help? Why could we not just work with two
pos_ratios, one being the global position ratio and the other the bdi
position ratio, and then throttle tasks gradually to achieve smooth
throttling behavior? IOW, what property does rate provide which is not
available just by looking at per bdi dirty pages? Can't we come up with
a bdi setpoint and limit the way you have done for the global setpoint
and throttle tasks accordingly?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00                             ` Vivek Goyal
@ 2011-08-25  3:19                               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-25  3:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 02:00:58AM +0800, Vivek Goyal wrote:
> On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why the following will not work:
> > > 
> > > 	balanced_rate_(i+1) = balance_rate_(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of the confusion, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > Believe it or not, it's exactly this seemingly superfluous use of
> > pos_ratio that makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > balance_dirty_pages() will be rate limiting each task at half the
> > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> > 
> > (*) independent as in real values, despite the apparent dependence in the equations
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.

Thank you very much :)

> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another task starts dirtying the page cache on the same bdi. The
> number of dirty pages should go up pretty fast, and likely the position
> ratio feedback will kick in to reduce the dirtying rate (rate based
> feedback does not kick in till the next 200ms, while pos_ratio feedback
> is instantaneous).

That's right. There must be some instantaneous feedback to react to
fast workload changes. With pos_ratio providing this capability, the
estimated balanced rate can take time to follow.

Note that pos_ratio by itself is enough to limit dirty pages within
the [freerun, limit] control scope. The cost of a (temporarily) large
error in the balanced rate is that task_ratelimit will fluctuate much
more, because pos_ratio will depart from 1.0 (to the point where it can
fully compensate for the rate error) and dirty pages will approach
@freerun or @limit, where the slope of pos_ratio becomes sharp.

The correct estimation of the balanced rate serves to drive pos_ratio
back to 1.0, where its slope is flattest.

> Assume new pos_ratio is .5
> 
> So the new throttle rate for both tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (the rate feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now let's say 200ms have passed and the rate-based feedback is reevaluated.
> 
> 						        write_bw	
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
> 						        dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2, but
> that did not happen. The reason is that there are two feedback control
> loops and the pos_ratio loop reacts to imbalances much more quickly.
> Because that loop has already reacted to the imbalance and reduced the
> dirtying rate of the tasks, the rate-based loop does not try to adjust
> anything and thinks everything is just fine.

That's right.

> Things are fine in the sense that dirty_rate == write_bw still holds, but
> the system is not balanced in terms of the number of dirty pages, and
> pos_ratio=.5

Yes. The bad thing is, if the above equation (of pure rate feedback)
is used, the system is going to remain in that position-imbalanced
state forever, which is bad for the smoothness of task_ratelimit.

> So you are trying to make one feedback loop aware of the second loop so
> that if the second loop is unbalanced, the first loop reacts to that as
> well and does not just look at dirty_rate and write_bw. So refining the
> new balanced rate by pos_ratio helps.
> 						      write_bw	
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> 						      dirty_bw
> 
> Now if global dirty pages are imbalanced, the balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to further
> reduction in the task dirty rate, which in turn will lead to a reduced
> number of dirty pages and should eventually lead to pos_ratio=1.

Right, that's a good alternative viewpoint to the below one.

  						  write_bw	
  bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
  						  dirty_bw

(1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
(2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0

> A related question, though, that I should have asked you long back: how
> does throttling based on rate help? Why could we not just work with two
> pos_ratios, one the global position ratio and the other the bdi position
> ratio, and then throttle the task gradually to achieve smooth throttling
> behavior? IOW, what property does rate provide which is not available
> just by looking at per-bdi dirty pages? Can't we come up with a bdi
> setpoint and limit the way you have done for the global setpoint and
> throttle tasks accordingly?

Good question. If we have no idea of the balanced rate at all, but
still want to limit dirty pages within the range [freerun, limit],
all we can do is to throttle the task at eg. 1TB/s at @freerun and
0 at @limit. Then you get a really sharp control line which will make
task_ratelimit fluctuate like mad...

So the balanced rate estimation is the key to get smooth task_ratelimit,
while pos_ratio is the ultimate guarantee for the dirty pages range.
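
To put toy numbers on that (the freerun/limit values and the 1TB/s cap
are assumed for illustration, not taken from the kernel):

	#include <stdio.h>

	int main(void)
	{
		double freerun = 100000, limit = 120000;	/* pages, assumed */
		double max_rate = 1e6;				/* "1TB/s", in MB/s */
		double dirty;

		/* position-only control line: 1TB/s at @freerun, 0 at @limit */
		for (dirty = 110000; dirty <= 111000; dirty += 500) {
			double task_ratelimit = max_rate *
					(limit - dirty) / (limit - freerun);
			printf("dirty=%6.0f => task_ratelimit=%8.0f MB/s\n",
			       dirty, task_ratelimit);
		}
		/* a mere 1000-page drift swings the ratelimit by 50GB/s */
		return 0;
	}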

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 15:57                         ` Peter Zijlstra
@ 2011-08-25  5:30                           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-25  5:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 24, 2011 at 11:57:39PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > > >   well, in this concept: the balanced_rate formula inherently does not
> > > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > > >   based on the ratelimit executed for the past 200ms:
> > > > 
> > > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > > 
> > > Ok, this is where it all goes funny..
> > > 
> > > So if you want completely separated feedback loops I would expect
> > 
> > If we call it feedback loops, then it's a series of independent feedback
> > loops of depth 1.  Because each balanced_rate is a fresh estimation
> > dependent solely on
> > 
> > - writeout bandwidth
> > - N, the number of dd tasks
> > 
> > in the past 200ms.
> > 
> > As long as a CONSTANT ratelimit (whatever value it is) is executed in
> > the past 200ms, we can get the same balanced_rate.
> > 
> >         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> > 
> > The resulting balanced_rate is independent of how large the CONSTANT
> > ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> > we'll see a doubled dirty_rate and end up with the same balanced_rate. 
> > 
> > In that manner, balance_rate_(i+1) does not really depend on the
> > value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> > to get the same balance_rate_(i+1) 
> 
> At best this argument says it doesn't matter what we use, making
> balance_rate_i an equally valid choice. However I don't buy this, your
> argument is broken, your CONSTANT_ratelimit breaks feedback but then you
> rely on the iterative form of feedback to finish your argument.
> 
> Consider:
> 
> 	r_(i+1) = r_i * ratio_i
> 
> you say, r_i := C for all i, then by definition ratio_i must be 1 and
> you've got nothing. The only way your conclusion can be right is by
> allowing the proper iteration, otherwise we'll never reach the
> equilibrium.
> 
> Now it is true you can introduce random perturbations in r_i at any
> given point and still end up in equilibrium, such is the power of
> iterative feedback, but that doesn't say you can do away with r_i. 

Sure there are always r_i.

Sorry, what I mean by CONSTANT_ratelimit is that it remains CONSTANT
_inside_ each 200ms window. There will be a series of different CONSTANT
values, one per 200ms window, roughly equal to (r_i * pos_ratio_i).

> > > something like:
> > > 
> > > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > > 
> > > The former is a complete feedback loop, expressing the new value in the
> > > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > > causing the balance_rate to drop increasing the dirty_rate, and vice
> > > versa.
> > 
> > In principle, the bw_ratio works that way. However, balance_rate_(i)
> > is not the exact _executed_ ratelimit in balance_dirty_pages().
> 
> This seems to be where your argument goes bad, the actually executed
> ratelimit is not important, the variance introduced by pos_ratio is
> purely for the benefit of the dirty page count. 
> 
> It doesn't matter for the balance_rate. Without pos_ratio, the dirty
> page count would stay stable (ignoring all these oscillations and other
> fun things), and therefore it is the balance_rate we should be using for
> the iterative feedback.

Nope. The dirty page count can always stay stable somewhere (but not
necessarily at the setpoint) purely through the pos_ratio feedback, as
illustrated by Vivek's example.

But that's not the balance state we want. Although the pos_ratio
feedback all by itself is strong enough to keep (dirty_rate == write_bw),
the ideal state is to achieve pos_ratio=1 and eliminate its feedback
error as much as possible, so as to get a smooth task_ratelimit.

We may take this viewpoint: a "successful" balance_rate should help
keep pos_ratio around 1.0 in the long term.
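
Here is a toy model of Vivek's scenario (a sketch only: an assumed
linear pos_ratio instead of the kernel control line, and unit-free
numbers). Pin balance_rate at twice the balanced value and let
pos_ratio alone do the balancing; the dirty count settles, but away
from the setpoint:

	#include <stdio.h>

	int main(void)
	{
		double setpoint = 100, limit = 200;	/* dirty pages, made up */
		double write_bw = 100.0;
		int N = 2, i;
		double rate = 2 * write_bw / N;	/* stuck at 2x the balanced rate */
		double dirty = setpoint, dt = 0.01;

		for (i = 0; i < 10000; i++) {
			/* assumed linear pos_ratio: 1.0 at setpoint, 0 at limit */
			double pos_ratio =
				1.0 - (dirty - setpoint) / (limit - setpoint);
			double dirty_rate = N * rate * pos_ratio;

			dirty += (dirty_rate - write_bw) * dt;
		}
		/* settles where pos_ratio = 0.5: dirty = 150, off the setpoint */
		printf("dirty = %.1f, setpoint = %.0f\n", dirty, setpoint);
		return 0;
	}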

> > > (*) which is the form I expected and why I thought your primary feedback
> > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
> >  
> > Because the executed ratelimit was rate_(i) * pos_ratio.
> 
> No, because iterative feedback has the form: 
> 
> 	new = old $op $feedback-term
> 

The problem is, the pos_ratio feedback will jump in and prematurely make
$feedback-term = 1, thus rendering the pure rate feedback weak/useless.

> > > Then when you use the balance_rate to actually throttle tasks you apply
> > > your secondary control steering the dirty page count, yielding:
> > > 
> > > 	task_rate = balance_rate * pos_ratio
> > 
> > Right. Note the above formula is not a derived one, 
> 
> Agreed, its not a derived expression but the originator of the dirty
> page count control.
> 
> > but an original
> > one that later leads to pos_ratio showing up in the calculation of
> > balanced_rate.
> 
> That's where I disagree :-)
> 
> > > >   and task_ratelimit_200ms happens to be estimable from
> > > > 
> > > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > > 
> > > >   We may alternatively record every task_ratelimit executed in the
> > > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > > >   way we take the "superfluous" pos_ratio out of sight :) 
> > > 
> > > Right, so I'm not at all sure that makes sense, its not immediately
> > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > > all. 
> > 
> > task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> > by balance_dirty_pages(). So this is an original formula:
> > 
> >         task_ratelimit = balance_rate * pos_ratio
> > 
> > task_ratelimit_200ms is also used as an original data source in
> > 
> >         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 
> But that's exactly where you conflate the positional feedback with the
> throughput feedback, the effective ratelimit includes the positional
> feedback so that the dirty page count can move around, but that is
> completely orthogonal to the throughput feedback since the throughput
> thing would leave the dirty count constant (ideal case again).
> 
> That is, yes the iterative feedback still works because you still got
> your primary feedback in place, but the addition of pos_ratio in the
> feedback loop is a pure perturbation and doesn't matter one whit.

The problem is that pure rate feedback is not possible because
pos_ratio also takes part in altering the task rate...

> > Then we try to estimate task_ratelimit_200ms by assuming all tasks
> > have been executing the same CONSTANT ratelimit in
> > balance_dirty_pages(). Hence we get
> > 
> >         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio
> 
> But this just cannot be true (and, as argued above, is completely
> unnecessary). 
> 
> Consider the case where the dirty count is way below the setpoint but
> the base ratelimit is pretty accurate. In that case we would start out
> by creating very low task ratelimits such that the dirty count can

s/low/high/

> increase. Once we match the setpoint we go back to the base ratelimit.
> The average over those 200ms would be <1, but since we're right at the
> setpoint when we do the base ratelimit feedback we pick exactly 1. 

Yeah, that's the kind of error introduced by the CONSTANT-ratelimit
assumption, which could be pretty large on small-memory boxes. Given
that pos_ratio will fluctuate more anyway when memory, and hence the
dirty control scope, is small, such rate estimation errors are tolerable.

> Anyway, its completely irrelevant.. :-)

Yeah, that's one step further to discuss all kinds of possible errors
on top of the basic theory :)

> > > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > > >   200ms, then it gets the balanced rate from the dirty_rate feedback.
> > > 
> > > How can there not be a relation between balance_rate_(i+1) and
> > > balance_rate_(i) ? 
> > 
> > In this manner: even though balance_rate_(i) is somehow used for
> > calculating balance_rate_(i+1), the latter will evaluate to the same
> > value given whatever balance_rate_(i).
> 
> But only if you allow for the iterative feedback to work, you absolutely
> need that balance_rate_(i), without that its completely broken.

Agreed.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25  3:19                               ` Wu Fengguang
@ 2011-08-25 22:20                                 ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-25 22:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:

[..]
> > So you are trying to make one feedback loop aware of the second loop so
> > that if the second loop is unbalanced, the first loop reacts to that as
> > well and does not just look at dirty_rate and write_bw. So refining the
> > new balanced rate by pos_ratio helps.
> > 						      write_bw	
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> > 						      dirty_bw
> > 
> > Now if global dirty pages are imbalanced, the balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to further
> > reduction in the task dirty rate, which in turn will lead to a reduced
> > number of dirty pages and should eventually lead to pos_ratio=1.
> 
> Right, that's a good alternative viewpoint to the below one.
> 
>   						  write_bw	
>   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
>   						  dirty_bw
> 
> (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0

Personally I found it much easier to understand the other representation,
once you have come up with the equation

balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw

Can you please put a few lines of comments explaining why the above
alone is not sufficient and we need to take pos_ratio into account as
well, to keep the number of dirty pages in check, and then go on to

balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw * pos_ratio

This maintains the continuity of the explanation and explains why we
are deviating from the theory discussed so far.

> 
> > A related question, though, that I should have asked you long back: how
> > does throttling based on rate help? Why could we not just work with two
> > pos_ratios, one the global position ratio and the other the bdi position
> > ratio, and then throttle the task gradually to achieve smooth throttling
> > behavior? IOW, what property does rate provide which is not available
> > just by looking at per-bdi dirty pages? Can't we come up with a bdi
> > setpoint and limit the way you have done for the global setpoint and
> > throttle tasks accordingly?
> 
> Good question. If we have no idea of the balanced rate at all, but
> still want to limit dirty pages within the range [freerun, limit],
> all we can do is to throttle the task at eg. 1TB/s at @freerun and
> 0 at @limit. Then you get a really sharp control line which will make
> task_ratelimit fluctuate like mad...
> 
> So the balanced rate estimation is the key to get smooth task_ratelimit,
> while pos_ratio is the ultimate guarantee for the dirty pages range.

Ok, that makes sense. By keeping an estimate of the rate at which the
bdi can write, our throttling range goes down, say 0 to 300MB/s instead
of 0 to 1TB/s, and that can lead to smoother behavior.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 16:12                             ` Peter Zijlstra
@ 2011-08-26  0:18                               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why the following will not work.
> > > 
> > > 	balanced_rate_(i+1) = balanced_rate_(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of confusions, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > Believe it or not, it's exactly the seemingly superfluous use of pos_ratio that
> > makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > balance_dirty_pages() will be rate limiting each task at half the
> > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> > 
> > (*) independent in terms of the real values, not of the apparent relations in the equation
> 
> 
> The assumption here is that N is a constant... in the above case
> pos_ratio would eventually end up at 1 and things would be good again. I
> see your argument about oscillations, but I think you can introduce
> similar effects by varying N.

Yeah, it's very possible for N to change over time, in which case
balanced_rate will adapt to the new N in a similar way.
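
A toy sketch of that adaptation (made-up numbers, pos_ratio pinned at
1.0 for simplicity): when two more dd tasks show up, the next 200ms
estimation snaps to the new write_bw / N:

	#include <stdio.h>

	int main(void)
	{
		double write_bw = 100.0, pos_ratio = 1.0;
		double rate = 50.0;		/* balanced for N = 2 */
		int i;

		for (i = 0; i < 6; i++) {
			int N = (i < 3) ? 2 : 4;	/* two more dd tasks appear */
			double task_ratelimit = rate * pos_ratio;
			double dirty_rate = N * task_ratelimit;

			rate = task_ratelimit * write_bw / dirty_rate;
			printf("period %d: N=%d rate=%5.1f\n", i + 1, N, rate);
		}
		/* the estimate snaps to write_bw / N within one period */
		return 0;
	}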

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25 22:20                                 ` Vivek Goyal
@ 2011-08-26  1:56                                   ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26  1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote:
> On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > So you are trying to make one feedback loop aware of the second loop so
> > > that if the second loop is unbalanced, the first loop reacts to that as
> > > well and does not just look at dirty_rate and write_bw. So refining the
> > > new balanced rate by pos_ratio helps.
> > > 						      write_bw	
> > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> > > 						      dirty_bw
> > > 
> > > Now if global dirty pages are imbalanced, the balanced rate will still go
> > > down despite the fact that dirty_bw == write_bw. This will lead to further
> > > reduction in the task dirty rate, which in turn will lead to a reduced
> > > number of dirty pages and should eventually lead to pos_ratio=1.
> > 
> > Right, that's a good alternative viewpoint to the below one.
> > 
> >   						  write_bw	
> >   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
> >   						  dirty_bw
> > 
> > (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
> 
> Personally I found it much easier to understand the other representation,
> once you have come up with the equation
> 
> balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw
> 
> Can you please put a few lines of comments explaining why the above
> alone is not sufficient and we need to take pos_ratio into account as
> well, to keep the number of dirty pages in check, and then go on to
> 
> balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw * pos_ratio
> 
> This maintains the continuity of the explanation and explains why we
> are deviating from the theory discussed so far.

Good point. Here is the commented code:

        /*
         * task_ratelimit reflects each dd's dirty rate for the past 200ms.
         */
        task_ratelimit = (u64)dirty_ratelimit *
                                        pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * A linear estimation of the "balanced" throttle rate. The theory is,
         * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
         * dirty_rate will be measured to be (N * task_ratelimit). So the below
         * formula will yield the balanced rate limit (write_bw / N).
         *
         * Note that the expanded form is not a pure rate feedback:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
         * but also takes pos_ratio into account:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
         *
         * (1) is not realistic because pos_ratio also takes part in balancing
         * the dirty rate.  Consider the state
         *      pos_ratio = 0.5                                              (3)
         *      rate = 2 * (write_bw / N)                                    (4)
         * If (1) is used, it will get stuck in that state! Because each dd will be
         * throttled at
         *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
         * yielding
         *      dirty_rate = N * task_ratelimit = write_bw                   (6)
         * put (6) into (1) we get
         *      rate_(i+1) = rate_(i)                                        (7)
         *
         * So we end up using (2) to always keep
         *      rate_(i+1) ~= (write_bw / N)                                 (8)
         * regardless of the value of pos_ratio. As long as (8) is satisfied,
         * pos_ratio is able to drive itself to 1.0, which is not only where
         * the dirty count meets the setpoint, but also where the slope of
         * pos_ratio is flattest and hence task_ratelimit fluctuates least.
         */
        balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
                                           dirty_rate | 1);
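
(For illustration only: a rough user-space rendering of the two
statements above, assuming RATELIMIT_CALC_SHIFT == 10 as elsewhere in
this series and substituting plain 64-bit division for div_u64(); the
numbers replay Vivek's pos_ratio = 0.5 example.)

	#include <stdio.h>

	#define RATELIMIT_CALC_SHIFT	10

	int main(void)
	{
		unsigned long long dirty_ratelimit = 25600;	/* pages/s, assumed */
		unsigned long long pos_ratio = 512;	/* 0.5, as 1.0 == 1 << 10 */
		unsigned long long write_bw = 25600;	/* pages/s */
		unsigned long long dirty_rate = 25600;	/* pages/s, == write_bw */
		unsigned long long task_ratelimit, balanced_dirty_ratelimit;

		task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT;

		/* the "| 1" keeps the divisor nonzero when nothing was dirtied */
		balanced_dirty_ratelimit = task_ratelimit * write_bw /
						(dirty_rate | 1);

		/* prints 12800 and 12799: ~write_bw / N, as in Vivek's example */
		printf("%llu %llu\n", task_ratelimit, balanced_dirty_ratelimit);
		return 0;
	}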

> > 
> > > A related question, though, that I should have asked you long back: how
> > > does throttling based on rate help? Why could we not just work with two
> > > pos_ratios, one the global position ratio and the other the bdi position
> > > ratio, and then throttle the task gradually to achieve smooth throttling
> > > behavior? IOW, what property does rate provide which is not available
> > > just by looking at per-bdi dirty pages? Can't we come up with a bdi
> > > setpoint and limit the way you have done for the global setpoint and
> > > throttle tasks accordingly?
> > 
> > Good question. If we have no idea of the balanced rate at all, but
> > still want to limit dirty pages within the range [freerun, limit],
> > all we can do is to throttle the task at eg. 1TB/s at @freerun and
> > 0 at @limit. Then you get a really sharp control line which will make
> > task_ratelimit fluctuate like mad...
> > 
> > So the balanced rate estimation is the key to get smooth task_ratelimit,
> > while pos_ratio is the ultimate guarantee for the dirty pages range.
> 
> Ok, that makes sense. By keeping an estimate of the rate at which the
> bdi can write, our throttling range goes down, say 0 to 300MB/s instead
> of 0 to 1TB/s, and that can lead to smoother behavior.

Yeah exactly, and even better, we can make the slope much flatter around
the setpoint to achieve excellent smoothness in the stable state :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-26  1:56                                   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26  1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote:
> On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > So you are trying to make one feedback loop aware of second loop so that
> > > if second loop is unbalanced, first loop reacts to that as well and not
> > > just look at dirty_rate and write_bw. So refining new balanced rate by
> > > pos_ratio helps.
> > > 						      write_bw	
> > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> > > 						      dirty_bw
> > > 
> > > Now if global dirty pages are imbalanced, balanced rate will still go
> > > down despite the fact that dirty_bw == write_bw. This will lead to
> > > further reduction in task dirty rate. Which in turn will lead to reduced
> > > number of dirty rate and should eventually lead to pos_ratio=1.
> > 
> > Right, that's a good alternative viewpoint to the below one.
> > 
> >   						  write_bw	
> >   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
> >   						  dirty_bw
> > 
> > (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
> 
> Personally I found it much easier to understand the other representation.
> Once you have come up with equation.
> 
> balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw
> 
> Can you please put few lines of comments to explain that why above
> alone is not sufficient and we need to take pos_ratio also in to
> account to keep number of dirty pages in check. And then go onto
> 
> balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw * pos_ratio
> 
> This kind of maintains the continuity of explanation and explains
> that why are we deviating from the theory we discussed so far.

Good point. Here is the commented code:

        /*
         * task_ratelimit reflects each dd's dirty rate for the past 200ms.
         */
        task_ratelimit = (u64)dirty_ratelimit *
                                        pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * A linear estimation of the "balanced" throttle rate. The theory is,
         * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
         * dirty_rate will be measured to be (N * task_ratelimit). So the below
         * formula will yield the balanced rate limit (write_bw / N).
         *
         * Note that the expanded form is not a pure rate feedback:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
         * but also takes pos_ratio into account:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
         *
         * (1) is not realistic because pos_ratio also takes part in balancing
         * the dirty rate.  Consider the state
         *      pos_ratio = 0.5                                              (3)
         *      rate = 2 * (write_bw / N)                                    (4)
         * If (1) is used, it will stuck in that state! Because each dd will be
         * throttled at
         *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
         * yielding
         *      dirty_rate = N * task_ratelimit = write_bw                   (6)
         * put (6) into (1) we get
         *      rate_(i+1) = rate_(i)                                        (7)
         *
         * So we end up using (2) to always keep
         *      rate_(i+1) ~= (write_bw / N)                                 (8)
         * regardless of the value of pos_ratio. As long as (8) is satisfied,
         * pos_ratio is able to drive itself to 1.0, which is not only where
         * the dirty count meet the setpoint, but also where the slope of
         * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
         */
        balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
                                           dirty_rate | 1);

> > 
> > > A related question though I should have asked you this long back. How does
> > > throttling based on rate helps. Why we could not just work with two
> > > pos_ratios. One is gloabl postion ratio and other is bdi position ratio.
> > > And then throttle task gradually to achieve smooth throttling behavior.
> > > IOW, what property does rate provide which is not available just by
> > > looking at per bdi dirty pages. Can't we come up with bdi setpoint and
> > > limit the way you have done for gloabl setpoint and throttle tasks
> > > accordingly?
> > 
> > Good question. If we have no idea of the balanced rate at all, but
> > still want to limit dirty pages within the range [freerun, limit],
> > all we can do is to throttle the task at eg. 1TB/s at @freerun and
> > 0 at @limit. Then you get a really sharp control line which will make
> > task_ratelimit fluctuate like mad...
> > 
> > So the balanced rate estimation is the key to get smooth task_ratelimit,
> > while pos_ratio is the ultimate guarantee for the dirty pages range.
> 
> Ok, that makes sense. By keeping an estimation of rate at which bdi
> can write, our range of throttling goes down. Say 0 to 300MB/s instead
> of 0 to 1TB/sec and that can lead to a more smooth behavior.

Yeah exactly, and even better, we can make the slope much flatter
around the setpoint to achieve excellent smoothness in the stable state :)
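
As a back-of-the-envelope illustration (all numbers made up): a control
line dropping from 1TB/s at @freerun to 0 at @limit over, say, a 1GB
dirty range has a slope of ~1GB/s per MB, so a mere 1MB wobble in the
dirty count swings task_ratelimit by ~1GB/s. With the balanced rate
estimated at write_bw / N = 100MB/s and pos_ratio held on a flat slope
around 1.0, the same wobble moves task_ratelimit by only a small
fraction of those 100MB/s.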

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  1:56                                   ` Wu Fengguang
@ 2011-08-26  8:56                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-26  8:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote:
>         /*
>          * A linear estimation of the "balanced" throttle rate. The theory is,
>          * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
>          * dirty_rate will be measured to be (N * task_ratelimit). So the below
>          * formula will yield the balanced rate limit (write_bw / N).
>          *
>          * Note that the expanded form is not a pure rate feedback:
>          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
>          * but also takes pos_ratio into account:
>          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
>          *
>          * (1) is not realistic because pos_ratio also takes part in balancing
>          * the dirty rate.  Consider the state
>          *      pos_ratio = 0.5                                              (3)
>          *      rate = 2 * (write_bw / N)                                    (4)
> >          * If (1) is used, it will get stuck in that state! Because each dd will be
>          * throttled at
>          *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
>          * yielding
>          *      dirty_rate = N * task_ratelimit = write_bw                   (6)
>          * put (6) into (1) we get
>          *      rate_(i+1) = rate_(i)                                        (7)
>          *
>          * So we end up using (2) to always keep
>          *      rate_(i+1) ~= (write_bw / N)                                 (8)
>          * regardless of the value of pos_ratio. As long as (8) is satisfied,
>          * pos_ratio is able to drive itself to 1.0, which is not only where
> >          * the dirty count meets the setpoint, but also where the slope of
> >          * pos_ratio is flattest and hence task_ratelimit fluctuates the least.
>          */ 

I'm still not buying this: it rests on the massive assumption that N
is a constant. Without that assumption you get the same kind of thing
you get from not adding pos_ratio to the feedback term.

Also, I've yet to see what harm it does if you leave it out; all the
feedback loops should stabilize just fine.

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  0:18                               ` Wu Fengguang
@ 2011-08-26  9:04                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-26  9:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote:
> On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:

> > > Put (6) into (4), we get
> > > 
> > >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> > >                             = (write_bw / N) * 2
> > > 
> > > That means, any position imbalance will lead to balanced_rate
> > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > always get the right balanced dirty ratelimit value whether or not
> > > (pos_ratio == 1.0), hence making the rate estimation independent(*) of
> > > dirty position control.
> > > 
> > > (*) independent in terms of real values, not the apparent relations in the equations
> > 
> > 
> > The assumption here is that N is a constant.. in the above case
> > pos_ratio would eventually end up at 1 and things would be good again. I
> > see your argument about oscillations, but I think you can introduce
> > similar effects by varying N.
> 
> Yeah, it's very possible for N to change over time, in which case
> balanced_rate will adapt to new N in similar way.

Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't
you accept that for pos_ratio, but don't mind it for N?



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  8:56                                     ` Peter Zijlstra
@ 2011-08-26  9:53                                       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 04:56:11PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote:
> >         /*
> >          * A linear estimation of the "balanced" throttle rate. The theory is,
> >          * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
> >          * dirty_rate will be measured to be (N * task_ratelimit). So the below
> >          * formula will yield the balanced rate limit (write_bw / N).
> >          *
> >          * Note that the expanded form is not a pure rate feedback:
> >          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
> >          * but also takes pos_ratio into account:
> >          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
> >          *
> >          * (1) is not realistic because pos_ratio also takes part in balancing
> >          * the dirty rate.  Consider the state
> >          *      pos_ratio = 0.5                                              (3)
> >          *      rate = 2 * (write_bw / N)                                    (4)
> >          * If (1) is used, it will get stuck in that state! Because each dd will be
> >          * throttled at
> >          *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
> >          * yielding
> >          *      dirty_rate = N * task_ratelimit = write_bw                   (6)
> >          * put (6) into (1) we get
> >          *      rate_(i+1) = rate_(i)                                        (7)
> >          *
> >          * So we end up using (2) to always keep
> >          *      rate_(i+1) ~= (write_bw / N)                                 (8)
> >          * regardless of the value of pos_ratio. As long as (8) is satisfied,
> >          * pos_ratio is able to drive itself to 1.0, which is not only where
> >          * the dirty count meets the setpoint, but also where the slope of
> >          * pos_ratio is flattest and hence task_ratelimit fluctuates the least.
> >          */ 
> 
> I'm still not buying this: it rests on the massive assumption that N
> is a constant. Without that assumption you get the same kind of thing
> you get from not adding pos_ratio to the feedback term.

The reasoning in (3)-(7) actually assumes both N and write_bw to be
constant. It's documenting a stuck state...

> Also, I've yet to see what harm it does if you leave it out; all the
> feedback loops should stabilize just fine.

That's a good question. It should be trivial to try out equation (1)
and see how it works out in practice. Let me collect some figures...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  9:04                                 ` Peter Zijlstra
@ 2011-08-26 10:04                                   ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 05:04:29PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote:
> > On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> > > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> 
> > > > Put (6) into (4), we get
> > > > 
> > > >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> > > >                             = (write_bw / N) * 2
> > > > 
> > > > That means, any position imbalance will lead to balanced_rate
> > > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > > always get the right balanced dirty ratelimit value whether or not
> > > > (pos_ratio == 1.0), hence making the rate estimation independent(*) of
> > > > dirty position control.
> > > > 
> > > > (*) independent in terms of real values, not the apparent relations in the equations
> > > 
> > > 
> > > The assumption here is that N is a constant.. in the above case
> > > pos_ratio would eventually end up at 1 and things would be good again. I
> > > see your argument about oscillations, but I think you can introduce
> > > similar effects by varying N.
> > 
> > Yeah, it's very possible for N to change over time, in which case
> > balanced_rate will adapt to new N in similar way.
> 
> Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't
> you accept that for pos_ratio, but don't mind it for N?

Sorry I'm now feeling lost... anyway it's convenient to try out the
pure rate feedback. And the test case includes exactly this kind of
sudden change of N.

I'm now running the tests with this trivial patch:

--- linux-next.orig/mm/page-writeback.c	2011-08-26 17:58:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 17:59:06.000000000 +0800
@@ -800,7 +800,7 @@ static void bdi_update_dirty_ratelimit(s
 	 * the dirty count meet the setpoint, but also where the slope of
 	 * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
 	 */
-	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
+	balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw,
 					   dirty_rate | 1);
 
 	/*
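
(For clarity, using the definitions from the comment above: since
task_ratelimit = dirty_ratelimit * pos_ratio, this one-liner turns the
update into the pure rate feedback

	rate_(i+1) = rate_(i) * (write_bw / dirty_rate)

i.e. exactly equation (1), with the pos_ratio factor of (2) dropped.)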

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:04                                   ` Wu Fengguang
@ 2011-08-26 10:42                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-26 10:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote:
> Sorry I'm now feeling lost...

hehe welcome to my world ;-)

Seriously though, I appreciate all the effort you put in trying to
explain things. I feel I do understand things now, although I might not
completely agree with them quite yet ;-)

I'll go read the v9 patch-set you sent out and look at some of the
details (such as pos_ratio being comprised of both global and bdi
limits, which so far has been somewhat glossed over).



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:42                                     ` Peter Zijlstra
@ 2011-08-26 10:52                                       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:42:22PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote:
> > Sorry I'm now feeling lost...
> 
> hehe welcome to my world ;-)

Yeah, so sorry...

> Seriously though, I appreciate all the effort you put in trying to
> explain things. I feel I do understand things now, although I might not
> completely agree with them quite yet ;-)

Thank you :)

> I'll go read the v9 patch-set you sent out and look at some of the
> details (such as pos_ratio being comprised of both global and bdi
> limits, which so far has been somewhat glossed over).

Hold on please! I'll immediately post a v10 with all the comment updates.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:04                                   ` Wu Fengguang
  (?)
  (?)
@ 2011-08-26 11:26                                   ` Wu Fengguang
  2011-08-26 12:11                                       ` Peter Zijlstra
  -1 siblings, 1 reply; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Peter,

Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
a "disturber" dd read task during roughly 120-130s.

(1) balance_dirty_pages-pages.png

This is the output of the original patchset. Here the "balanced
ratelimit" dots are mostly accurate except when near @freerun or @limit.

(2) balance_dirty_pages-pages_pure-rate-feedback.png

do this change:
  -	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
  +	balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw,
   					   dirty_rate | 1);

Here the "balanced ratelimit" dots goto the opposite direction
comparing to "pos ratelimit", which is the expected result discussed
in the other email. Then the system got stuck in unbalanced dirty
position.  It's slowly moving towards the setpoint thanks to the
dirty_ratelimit update policy: it only updates dirty_ratelimit when
balanced_dirty_ratelimit fluctuates to the same side of
task_ratelimit, hence introduced some systematical "errors" in the
right direction ;)

(3) balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png

further remove the "do conservative bdi->dirty_ratelimit updates"
feature by replacing its update policy with a direct assignment:

        bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);

This checks whether dirty_ratelimit can still go back to the balance
point without the help of the dirty_ratelimit update policy. To my
surprise, dirty_ratelimit jumps to a HUGE singular value and shows no
sign of coming back to normal...
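
For reference, a rough C sketch (hypothetical, not the exact kernel
code) of the conservative update policy mentioned in (2): dirty_ratelimit
is only moved when the rate feedback and the position feedback agree on
the direction, so position-induced fluctuations cannot drag the rate
estimate around.

	/* sketch only; task_ratelimit = dirty_ratelimit * pos_ratio */
	if (balanced_dirty_ratelimit < dirty_ratelimit &&
	    task_ratelimit < dirty_ratelimit)
		dirty_ratelimit = max(balanced_dirty_ratelimit,
				      task_ratelimit);
	else if (balanced_dirty_ratelimit > dirty_ratelimit &&
		 task_ratelimit > dirty_ratelimit)
		dirty_ratelimit = min(balanced_dirty_ratelimit,
				      task_ratelimit);

Figure (3) shows what happens when this guard is replaced by the direct
assignment above.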

In summary, the original patchset shows the best behavior :)

Thanks,
Fengguang

[-- Attached figures: balance_dirty_pages-pages.png,
    balance_dirty_pages-pages_pure-rate-feedback.png,
    balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png --]

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 11:26                                   ` Wu Fengguang
@ 2011-08-26 12:11                                       ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-26 12:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> a "disturber" dd read task during roughly 120-130s. 

Ah, but ideally the disturber task should run in bursts of 100ms
(<feedback period), otherwise your N is indeed mostly constant.



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 12:11                                       ` Peter Zijlstra
@ 2011-08-26 12:20                                         ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> > a "disturber" dd read task during roughly 120-130s. 
> 
> Ah, but ideally the disturber task should run in bursts of 100ms
> (<feedback period), otherwise your N is indeed mostly constant.

Ah yeah, the disturber task should be a dd writer! Then we get

- 120s: N=1 => N=2
- 130s: N=2 => N=1

I'll try it right away.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 12:20                                         ` Wu Fengguang
  (?)
@ 2011-08-26 13:13                                         ` Wu Fengguang
  2011-08-26 13:18                                             ` Peter Zijlstra
  -1 siblings, 1 reply; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 08:20:57PM +0800, Wu Fengguang wrote:
> On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> > > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> > > a "disturber" dd read task during roughly 120-130s. 
> > 
> > Ah, but ideally the disturber task should run in bursts of 100ms
> > (<feedback period), otherwise your N is indeed mostly constant.
> 
> Ah yeah, the disturber task should be a dd writer! Then we get
> 
> - 120s: N=1 => N=2
> - 130s: N=2 => N=1

Here they are. The write disturber starts/stops around 150s.

We get a similar result as in the read disturber case, even though one
disturbs N and the other impacts the writeout bandwidth.  The original
patchset consistently performs much better :)

Thanks,
Fengguang

[-- Attached figures: balance_dirty_pages-pages.png,
    balance_dirty_pages-pages_pure-rate-feedback.png --]

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 13:13                                         ` Wu Fengguang
@ 2011-08-26 13:18                                             ` Peter Zijlstra
  0 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-26 13:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote:
> > We get a similar result as in the read disturber case, even though one
> > disturbs N and the other impacts the writeout bandwidth.  The original
> > patchset consistently performs much better :)

It does indeed, and I figure on these timescales it makes sense to
assume N is a constant. Fair enough, thanks!

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 13:18                                             ` Peter Zijlstra
@ 2011-08-26 13:24                                               ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-26 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 09:18:21PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote:
> > We get a similar result as in the read disturber case, even though one
> > disturbs N and the other impacts the writeout bandwidth.  The original
> > patchset consistently performs much better :)
> 
> It does indeed, and I figure on these timescales it makes sense to
> assume N is a constant. Fair enough, thanks!

Thank you! Glad that we finally reached some consensus :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00                             ` Vivek Goyal
@ 2011-08-29 13:12                               ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-08-29 13:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote:
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.
> 
> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another task starts dirtying the page cache on the same bdi. The
> number of dirty pages should go up pretty fast and the position ratio
> feedback will likely kick in to reduce the dirtying rate (the rate-based
> feedback does not kick in till the next 200ms, while the pos_ratio
> feedback seems to be instantaneous). Assume the new pos_ratio is .5
> 
> So the new throttle rate for both tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (the rate feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now let's say 200ms have passed and the rate-based feedback is reevaluated.
> 
>                                                       write_bw  
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
>                                                       dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2, but
> that did not happen. The reason is that there are two feedback control
> loops, and the pos_ratio loop reacts to imbalances much more quickly.
> Because that loop has already reacted to the imbalance and reduced the
> dirtying rate of the tasks, the rate-based loop does not try to adjust
> anything and thinks everything is just fine.
> 
> Things are fine in the sense that dirty_rate == write_bw still holds, but
> the system is not balanced in terms of the number of dirty pages, and
> pos_ratio=.5
> 
> So you are trying to make one feedback loop aware of the second loop, so
> that if the second loop is unbalanced, the first loop reacts to that as
> well and does not just look at dirty_rate and write_bw. So refining the
> new balanced rate by pos_ratio helps.
>                                                       write_bw  
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                       dirty_bw
> 
> Now if global dirty pages are imbalanced, the balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to a
> further reduction in the tasks' dirty rate, which in turn will reduce the
> number of dirty pages and should eventually lead to pos_ratio=1.
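
(For completeness: plugging the same numbers into the refined formula
gives bdi->dirty_ratelimit_(i+1) = 100 * (100/100) * .5 = 50MB/s, which
is exactly the ideal value called out above.)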


Ok so this argument makes sense, is there some formalism to describe
such systems where such things are more evident?



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:12                               ` Peter Zijlstra
@ 2011-08-29 13:37                                 ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-29 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 29, 2011 at 09:12:07PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote:
> > 
> > Ok, I think I am beginning to see your point. Let me just elaborate on
> > the example you gave.
> > 
> > Assume a system is completely balanced and a task is writing at 100MB/s
> > rate.
> > 
> > write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> > 
> > bdi->dirty_ratelimit = 100MB/s
> > 
> > Now another task starts dirtying the page cache on the same bdi. The
> > number of dirty pages should go up pretty fast and the position ratio
> > feedback will likely kick in to reduce the dirtying rate (the rate-based
> > feedback does not kick in till the next 200ms, while the pos_ratio
> > feedback seems to be instantaneous). Assume the new pos_ratio is .5
> > 
> > So the new throttle rate for both tasks is 50MB/s.
> > 
> > bdi->dirty_ratelimit = 100MB/s (the rate feedback has not kicked in yet)
> > task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> > 
> > Now let's say 200ms have passed and the rate-based feedback is reevaluated.
> > 
> >                                                       write_bw  
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
> >                                                       dirty_bw
> > 
> > bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> > 
> > Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2, but
> > that did not happen. The reason is that there are two feedback control
> > loops, and the pos_ratio loop reacts to imbalances much more quickly.
> > Because that loop has already reacted to the imbalance and reduced the
> > dirtying rate of the tasks, the rate-based loop does not try to adjust
> > anything and thinks everything is just fine.
> > 
> > Things are fine in the sense that dirty_rate == write_bw still holds, but
> > the system is not balanced in terms of the number of dirty pages, and
> > pos_ratio=.5
> > 
> > So you are trying to make one feedback loop aware of the second loop, so
> > that if the second loop is unbalanced, the first loop reacts to that as
> > well and does not just look at dirty_rate and write_bw. So refining the
> > new balanced rate by pos_ratio helps.
> >                                                       write_bw  
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                       dirty_bw
> > 
> > Now if global dirty pages are imbalanced, the balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to a
> > further reduction in the tasks' dirty rate, which in turn will reduce the
> > number of dirty pages and should eventually lead to pos_ratio=1.
> 
> 
> Ok so this argument makes sense, is there some formalism to describe
> such systems where such things are more evident?

I find the easiest and cleanest way to describe it is:

(1) the below formula
                                                          write_bw  
    bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
                                                          dirty_bw
is able to yield

    dirty_ratelimit_(i) ~= (write_bw / N)

as long as

- write_bw, dirty_bw and pos_ratio are not changing rapidly
- dirty pages are not around @freerun or @limit

Otherwise there will be larger estimation errors.

(2) based on (1), we get

    task_ratelimit ~= (write_bw / N) * pos_ratio

So the pos_ratio feedback is able to drive dirty count to the
setpoint, where pos_ratio = 1.
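
To see why the formula in (1) settles at that value, a quick
substitution (assuming N tasks, each throttled at
dirty_ratelimit_i * pos_ratio, so that
dirty_bw = N * dirty_ratelimit_i * pos_ratio):

                                                       write_bw
    dirty_ratelimit_(i+1) = dirty_ratelimit_i * --------------------------------- * pos_ratio
                                                N * dirty_ratelimit_i * pos_ratio
                          = write_bw / N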

That interpretation based on _real values_ can neatly decouple the two
feedback loops :) It makes full use of the fact that "the
dirty_ratelimit _value_ is independent of pos_ratio except for
possible impacts on estimation errors".

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:37                                 ` Wu Fengguang
@ 2011-09-02 12:16                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-09-02 12:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote:
> > 
> > Ok so this argument makes sense, is there some formalism to describe
> > such systems where such things are more evident?
> 
> I find the easiest and cleanest way to describe it is:
> 
> (1) the below formula
>                                                           write_bw  
>     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                           dirty_bw
> is able to yield
> 
>     dirty_ratelimit_(i) ~= (write_bw / N)
> 
> as long as
> 
> - write_bw, dirty_bw and pos_ratio are not changing rapidly
> - dirty pages are not around @freerun or @limit
> 
> Otherwise there will be larger estimation errors.
> 
> (2) based on (1), we get
> 
>     task_ratelimit ~= (write_bw / N) * pos_ratio
> 
> So the pos_ratio feedback is able to drive dirty count to the
> setpoint, where pos_ratio = 1.
> 
> That interpretation based on _real values_ can neatly decouple the two
> feedback loops :) It makes full use of the fact that "the
> dirty_ratelimit _value_ is independent of pos_ratio except for
> possible impacts on estimation errors".

OK, so the 'problem' I have with this is that the whole control thing
really doesn't care about N. All it does is measure:

 - dirty rate
 - writeback rate

observe:

 - dirty count; with the independent input of its setpoint

control:

 - ratelimit

so I was looking for a way to describe the interaction between the two
feedback loops without involving the exact details of what they're
controlling, but that might just end up being an oxymoron.
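
As a sketch of such a description (taking the slow-variation
assumptions above as given, and assuming only p(setpoint) = 1 and
dp/dx < 0 about the control curve): let r be the ratelimit, w the
writeback rate, x the dirty count and p(x) the pos_ratio curve.
Per interval dt,

	dirty_bw:      d  = N * r * p(x)
	rate loop:     r' = r * (w / d) * p(x) = w / N
	position loop: x' = x + (d - w) * dt

The p(x) factors cancel in the rate loop, so it settles in one step
regardless of the curve; what remains is x' = x + w * (p(x) - 1) * dt,
a plain negative feedback loop around the setpoint.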



^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:37                                 ` Wu Fengguang
@ 2011-09-06 12:40                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 305+ messages in thread
From: Peter Zijlstra @ 2011-09-06 12:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-09-02 at 14:16 +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote:
> > > 
> > > Ok so this argument makes sense, is there some formalism to describe
> > > such systems where such things are more evident?
> > 
> > I find the easiest and cleanest way to describe it is:
> > 
> > (1) the below formula
> >                                                           write_bw  
> >     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                           dirty_bw
> > is able to yield
> > 
> >     dirty_ratelimit_(i) ~= (write_bw / N)
> > 
> > as long as
> > 
> > - write_bw, dirty_bw and pos_ratio are not changing rapidly
> > - dirty pages are not around @freerun or @limit
> > 
> > Otherwise there will be larger estimation errors.
> > 
> > (2) based on (1), we get
> > 
> >     task_ratelimit ~= (write_bw / N) * pos_ratio
> > 
> > So the pos_ratio feedback is able to drive dirty count to the
> > setpoint, where pos_ratio = 1.
> > 
> > That interpretation based on _real values_ can neatly decouple the two
> > feedback loops :) It makes full use of the fact that "the
> > dirty_ratelimit _value_ is independent of pos_ratio except for
> > possible impacts on estimation errors".
> 
> OK, so the 'problem' I have with this is that the whole control thing
> really doesn't care about N. All it does is measure:
> 
>  - dirty rate
>  - writeback rate
> 
> observe:
> 
>  - dirty count; with the independent input of its setpoint
> 
> control:
> 
>  - ratelimit
> 
> so I was looking for a way to describe the interaction between the two
> feedback loops without involving the exact details of what they're
> controlling, but that might just end up being an oxymoron.


Hmm, so per Vivek's argument the system without pos_ratio in the
feedback term isn't convergent. Therefore we should be able to argue
on convergence/stability grounds that this term is indeed needed.

Does the stability proof of a control system need a model of what it's
controlling? I guess I ought to go get a book on this or so.
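
A linearized sketch suggests it does, at least minimally: with the
rate loop settled at w/N, the remaining dynamics per interval dt is

	x' = x + w * (p(x) - 1) * dt

Writing e = x - setpoint and linearizing around the setpoint (where
p = 1),

	e' = (1 + w * dt * p'(setpoint)) * e

which converges iff  -2 < w * dt * p'(setpoint) < 0.  So stability
needs p'(setpoint) < 0 (negative feedback) and a slope that is not too
steep relative to the writeback rate w -- the plant's w does enter the
condition.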




^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-24  3:16           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
> span?

OK, I'll follow your suggestion to use

        span = 8 * write_bw, for single bdi case 
        span = bdi_thresh, for JBOD case
        x_intercept = setpoint + span;

It does make sense to squeeze the bdi_dirty fluctuation range a bit by
doubling span and making the control line sharper.
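
For reference, the two cases fall out of one weighted interpolation;
a user-space sketch (illustrative types; overflow handling and the
patch's div_u64 omitted):

	unsigned long calc_span(unsigned long thresh,
				unsigned long bdi_thresh,
				unsigned long write_bw)
	{
		/*
		 * bdi_thresh ~= thresh (single bdi): span -> 8 * write_bw
		 * bdi_thresh << thresh (JBOD):       span -> bdi_thresh
		 */
		return ((thresh - bdi_thresh) * bdi_thresh +
			8 * write_bw * bdi_thresh) / (thresh + 1);
	}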

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-19  2:53     ` Vivek Goyal
@ 2011-08-19  3:25       ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-19  3:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:53:21AM +0800, Vivek Goyal wrote:
> On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:
> 
> [..]
> > +/*
> > + * Dirty position control.
> > + *
> > + * (o) global/bdi setpoints
> > + *
> > + * We want the dirty pages be balanced around the global/bdi setpoints.
> > + * When the number of dirty pages is higher/lower than the setpoint, the
> > + * dirty position control ratio (and hence task dirty ratelimit) will be
> > + * decreased/increased to bring the dirty pages back to the setpoint.
> > + *
> > + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> > + *
> > + *     if (dirty < setpoint) scale up   pos_ratio
> > + *     if (dirty > setpoint) scale down pos_ratio
> > + *
> > + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> > + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> > + *
> > + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> > + *
> > + * (o) global control line
> > + *
> > + *     ^ pos_ratio
> > + *     |
> > + *     |            |<===== global dirty control scope ======>|
> > + * 2.0 .............*
> > + *     |            .*
> > + *     |            . *
> > + *     |            .   *
> > + *     |            .     *
> > + *     |            .        *
> > + *     |            .            *
> > + * 1.0 ................................*
> > + *     |            .                  .     *
> > + *     |            .                  .          *
> > + *     |            .                  .              *
> > + *     |            .                  .                 *
> > + *     |            .                  .                    *
> > + *   0 +------------.------------------.----------------------*------------->
> > + *           freerun^          setpoint^                 limit^   dirty pages
> > + *
> > + * (o) bdi control lines
> > + *
> > + * The control lines for the global/bdi setpoints both stretch up to @limit.
> > + * The below figure illustrates the main bdi control line with an auxiliary
> > + * line extending it to @limit.
> > + *
> > + *   o
> > + *     o
> > + *       o                                      [o] main control line
> > + *         o                                    [*] auxiliary control line
> > + *           o
> > + *             o
> > + *               o
> > + *                 o
> > + *                   o
> > + *                     o
> > + *                       o--------------------- balance point, rate scale = 1
> > + *                       | o
> > + *                       |   o
> > + *                       |     o
> > + *                       |       o
> > + *                       |         o
> > + *                       |           o
> > + *                       |             o------- connect point, rate scale = 1/2
> > + *                       |               .*
> > + *                       |                 .   *
> > + *                       |                   .      *
> > + *                       |                     .         *
> > + *                       |                       .           *
> > + *                       |                         .              *
> > + *                       |                           .                 *
> > + *  [--------------------+-----------------------------.--------------------*]
> > + *  0                 setpoint                     x_intercept           limit
> > + *
> > + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> > + * normal if it starts high in situations like
> > + * - start writing to a slow SD card and a fast disk at the same time. The SD
> > + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> > + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> > + */
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
> > +	x_intercept = setpoint + 2 * span;
> > +
> 
> Hi Fengguang,
> 
> Few very basic queries.
> 
> - Why can't we use the same formula for the bdi position ratio as for the
>   global position ratio? Are you not looking for similar properties: near
>   the setpoint variation is small, and away from the setpoint throttling
>   is faster.

The changelog has more details; however, I hope this rephrased summary
answers the question better.

Firstly, in the single bdi case, the bdi and global formulas complement
each other: the bdi's slope is proportional to the writeout bandwidth,
while the global one scales with memory size. In a huge memory system,
the global position feedback becomes very weak (even far away from the
setpoint). This is where the bdi control line can help pull the dirty
pages to the setpoint.

Secondly, in the JBOD case, the global and bdi dirty thresholds are
fundamentally different. The global one is a stable, strong limit,
while the bdi one fluctuates and hence is only suitable as a weak
limit. The other reason to make it a weak limit is that there are
valid situations where (bdi_dirty >> bdi_thresh) and it's desirable to
throttle the dirtier at a reasonably small rate rather than to hard
throttle it.

> - In the bdi calculation, the setpoint seems to be a number of pages while
>   the limit (x_intercept) seems to be a combination of nr pages + pages/sec.
>   Why is it different from the global setpoint and limit? I mean, could this
>   not have been like the global calculation, where we try to keep bdi_dirty
>   close to bdi_thresh and calculate pos_ratio?

Because the bdi dirty pages are observed to typically fluctuate by up to
one second's worth of data, the write_bw used here is really (1s * write_bw).

> - In the global pos_ratio calculation the terminology used is "limit",
>   while the same thing seems to be called x_intercept in the bdi position
>   ratio calculation.

Yes, because the bdi control lines don't intend to impose a hard limit at all.

It's actually possible for x_intercept to become larger than the global limit.
This means it's a memory-tight system (or the storage is super fast)
where the bdi dirty pages will inevitably fluctuate a lot (up to write_bw).
We just let go of them and let the global formula take control.
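
To make the different shapes concrete, here is a floating-point sketch
of the two lines (the kernel computes these in fixed point with
RATELIMIT_CALC_SHIFT; the auxiliary line and clamping are omitted):

	/* global line: f(freerun) = 2.0, f(setpoint) = 1.0, f(limit) = 0 */
	static double global_pos_ratio(double dirty, double freerun,
				       double limit)
	{
		double setpoint = (freerun + limit) / 2;
		double x = (setpoint - dirty) / (limit - setpoint);

		return 1.0 + x * x * x;
	}

	/* main bdi line: f(bdi_setpoint) = 1.0, f(x_intercept) = 0 */
	static double bdi_pos_ratio(double bdi_dirty, double bdi_setpoint,
				    double x_intercept)
	{
		return 1.0 - (bdi_dirty - bdi_setpoint) /
			     (x_intercept - bdi_setpoint);
	}

The effective pos_ratio is the product of the two, which is why a >1
bdi factor gets counteracted by the cubic dropping towards 0 near
@limit.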

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-19  2:53     ` Vivek Goyal
  -1 siblings, 0 replies; 305+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 setpoint                     x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
> +	x_intercept = setpoint + 2 * span;
> +

Hi Fengguang,

Few very basic queries.

- Why can't we use the same formula for the bdi position ratio as for the
  global position ratio? Are you not looking for similar properties: near
  the setpoint variation is small, and away from the setpoint throttling
  is faster.

- In the bdi calculation, the setpoint seems to be a number of pages while
  the limit (x_intercept) seems to be a combination of nr pages + pages/sec.
  Why is it different from the global setpoint and limit? I mean, could this
  not have been like the global calculation, where we try to keep bdi_dirty
  close to bdi_thresh and calculate pos_ratio?

- In the global pos_ratio calculation the terminology used is "limit",
  while the same thing seems to be called x_intercept in the bdi position
  ratio calculation.

Am I missing something very basic here?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18 19:16             ` Jan Kara
  -1 siblings, 0 replies; 305+ messages in thread
From: Jan Kara @ 2011-08-18 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counteracting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> all over the places.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that subjects to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that in some configurations bdi_thresh can fluctuate by its own
> size, I guess the current slope of the control line is sharp enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means, when bdi_dirty deviates by bdi_thresh, pos_ratio and
> hence the task ratelimit will fluctuate by -1/2. This is probably
> already more than users can tolerate?
  OK, let's try that.

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulted pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that is subject to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that can yield a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that is subject to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in the JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18  4:41             ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Hi Jan,

> > > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > > easily 500 MB, that happens quite often I imagine?
> > > 
> > > That's fine because I no longer target "bdi_thresh" as some limiting
> > > factor as the global "thresh". Due to it being unstable in small
> > > memory JBOD systems, which is the big and unique problem in JBOD.
> >   I see. Given the control mechanism below, I think we can try this idea
> > and see whether it makes problems in practice or not. But the fact that
> > bdi_thresh is no longer treated as limit should be noted in a changelog -
> > probably of the last patch (although that is already too long for my taste
> > so I'll look into how we could make it shorter so that average developer
> > has enough patience to read it ;).
> 
> Good point. I'll make it a comment in the last patch.

Just added this comment:

+               /*
+                * bdi_thresh is not treated as a hard limiting factor the
+                * way dirty_thresh is, for two reasons:
+                * - in a JBOD setup, bdi_thresh can fluctuate a lot
+                * - in a system with HDD and USB key, the USB key may somehow
+                *   go into state (bdi_dirty >> bdi_thresh) either because
+                *   bdi_dirty starts high, or because bdi_thresh drops low.
+                *   In this case we don't want to hard throttle the USB key
+                *   dirtiers for 100 seconds until bdi_dirty drops under
+                *   bdi_thresh. Instead the auxiliary bdi control line in
+                *   bdi_position_ratio() will let the dirtier task progress
+                *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+                */
                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
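
To attach a made up number to the USB key case: past the connect point
the bdi control line scales pos_ratio down to 1/2 and below, so
assuming balanced_rate has settled near the key's write_bw = 10MB/s,
the dirtier still progresses at

        task_ratelimit ~= balanced_rate * pos_ratio <= 10MB/s / 2 = 5MB/s

rather than being stalled outright until bdi_dirty drops under
bdi_thresh.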

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-18  4:18           ` Wu Fengguang
  -1 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 
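
Concretely, differentiating the cubic gives

        df/dx = -3 * (setpoint - dirty)^2 / (limit - setpoint)^3

which is <= 0 everywhere and is exactly 0 only at dirty == setpoint.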

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counter-acting any > 1 bdi pos_ratio.
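
For illustration, a minimal userspace sketch of how the two factors
combine (floating point with invented numbers, assuming the single bdi
case where bdi_dirty tracks dirty; the kernel code is fixed point):

	#include <stdio.h>

	/* global control line: 2.0 at freerun, 1.0 at setpoint, 0 at limit */
	static double global_factor(double dirty, double freerun, double limit)
	{
		double setpoint = (freerun + limit) / 2;
		double x = (setpoint - dirty) / (limit - setpoint);

		return 1.0 + x * x * x;
	}

	/* main bdi control line: 1.0 at bdi_setpoint, 0 at x_intercept */
	static double bdi_factor(double bdi_dirty, double bdi_setpoint,
				 double x_intercept)
	{
		return (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);
	}

	int main(void)
	{
		double freerun = 100, limit = 200;	/* made up page counts */
		double bdi_setpoint = 150, x_intercept = 350;
		double dirty;

		for (dirty = freerun; dirty <= limit; dirty += 20)
			printf("dirty=%3.0f pos_ratio=%.3f\n", dirty,
			       global_factor(dirty, freerun, limit) *
			       bdi_factor(dirty, bdi_setpoint, x_intercept));
		return 0;
	}

The bdi factor starts above 1 here (bdi_dirty < bdi_setpoint), yet the
product still falls to 0 at dirty = limit because the cubic factor
vanishes there.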

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, to keep it consistent
everywhere.
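
(A quick sanity check of the fixed point scaling, with made up numbers:
if bdi_thresh is 1/4 of thresh, then

        x = (bdi_thresh << 16) / (thresh + 1)  ~= 1 << 14
        bdi_setpoint = setpoint * x >> 16      ~= setpoint / 4

i.e. the bdi setpoint is the global setpoint scaled by the bdi's share
of the dirty threshold.)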

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that in some configurations bdi_thresh can fluctuate by its own
size, I guess the current slope of the control line is sharp enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means, when bdi_dirty deviates by bdi_thresh, pos_ratio and
hence the task ratelimit will fluctuate by -1/2. That is probably
already more than users can tolerate?

btw. the connection point of the main/auxiliary control lines is at

        (x_intercept + bdi_setpoint) / 2 

as shown in the graph in the updated patch below.
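
Plugging x_intercept = bdi_setpoint + 2 * span into the main control
line confirms the 1/2 rate scale at that point:

        f(bdi_setpoint + span)
                = (x_intercept - (bdi_setpoint + span))
                  / (x_intercept - bdi_setpoint)
                = span / (2 * span)
                = 1/2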

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Letting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw
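
(Check: dirty then swings by write_bw in total, so pos_ratio swings by
|k| * write_bw = write_bw / (8 * write_bw) = 1/8 = 12.5%.)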

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
is related to memory size due to the interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages to be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that is subject to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that is subject to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in the JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18  4:18           ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counteracting any > 1 bdi pos_ratio.
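
To see the counter-action numerically, here is a standalone user-space
sketch (all page counts made up, not kernel code):

#include <stdio.h>

int main(void)
{
	double freerun = 100000, limit = 200000;	/* made-up pages */
	double setpoint = (freerun + limit) / 2;
	double bdi_ratio = 2.0;			/* linear bdi scale, > 1 */
	double dirty, x, global_ratio;

	for (dirty = setpoint; dirty <= limit; dirty += 10000) {
		x = (setpoint - dirty) / (limit - setpoint);
		global_ratio = 1 + x * x * x;	/* 3rd order polynomial */
		/* the product drops to 0 at @limit despite bdi_ratio == 2 */
		printf("dirty=%6.0f pos_ratio=%.3f\n",
		       dirty, global_ratio * bdi_ratio);
	}
	return 0;
}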

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, to make it consistent
everywhere.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that in some configurations bdi_thresh can fluctuate up to its own
size, I guess the current slope of the control line is sharp enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means, when bdi_dirty deviates by bdi_thresh, pos_ratio and
hence the task ratelimit will fluctuate by -1/2. That is probably
already more than users can tolerate?

btw. the connection point of the main/auxiliary control lines is at

        (x_intercept + bdi_setpoint) / 2 

as shown in the graph of the below updated patch.
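
Here is a standalone check of both numbers (made-up page counts, not
kernel code):

#include <stdio.h>

/* main bdi control line: f(bdi_setpoint) = 1.0, f(x_intercept) = 0 */
static double main_line(double bdi_dirty, double bdi_setpoint,
			double x_intercept)
{
	return (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);
}

int main(void)
{
	double bdi_setpoint = 30000, span = 40000;	/* made up */
	double x_intercept = bdi_setpoint + 2 * span;
	/* (x_intercept + bdi_setpoint) / 2 == bdi_setpoint + span */
	double connect = (x_intercept + bdi_setpoint) / 2;

	printf("f(bdi_setpoint) = %.2f\n",		/* 1.00 */
	       main_line(bdi_setpoint, bdi_setpoint, x_intercept));
	printf("f(connect)      = %.2f\n",		/* 0.50 */
	       main_line(connect, bdi_setpoint, x_intercept));
	return 0;
}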

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1 second's worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw
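
For example (standalone sketch, write_bw made up), plugging
dirty = bdi_setpoint +- write_bw/2 into the line confirms the 12.5% range:

#include <stdio.h>

int main(void)
{
	double write_bw = 25600;	/* pages/s, ~100MB/s, made up */
	double k = -1.0 / (8 * write_bw);
	double above = 1 + k * (+write_bw / 2);	/* 1 - 1/16 */
	double below = 1 + k * (-write_bw / 2);	/* 1 + 1/16 */

	printf("pos_ratio range = %.1f%%\n", (below - above) * 100);
	return 0;
}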

The global/bdi slopes nicely complement each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages to be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 20:24         ` Jan Kara
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 305+ messages in thread
From: Jan Kara @ 2011-08-17 20:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hi Fengguang,

On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > +					unsigned long thresh,
> > > +					unsigned long bg_thresh,
> > > +					unsigned long dirty,
> > > +					unsigned long bdi_thresh,
> > > +					unsigned long bdi_dirty)
> > > +{
> > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > +	unsigned long x_intercept;
> > > +	unsigned long setpoint;		/* the target balance point */
> > > +	unsigned long span;
> > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > +	long x;
> > > +
> > > +	if (unlikely(dirty >= limit))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * global setpoint
> > > +	 *
> > > +	 *                         setpoint - dirty 3
> > > +	 *        f(dirty) := 1 + (----------------)
> > > +	 *                         limit - setpoint
> > > +	 *
> > > +	 * it's a 3rd order polynomial that subjects to
> > > +	 *
> > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > +	 * (3) f(limit)    = 0   => the hard limit
> > > +	 * (4) df/dx       < 0	 => negative feedback control
                          ^^^ Strictly speaking this is <= 0

> > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > +	 */
> > > +	setpoint = (freerun + limit) / 2;
> > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > +		    limit - setpoint + 1);
> > > +	pos_ratio = x;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > +
> > > +	/*
> > > +	 * bdi setpoint
  OK, so if I understand the code right, we now have basic pos_ratio based
on global situation. Now, in the following code, we might scale pos_ratio
further down, if bdi_dirty is too much over bdi's share, right? Do we also
want to scale pos_ratio up, if we are under bdi's share? If yes, do we
really want to do it even if global pos_ratio < 1 (i.e. we are over global
setpoint)?

Maybe we could update the comment with something like:
 * We have computed basic pos_ratio above based on global situation. If the
 * bdi is over its share of dirty pages, we want to scale pos_ratio further
 * down. That is done by the following mechanism:
and now describe how updating works.

> > > +	 *
> > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
                  ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
bdi_setpoint to distinguish clearly from the global value.

> > > +	 *
> > > +	 * The main bdi control line is a linear function that subjects to
> > > +	 *
> > > +	 * (1) f(setpoint) = 1.0
> > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > +	 *
> > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > +	 * regularly within range
> > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > +	 * fluctuation range for pos_ratio.
> > > +	 *
> > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > +	 * own size, so move the slope over accordingly.
> > > +	 */
> > > +	if (unlikely(bdi_thresh > thresh))
> > > +		bdi_thresh = thresh;
> > > +	/*
> > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > +	 */
> > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > +	setpoint = setpoint * (u64)x >> 16;
> > > +	/*
> > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > +	 */
> > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > +		       thresh + 1);
> >   I think you can slightly simplify this to:
> > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> 
> Good idea!
> 
> > > +	x_intercept = setpoint + 2 * span;
   ^^ BTW, why do you have 2*span here? It can result in x_intercept being
~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
span?

> >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > easily 500 MB, that happens quite often I imagine?
> 
> That's fine because I no longer target "bdi_thresh" as some limiting
> factor as the global "thresh". Due to it being unstable in small
> memory JBOD systems, which is the big and unique problem in JBOD.
  I see. Given the control mechanism below, I think we can try this idea
and see whether it makes problems in practice or not. But the fact that
bdi_thresh is no longer treated as limit should be noted in a changelog -
probably of the last patch (although that is already too long for my taste
so I'll look into how we could make it shorter so that average developer
has enough patience to read it ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > +		if (x_intercept < limit) {
> > > +			x_intercept = limit;	/* auxiliary control line */
> > > +			setpoint += span;
> > > +			pos_ratio >>= 1;
> > > +		}
> >   And here you stretch the control area upto the global dirty limit. I
> > understand you maybe don't want to be really strict and cut control area at
> > bdi_thresh but your choice looks like too benevolent - when you have
> > several active bdi's with different speeds this will effectively erase
> > difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> > bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> > bdi_dirty really heavily exceeds bdi_thresh.
> 
> Yes the auxiliary control line could be very flat (small slope).
> 
> However it's not normal for the bdi dirty pages to fall into the
> range of auxiliary control line at all. And once it takes effect, 
> the pos_ratio is at most 0.5 (which is the value at the connection
> point with the main bdi control line) which is strong enough to pull
> the dirty pages off the auxiliary bdi control line and into the scope
> of main bdi control line.
> 
> The auxiliary control line is intended for bringing down the bdi_dirty
> of the USB key before 250s (where the "pos bandwidth" line keeps low): 
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png
> 
> After that the bdi_dirty will fluctuate around bdi_thresh and won't
> grow high and step into the scope of the auxiliary control line.

Note that the main/auxiliary bdi control lines won't take effect at
the same time: the main bdi control line works around and under the
bdi setpoint, and the auxiliary line takes over in the higher scope up
to @limit.

In the 1UKEY+1HDD test case, the bdi_dirty of the UKEY rushes at the
free run stage when global dirty pages are smaller than (thresh+bg_thresh)/2.

So it will initially be under the control of the auxiliary line. Hence the
dirtier task will progress at 1/4 to 1/2 of the UKEY's write bandwidth. 
This will bring down the bdi_dirty reasonably fast while still allowing
the dirtier task to make some progress.

The connection point of the main/auxiliary control lines has pos_ratio=0.5.

After 250 seconds, the main bdi control line takes over, indicated by
the bdi_dirty fluctuating around the bdi setpoint and the position rate
(green line) fluctuating around the base ratelimit (blue line).
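
A standalone sketch of the auxiliary line over that range (made-up
numbers; "setpoint" below is the shifted bdi_setpoint + span):

#include <stdio.h>

/* auxiliary line after the takeover: x_intercept = limit, ratio halved */
static double aux_line(double bdi_dirty, double setpoint, double limit)
{
	return 0.5 * (limit - bdi_dirty) / (limit - setpoint);
}

int main(void)
{
	double setpoint = 2000, limit = 100000;	/* UKEY-like: tiny share */

	printf("%.2f\n", aux_line(setpoint, setpoint, limit));	/* 0.50 */
	printf("%.2f\n", aux_line(51000, setpoint, limit));	/* 0.25 */
	return 0;
}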

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16 19:41     ` Jan Kara
  (?)
@ 2011-08-17 13:23     ` Wu Fengguang
  2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  -1 siblings, 2 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6444 bytes --]

Hi Jan,

On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
>   Hello Fengguang,
> 
>   this patch is much easier to read than in older versions! Good work!

Thank you :)

> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
>   I think you can slightly simplify this to:
> (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;

Good idea!
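
For the record, a quick numeric check (standalone sketch, made-up
values) that the simplified form agrees up to fixed-point rounding:

#include <stdio.h>

int main(void)
{
	unsigned long long thresh = 100000, bdi_thresh = 20000;
	unsigned long long wbw = 5000;		/* avg_write_bandwidth */
	unsigned long long span_old, span_new, x;

	span_old = (bdi_thresh * (thresh - bdi_thresh) +
		    4 * wbw * bdi_thresh) / (thresh + 1);
	x = (bdi_thresh << 16) / (thresh + 1);
	span_new = (thresh - bdi_thresh + 4 * wbw) * x >> 16;

	printf("%llu %llu\n", span_old, span_new);	/* 19999 19999 */
	return 0;
}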

> > +	x_intercept = setpoint + 2 * span;
>   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> easily 500 MB, that happens quite often I imagine?

That's fine because I no longer treat "bdi_thresh" as a limiting
factor the way the global "thresh" is, since it is unstable in small
memory JBOD systems, which is the big and unique problem in JBOD.

> > +
> > +	if (unlikely(bdi_dirty > setpoint + span)) {
> > +		if (unlikely(bdi_dirty > limit))
> > +			return 0;
>   Shouldn't this be bdi_thresh instead of limit? I understand this is a
> hard limit but with more bdis this condition is rather weak and almost
> never true.

Yeah, I mean @limit. @bdi_thresh is made weak in IO-less
balance_dirty_pages() in order to get a reasonably smooth dirty rate in
the face of a fluctuating @bdi_thresh.

The tradeoff is to let bdi dirty pages fluctuate more or less freely,
as long as they don't drop low and risk IO queue underflow. The
attached patch tries to prevent the underflow (which is good but not
perfect).

> > +		if (x_intercept < limit) {
> > +			x_intercept = limit;	/* auxiliary control line */
> > +			setpoint += span;
> > +			pos_ratio >>= 1;
> > +		}
>   And here you stretch the control area upto the global dirty limit. I
> understand you maybe don't want to be really strict and cut control area at
> bdi_thresh but your choice looks like too benevolent - when you have
> several active bdi's with different speeds this will effectively erase
> difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> bdi_dirty really heavily exceeds bdi_thresh.

Yes the auxiliary control line could be very flat (small slope).

However it's not normal for the bdi dirty pages to fall into the
range of the auxiliary control line at all. And once it takes effect,
the pos_ratio is at most 0.5 (the value at the connection point with
the main bdi control line), which is strong enough to pull the dirty
pages off the auxiliary bdi control line and into the scope of the
main bdi control line.

The auxiliary control line is intended for bringing down the bdi_dirty
of the USB key before 250s (where the "pos bandwidth" line stays low):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png

After that the bdi_dirty will fluctuate around bdi_thresh and won't
grow high and step into the scope of the auxiliary control line.

> So wouldn't it be better to
> just make sure control area is reasonably large (e.g. at least 16 MB) to
> allow BDI to ramp up it's bdi_thresh but don't extend it upto global dirty
> limit?

In order to take bdi_thresh as some semi-strict limit, we need to make
it very stable at first... otherwise the whole control system may fluctuate
violently.

Thanks,
Fengguang

> > +	}
> > +	pos_ratio *= x_intercept - bdi_dirty;
> > +	do_div(pos_ratio, x_intercept - setpoint + 1);
> > +
> > +	return pos_ratio;
> > +}
> > +
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

[-- Attachment #2: bdi-reserve-area --]
[-- Type: text/plain, Size: 2539 bytes --]

Subject: writeback: dirty position control - bdi reserve area
Date: Thu Aug 04 22:16:46 CST 2011

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun.

It's particularly useful for JBOD and small memory systems.

XXX:
When memory is small (in comparison to write bandwidth), this control
line may result in (pos_ratio > 1) at the setpoint and push the dirty
pages high. This is more or less intended because the bdi is in
danger of IO queue underflow. However the global dirty pages, when
pushed close to the limit, will eventually counteract our desire to push up
the low bdi_dirty. In low memory JBOD tests we do see disks
under-utilized from time to time.

One scheme that may completely fix this is to add a BDI_queue_empty to
indicate the block IO queue emptiness (but still there may be in flight
IOs on the driver/hardware side) and to unthrottle the tasks regardless
of the global limit on seeing BDI_queue_empty.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-16 09:06:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 09:06:50.000000000 +0800
@@ -488,6 +488,16 @@ unsigned long bdi_dirty_limit(struct bac
  *   0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
+ * (o) bdi reserve area
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up rate as dirty pages drop low
+ * |<----------------------------------------------->|
+ * |-------------------------------------------------------*-------|----------
+ * 0                                           bdi setpoint^       ^bdi_thresh
+ *
  * (o) bdi control lines
  *
  * The control lines for the global/bdi setpoints both stretch up to @limit.
@@ -571,6 +581,19 @@ static unsigned long bdi_position_ratio(
 	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
 
 	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 */
+	x_intercept = min(bdi->avg_write_bandwidth + 2 * MIN_WRITEBACK_PAGES,
+			  freerun);
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
+	/*
 	 * bdi setpoint
 	 *
 	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
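
The scale-up in the hunk above, restated as a standalone user-space
sketch (made-up page counts):

#include <stdio.h>

/* boost pos_ratio as bdi_dirty drops below the reserve area */
static double reserve_boost(double pos_ratio, double bdi_dirty,
			    double x_intercept)
{
	if (bdi_dirty < x_intercept) {
		if (bdi_dirty > x_intercept / 8)
			pos_ratio *= x_intercept / bdi_dirty;
		else
			pos_ratio *= 8;		/* cap the boost at 8x */
	}
	return pos_ratio;
}

int main(void)
{
	double x_intercept = 16000;	/* ~1s of writeback, made up */

	printf("%.1f\n", reserve_boost(1.0, 4000, x_intercept));  /* 4.0 */
	printf("%.1f\n", reserve_boost(1.0, 1000, x_intercept));  /* 8.0 */
	return 0;
}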

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 12:03   ` Jan Kara
@ 2011-08-17 12:35     ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-17 12:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: David Horner, linux-kernel

On Wed, Aug 17, 2011 at 08:03:56PM +0800, Jan Kara wrote:
> On Wed 17-08-11 02:40:19, David Horner wrote:
> >  I noticed a significant typo below (another of those thousand eyes,
> > thanks to Jan Kara's post that started me looking):
> > 
> >  > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> >  > + unsigned long thresh,
> > ...
> >  > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> >  > + * own size, so move the slope over accordingly.
> >  > + */
> >  > + if (unlikely(bdi_thresh > thresh))
> >  > + bdi_thresh = thresh;
> >  > + /*
> >  > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> >  > + */
> >  > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > 
> >                   ^
> >  I believe should be
> > 
> >     x = div_u64((u64)bdi_thresh << 16, thresh + 1);
>   I've noticed this as well but it's mostly a consistency issue. 'thresh'
> is going to be large in practice so there's not much difference between
> thresh + 1 and thresh | 1.

Right :) Anyway I'll change it to thresh + 1.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17  6:40 ` David Horner
@ 2011-08-17 12:03   ` Jan Kara
  2011-08-17 12:35     ` Wu Fengguang
  0 siblings, 1 reply; 305+ messages in thread
From: Jan Kara @ 2011-08-17 12:03 UTC (permalink / raw)
  To: David Horner; +Cc: linux-kernel, fengguang.wu, jack

On Wed 17-08-11 02:40:19, David Horner wrote:
>  I noticed a significant typo below (another of those thousand eyes,
> thanks to Jan Kara's post that started me looking):
> 
>  > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>  > + unsigned long thresh,
> ...
>  > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
>  > + * own size, so move the slope over accordingly.
>  > + */
>  > + if (unlikely(bdi_thresh > thresh))
>  > + bdi_thresh = thresh;
>  > + /*
>  > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
>  > + */
>  > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> 
>                   ^
>  I believe should be
> 
>     x = div_u64((u64)bdi_thresh << 16, thresh + 1);
  I've noticed this as well but it's mostly a consistency issue. 'thresh'
is going to be large in practice so there's not much difference between
thresh + 1 and thresh | 1.
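
Both guards only matter for the zero case; a trivial standalone
illustration (made-up values):

#include <stdio.h>

int main(void)
{
	unsigned long thresh = 0;	/* the only case the guard matters */

	printf("%lu %lu\n", thresh | 1, thresh + 1);	/* 1 1 */

	thresh = 100000;	/* typical: the forms differ by at most 1 */
	printf("%lu %lu\n", thresh | 1, thresh + 1);	/* 100001 100001 */
	return 0;
}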

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
       [not found] <CAFdhcLRKvfqBnXCXLwq-Qe1eNAGC-8XJ3BtHpQKzaa3RhHyp6A@mail.gmail.com>
@ 2011-08-17  6:40 ` David Horner
  2011-08-17 12:03   ` Jan Kara
  0 siblings, 1 reply; 305+ messages in thread
From: David Horner @ 2011-08-17  6:40 UTC (permalink / raw)
  To: linux-kernel, fengguang.wu; +Cc: jack

 I noticed a significant typo below (another of those thousand eyes,
thanks to Jan Kara's post that started me looking):

 > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 > + unsigned long thresh,
...
 > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
 > + * own size, so move the slope over accordingly.
 > + */
 > + if (unlikely(bdi_thresh > thresh))
 > + bdi_thresh = thresh;
 > + /*
 > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
 > + */
 > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);

                  ^
 I believe should be

    x = div_u64((u64)bdi_thresh << 16, thresh + 1);

    David Horner

^ permalink raw reply	[flat|nested] 305+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16 19:41     ` Jan Kara
  -1 siblings, 0 replies; 305+ messages in thread
From: Jan Kara @ 2011-08-16 19:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hello Fengguang,

  this patch is much easier to read than in older versions! Good work!

> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
  I think you can slightly simplify this to:
(thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;


> +	x_intercept = setpoint + 2 * span;
  What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
easily 500 MB, that happens quite often I imagine?

> +
> +	if (unlikely(bdi_dirty > setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
  Shouldn't this be bdi_thresh instead of limit? I understand this is a
hard limit but with more bdis this condition is rather weak and almost
never true.

> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			setpoint += span;
> +			pos_ratio >>= 1;
> +		}
  And here you stretch the control area up to the global dirty limit. I
understand you maybe don't want to be really strict and cut control area at
bdi_thresh but your choice looks like too benevolent - when you have
several active bdi's with different speeds this will effectively erase
difference between them, won't it? E.g. with 10 bdi's (x_intercept -
bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
bdi_dirty really heavily exceeds bdi_thresh. So wouldn't it be better to
just make sure control area is reasonably large (e.g. at least 16 MB) to
allow the BDI to ramp up its bdi_thresh but don't extend it up to the global dirty
limit?

> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 305+ messages in thread


* [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 305+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13157 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence the task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within one second's worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range for pos_ratio while the dirty pages
fluctuate within [setpoint - write_bw/2, setpoint + write_bw/2],
we get the slope

	k = - 1 / (8 * write_bw)

Letting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = setpoint + 8 * write_bw
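
Spelling out the arithmetic behind those two constants: dirty sweeping
across a width of write_bw around the setpoint changes pos_ratio by

	|k| * write_bw = write_bw / (8 * write_bw) = 1/8 = 12.5%

and solving 1 + k * (x_intercept - setpoint) = 0 gives

	x_intercept = setpoint - 1/k = setpoint + 8 * write_bw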

The global/bdi slopes nicely complement each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to the interference between disks.  In this
case, the bdi slope will be a weighted sum of write_bw and bdi_thresh, as
sketched below.
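
The span used in the code realizes that weighted sum (dropping the +1
rounding guard):

	span = (bdi_thresh * (thresh - bdi_thresh)
	        + 4 * write_bw * bdi_thresh) / thresh

	bdi_thresh ~= thresh  (single bdi):  span ~= 4 * write_bw
	bdi_thresh << thresh  (JBOD):        span ~= bdi_thresh
	                                     (plus a small write_bw term)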

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages to be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to a change in the JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial subject to
+	 *
+	 * (1) f(freerun)  = 2.0 => ramp up base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function subject to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) yields a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in the single bdi case, as indicated by
+	 * (thresh - bdi_thresh ~= 0), and transition to bdi_thresh in the JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
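
For readers who want to poke at the global control line outside the
kernel, here is a minimal userspace sketch of just that 3rd order
polynomial in the same fixed-point arithmetic (the page counts in
main() are made-up example values, not anything the patch prescribes):

#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3,
 * evaluated with RATELIMIT_CALC_SHIFT fractional bits, mirroring the
 * global-setpoint part of bdi_position_ratio() above.
 */
static long long global_pos_ratio(unsigned long freerun,
				  unsigned long limit,
				  unsigned long dirty)
{
	unsigned long setpoint = (freerun + limit) / 2;
	long long x, pos_ratio;

	if (dirty >= limit)
		return 0;
	x = (((long long)setpoint - (long long)dirty) << RATELIMIT_CALC_SHIFT)
		/ (long long)(limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
	return pos_ratio;
}

int main(void)
{
	/* hypothetical numbers: freerun = 75k pages, limit = 100k pages */
	unsigned long freerun = 75000, limit = 100000;
	unsigned long setpoint = (freerun + limit) / 2;
	unsigned long pts[3] = { freerun, setpoint, limit - 1 };
	int i;

	for (i = 0; i < 3; i++)	/* expect ~2.0, 1.0, ~0.0 */
		printf("dirty=%lu pos_ratio=%.3f\n", pts[i],
		       global_pos_ratio(freerun, limit, pts[i]) /
		       (double)(1 << RATELIMIT_CALC_SHIFT));
	return 0;
}

Compiled with a plain "cc", this prints pos_ratio ~= 2.0 at the freerun
ceiling, 1.0 at the setpoint and ~0 just below the limit, matching
properties (1)-(3) in the comment block above.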



^ permalink raw reply	[flat|nested] 305+ messages in thread


end of thread, other threads:[~2011-09-06 12:40 UTC | newest]

Thread overview: 305+ messages
2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
2011-08-06  8:44 ` Wu Fengguang
2011-08-06  8:44 ` Wu Fengguang
2011-08-06  8:44 ` [PATCH 1/5] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-08 13:46   ` Peter Zijlstra
2011-08-08 13:46     ` Peter Zijlstra
2011-08-08 13:46     ` Peter Zijlstra
2011-08-08 14:11     ` Wu Fengguang
2011-08-08 14:11       ` Wu Fengguang
2011-08-08 14:31       ` Peter Zijlstra
2011-08-08 14:31         ` Peter Zijlstra
2011-08-08 14:31         ` Peter Zijlstra
2011-08-08 22:47         ` Wu Fengguang
2011-08-08 22:47           ` Wu Fengguang
2011-08-09  9:31           ` Peter Zijlstra
2011-08-09  9:31             ` Peter Zijlstra
2011-08-09  9:31             ` Peter Zijlstra
2011-08-10 12:28             ` Wu Fengguang
2011-08-10 12:28               ` Wu Fengguang
2011-08-08 14:41       ` Peter Zijlstra
2011-08-08 14:41         ` Peter Zijlstra
2011-08-08 14:41         ` Peter Zijlstra
2011-08-08 23:05         ` Wu Fengguang
2011-08-08 23:05           ` Wu Fengguang
2011-08-09 10:32           ` Peter Zijlstra
2011-08-09 10:32             ` Peter Zijlstra
2011-08-09 10:32             ` Peter Zijlstra
2011-08-09 17:20           ` Peter Zijlstra
2011-08-09 17:20             ` Peter Zijlstra
2011-08-09 17:20             ` Peter Zijlstra
2011-08-10 22:34             ` Jan Kara
2011-08-10 22:34               ` Jan Kara
2011-08-11  2:29               ` Wu Fengguang
2011-08-11  2:29                 ` Wu Fengguang
2011-08-11 11:14                 ` Jan Kara
2011-08-11 11:14                   ` Jan Kara
2011-08-16  8:35                   ` Wu Fengguang
2011-08-16  8:35                     ` Wu Fengguang
2011-08-12 13:19             ` Wu Fengguang
2011-08-12 13:19               ` Wu Fengguang
2011-08-10 21:40           ` Vivek Goyal
2011-08-10 21:40             ` Vivek Goyal
2011-08-16  8:55             ` Wu Fengguang
2011-08-16  8:55               ` Wu Fengguang
2011-08-11 22:56           ` Peter Zijlstra
2011-08-11 22:56             ` Peter Zijlstra
2011-08-11 22:56             ` Peter Zijlstra
2011-08-12  2:43             ` Wu Fengguang
2011-08-12  2:43               ` Wu Fengguang
2011-08-12  3:18               ` Wu Fengguang
2011-08-12  5:45               ` Wu Fengguang
2011-08-12  5:45                 ` Wu Fengguang
2011-08-12  9:45                 ` Peter Zijlstra
2011-08-12  9:45                   ` Peter Zijlstra
2011-08-12  9:45                   ` Peter Zijlstra
2011-08-12 11:07                   ` Wu Fengguang
2011-08-12 11:07                     ` Wu Fengguang
2011-08-12 12:17                     ` Peter Zijlstra
2011-08-12 12:17                       ` Peter Zijlstra
2011-08-12 12:17                       ` Peter Zijlstra
2011-08-12  9:47               ` Peter Zijlstra
2011-08-12  9:47                 ` Peter Zijlstra
2011-08-12  9:47                 ` Peter Zijlstra
2011-08-12 11:11                 ` Wu Fengguang
2011-08-12 11:11                   ` Wu Fengguang
2011-08-12 12:54           ` Peter Zijlstra
2011-08-12 12:54             ` Peter Zijlstra
2011-08-12 12:54             ` Peter Zijlstra
2011-08-12 12:59             ` Wu Fengguang
2011-08-12 12:59               ` Wu Fengguang
2011-08-12 13:08               ` Peter Zijlstra
2011-08-12 13:08                 ` Peter Zijlstra
2011-08-12 13:08                 ` Peter Zijlstra
2011-08-12 13:04           ` Peter Zijlstra
2011-08-12 13:04             ` Peter Zijlstra
2011-08-12 13:04             ` Peter Zijlstra
2011-08-12 14:20             ` Wu Fengguang
2011-08-12 14:20               ` Wu Fengguang
2011-08-22 15:38               ` Peter Zijlstra
2011-08-22 15:38                 ` Peter Zijlstra
2011-08-22 15:38                 ` Peter Zijlstra
2011-08-23  3:40                 ` Wu Fengguang
2011-08-23  3:40                   ` Wu Fengguang
2011-08-23 10:01                   ` Peter Zijlstra
2011-08-23 10:01                     ` Peter Zijlstra
2011-08-23 10:01                     ` Peter Zijlstra
2011-08-23 14:15                     ` Wu Fengguang
2011-08-23 14:15                       ` Wu Fengguang
2011-08-23 17:47                       ` Vivek Goyal
2011-08-23 17:47                         ` Vivek Goyal
2011-08-24  0:12                         ` Wu Fengguang
2011-08-24  0:12                           ` Wu Fengguang
2011-08-24 16:12                           ` Peter Zijlstra
2011-08-24 16:12                             ` Peter Zijlstra
2011-08-26  0:18                             ` Wu Fengguang
2011-08-26  0:18                               ` Wu Fengguang
2011-08-26  9:04                               ` Peter Zijlstra
2011-08-26  9:04                                 ` Peter Zijlstra
2011-08-26 10:04                                 ` Wu Fengguang
2011-08-26 10:04                                   ` Wu Fengguang
2011-08-26 10:42                                   ` Peter Zijlstra
2011-08-26 10:42                                     ` Peter Zijlstra
2011-08-26 10:52                                     ` Wu Fengguang
2011-08-26 10:52                                       ` Wu Fengguang
2011-08-26 11:26                                   ` Wu Fengguang
2011-08-26 12:11                                     ` Peter Zijlstra
2011-08-26 12:11                                       ` Peter Zijlstra
2011-08-26 12:20                                       ` Wu Fengguang
2011-08-26 12:20                                         ` Wu Fengguang
2011-08-26 13:13                                         ` Wu Fengguang
2011-08-26 13:18                                           ` Peter Zijlstra
2011-08-26 13:18                                             ` Peter Zijlstra
2011-08-26 13:24                                             ` Wu Fengguang
2011-08-26 13:24                                               ` Wu Fengguang
2011-08-24 18:00                           ` Vivek Goyal
2011-08-24 18:00                             ` Vivek Goyal
2011-08-25  3:19                             ` Wu Fengguang
2011-08-25  3:19                               ` Wu Fengguang
2011-08-25 22:20                               ` Vivek Goyal
2011-08-25 22:20                                 ` Vivek Goyal
2011-08-26  1:56                                 ` Wu Fengguang
2011-08-26  1:56                                   ` Wu Fengguang
2011-08-26  8:56                                   ` Peter Zijlstra
2011-08-26  8:56                                     ` Peter Zijlstra
2011-08-26  9:53                                     ` Wu Fengguang
2011-08-26  9:53                                       ` Wu Fengguang
2011-08-29 13:12                             ` Peter Zijlstra
2011-08-29 13:12                               ` Peter Zijlstra
2011-08-29 13:37                               ` Wu Fengguang
2011-08-29 13:37                                 ` Wu Fengguang
2011-09-02 12:16                                 ` Peter Zijlstra
2011-09-02 12:16                                   ` Peter Zijlstra
2011-09-06 12:40                                 ` Peter Zijlstra
2011-09-06 12:40                                   ` Peter Zijlstra
2011-08-24 15:57                       ` Peter Zijlstra
2011-08-24 15:57                         ` Peter Zijlstra
2011-08-24 15:57                         ` Peter Zijlstra
2011-08-25  5:30                         ` Wu Fengguang
2011-08-25  5:30                           ` Wu Fengguang
2011-08-23 14:36                     ` Vivek Goyal
2011-08-23 14:36                       ` Vivek Goyal
2011-08-09  2:08   ` Vivek Goyal
2011-08-09  2:08     ` Vivek Goyal
2011-08-16  8:59     ` Wu Fengguang
2011-08-16  8:59       ` Wu Fengguang
2011-08-06  8:44 ` [PATCH 3/5] writeback: dirty rate control Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-09 14:54   ` Vivek Goyal
2011-08-09 14:54     ` Vivek Goyal
2011-08-11  3:42     ` Wu Fengguang
2011-08-11  3:42       ` Wu Fengguang
2011-08-09 14:57   ` Peter Zijlstra
2011-08-09 14:57     ` Peter Zijlstra
2011-08-09 14:57     ` Peter Zijlstra
2011-08-10 11:07     ` Wu Fengguang
2011-08-10 11:07       ` Wu Fengguang
2011-08-10 16:17       ` Peter Zijlstra
2011-08-10 16:17         ` Peter Zijlstra
2011-08-10 16:17         ` Peter Zijlstra
2011-08-15 14:08         ` Wu Fengguang
2011-08-15 14:08           ` Wu Fengguang
2011-08-09 15:50   ` Vivek Goyal
2011-08-09 15:50     ` Vivek Goyal
2011-08-09 16:16     ` Peter Zijlstra
2011-08-09 16:16       ` Peter Zijlstra
2011-08-09 16:16       ` Peter Zijlstra
2011-08-09 16:19       ` Peter Zijlstra
2011-08-09 16:19         ` Peter Zijlstra
2011-08-09 16:19         ` Peter Zijlstra
2011-08-10 14:07         ` Wu Fengguang
2011-08-10 14:07           ` Wu Fengguang
2011-08-10 14:00       ` Wu Fengguang
2011-08-10 14:00         ` Wu Fengguang
2011-08-10 17:10         ` Peter Zijlstra
2011-08-10 17:10           ` Peter Zijlstra
2011-08-15 14:11           ` Wu Fengguang
2011-08-15 14:11             ` Wu Fengguang
2011-08-09 16:56   ` Peter Zijlstra
2011-08-09 16:56     ` Peter Zijlstra
2011-08-09 16:56     ` Peter Zijlstra
2011-08-10 14:10     ` Wu Fengguang
2011-08-09 17:02   ` Peter Zijlstra
2011-08-09 17:02     ` Peter Zijlstra
2011-08-09 17:02     ` Peter Zijlstra
2011-08-10 14:15     ` Wu Fengguang
2011-08-10 14:15       ` Wu Fengguang
2011-08-06  8:44 ` [PATCH 4/5] writeback: per task dirty rate limit Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06 14:35   ` Andrea Righi
2011-08-06 14:35     ` Andrea Righi
2011-08-07  6:19     ` Wu Fengguang
2011-08-07  6:19       ` Wu Fengguang
2011-08-08 13:47   ` Peter Zijlstra
2011-08-08 13:47     ` Peter Zijlstra
2011-08-08 13:47     ` Peter Zijlstra
2011-08-08 14:21     ` Wu Fengguang
2011-08-08 14:21       ` Wu Fengguang
2011-08-08 23:32       ` Wu Fengguang
2011-08-08 23:32         ` Wu Fengguang
2011-08-08 14:23     ` Wu Fengguang
2011-08-08 14:23       ` Wu Fengguang
2011-08-08 14:26       ` Peter Zijlstra
2011-08-08 14:26         ` Peter Zijlstra
2011-08-08 14:26         ` Peter Zijlstra
2011-08-08 22:38         ` Wu Fengguang
2011-08-08 22:38           ` Wu Fengguang
2011-08-13 16:28       ` Andrea Righi
2011-08-13 16:28         ` Andrea Righi
2011-08-15 14:21         ` Wu Fengguang
2011-08-15 14:26           ` Andrea Righi
2011-08-15 14:26             ` Andrea Righi
2011-08-09 17:46   ` Vivek Goyal
2011-08-09 17:46     ` Vivek Goyal
2011-08-10  3:29     ` Wu Fengguang
2011-08-10  3:29       ` Wu Fengguang
2011-08-10 18:18       ` Vivek Goyal
2011-08-10 18:18         ` Vivek Goyal
2011-08-11  0:55         ` Wu Fengguang
2011-08-11  0:55           ` Wu Fengguang
2011-08-09 18:35   ` Peter Zijlstra
2011-08-09 18:35     ` Peter Zijlstra
2011-08-09 18:35     ` Peter Zijlstra
2011-08-10  3:40     ` Wu Fengguang
2011-08-10  3:40       ` Wu Fengguang
2011-08-10 10:25       ` Peter Zijlstra
2011-08-10 10:25         ` Peter Zijlstra
2011-08-10 10:25         ` Peter Zijlstra
2011-08-10 11:13         ` Wu Fengguang
2011-08-10 11:13           ` Wu Fengguang
2011-08-06  8:44 ` [PATCH 5/5] writeback: IO-less balance_dirty_pages() Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06 14:48   ` Andrea Righi
2011-08-06 14:48     ` Andrea Righi
2011-08-06 14:48     ` Andrea Righi
2011-08-07  6:44     ` Wu Fengguang
2011-08-07  6:44       ` Wu Fengguang
2011-08-07  6:44       ` Wu Fengguang
2011-08-06 16:46   ` Andrea Righi
2011-08-06 16:46     ` Andrea Righi
2011-08-07  7:18     ` Wu Fengguang
2011-08-07  9:50       ` Andrea Righi
2011-08-07  9:50         ` Andrea Righi
2011-08-09 18:15   ` Vivek Goyal
2011-08-09 18:15     ` Vivek Goyal
2011-08-09 18:41     ` Peter Zijlstra
2011-08-09 18:41       ` Peter Zijlstra
2011-08-09 18:41       ` Peter Zijlstra
2011-08-10  3:22       ` Wu Fengguang
2011-08-10  3:22         ` Wu Fengguang
2011-08-10  3:26     ` Wu Fengguang
2011-08-10  3:26       ` Wu Fengguang
2011-08-09 19:16   ` Vivek Goyal
2011-08-09 19:16     ` Vivek Goyal
2011-08-10  4:33     ` Wu Fengguang
2011-08-09  2:01 ` [PATCH 0/5] IO-less dirty throttling v8 Vivek Goyal
2011-08-09  2:01   ` Vivek Goyal
2011-08-09  5:55   ` Dave Chinner
2011-08-09  5:55     ` Dave Chinner
2011-08-09 14:04     ` Vivek Goyal
2011-08-09 14:04       ` Vivek Goyal
2011-08-10  7:41       ` Greg Thelen
2011-08-10  7:41         ` Greg Thelen
2011-08-10  7:41         ` Greg Thelen
2011-08-10 18:40         ` Vivek Goyal
2011-08-10 18:40           ` Vivek Goyal
2011-08-10 18:40           ` Vivek Goyal
2011-08-11  3:21   ` Wu Fengguang
2011-08-11  3:21     ` Wu Fengguang
2011-08-11 20:42     ` Vivek Goyal
2011-08-11 20:42       ` Vivek Goyal
2011-08-11 21:00       ` Vivek Goyal
2011-08-11 21:00         ` Vivek Goyal
2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
2011-08-16  2:20 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16 19:41   ` Jan Kara
2011-08-16 19:41     ` Jan Kara
2011-08-17 13:23     ` Wu Fengguang
2011-08-17 13:49       ` Wu Fengguang
2011-08-17 13:49         ` Wu Fengguang
2011-08-17 20:24       ` Jan Kara
2011-08-17 20:24         ` Jan Kara
2011-08-18  4:18         ` Wu Fengguang
2011-08-18  4:18           ` Wu Fengguang
2011-08-18  4:41           ` Wu Fengguang
2011-08-18  4:41             ` Wu Fengguang
2011-08-18 19:16           ` Jan Kara
2011-08-18 19:16             ` Jan Kara
2011-08-24  3:16         ` Wu Fengguang
2011-08-24  3:16           ` Wu Fengguang
2011-08-19  2:53   ` Vivek Goyal
2011-08-19  2:53     ` Vivek Goyal
2011-08-19  3:25     ` Wu Fengguang
2011-08-19  3:25       ` Wu Fengguang
     [not found] <CAFdhcLRKvfqBnXCXLwq-Qe1eNAGC-8XJ3BtHpQKzaa3RhHyp6A@mail.gmail.com>
2011-08-17  6:40 ` David Horner
2011-08-17 12:03   ` Jan Kara
2011-08-17 12:35     ` Wu Fengguang
