* [PATCH 0/7] more writeback patches for 3.1
@ 2011-06-19 15:01 Wu Fengguang
  2011-06-19 15:01 ` [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

Hi,

These patches are mainly separated out from the original IO-less writeback
patchset. They are generally useful as standalone improvements. This
series is based on the write bandwidth patches posted at

	https://lkml.org/lkml/2011/6/12/69

They will appear in the git branch below after a while, and will be posted
as one single series in the next version.

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

cleanups
	[PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages()
	[PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth()

dirty limits
	[PATCH 3/7] writeback: introduce smoothed global dirty limit
	[PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits

write chunk size
	[PATCH 5/7] writeback: make writeback_control.nr_to_write straight
	[PATCH 6/7] writeback: scale IO chunk size up to half device bandwidth

dirty exceeded state
	[PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state

Thanks,
Fengguang



* [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages()
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-20  7:45   ` Christoph Hellwig
  2011-06-19 15:01 ` [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth() Wu Fengguang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 2673 bytes --]

Introduce

	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

in order to simplify many tests in the following patches.

balance_dirty_pages() will eventually care only about the dirty sums,
not about nr_writeback on its own.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-06-13 17:27:10.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-13 17:27:30.000000000 +0800
@@ -560,8 +560,9 @@ static void bdi_update_bandwidth(struct 
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long write_chunk)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
+	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -581,7 +582,7 @@ static void balance_dirty_pages(struct a
 
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -590,8 +591,7 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -609,10 +609,12 @@ static void balance_dirty_pages(struct a
 		 */
 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
 			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_nr_reclaimable +
+				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
 			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_nr_reclaimable +
+				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
 		/*
@@ -621,9 +623,8 @@ static void balance_dirty_pages(struct a
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
+				  (nr_dirty > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;




* [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth()
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
  2011-06-19 15:01 ` [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-19 15:31   ` Christoph Hellwig
  2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: lll --]
[-- Type: text/plain, Size: 2660 bytes --]

The added parameters will be used by the next patch and more future patches.

It's separated out for easier review, and will be merged either with the
next patch or with the previous write bandwidth patch in future posts.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 +-
 include/linux/writeback.h |    3 +++
 mm/page-writeback.c       |   21 +++++++++++++++++++--
 3 files changed, 23 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-06-19 22:51:50.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-19 22:53:46.000000000 +0800
@@ -549,6 +549,9 @@ static void global_update_bandwidth(unsi
 }
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long thresh,
+			    unsigned long dirty,
+			    unsigned long bdi_dirty,
 			    unsigned long start_time)
 {
 	unsigned long now = jiffies;
@@ -581,12 +593,16 @@ snapshot:
 }
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
+				 unsigned long thresh,
+				 unsigned long dirty,
+				 unsigned long bdi_dirty,
 				 unsigned long start_time)
 {
 	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
 		return;
 	if (spin_trylock(&bdi->wb.list_lock)) {
-		__bdi_update_bandwidth(bdi, start_time);
+		__bdi_update_bandwidth(bdi, thresh, dirty, bdi_dirty,
+				       start_time);
 		spin_unlock(&bdi->wb.list_lock);
 	}
 }
@@ -673,7 +689,8 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/include/linux/writeback.h	2011-06-19 22:51:50.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-19 22:53:29.000000000 +0800
@@ -125,6 +125,9 @@ unsigned long bdi_dirty_limit(struct bac
 			       unsigned long dirty);
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long thresh,
+			    unsigned long dirty,
+			    unsigned long bdi_dirty,
 			    unsigned long start_time);
 
 void page_writeback_init(void);
--- linux-next.orig/fs/fs-writeback.c	2011-06-19 22:51:50.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-19 22:53:29.000000000 +0800
@@ -635,7 +635,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, start_time);
 }
 
 /*




* [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
  2011-06-19 15:01 ` [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
  2011-06-19 15:01 ` [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth() Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-19 15:36   ` Christoph Hellwig
                     ` (2 more replies)
  2011-06-19 15:01 ` [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-dirty-thresh-limit.patch --]
[-- Type: text/plain, Size: 2759 bytes --]

The start of a heavyweight application (e.g. KVM) may instantly knock
down determine_dirtyable_memory() and hence the global/bdi dirty
thresholds.

So introduce global_dirty_limit for tracking the global dirty threshold,
with the following policies:

- follow it downwards slowly
- follow it upwards in one shot

global_dirty_limit can effectively mask out the impact of a sudden drop of
dirtyable memory. It will be used in the next patch for two new types of
dirty limits.
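
A rough worked example of the "follow downwards slowly" policy (illustrative
numbers only; the actual code is in the patch below): with global_dirty_limit
at 1,000,000 pages, 900,000 dirty pages and a threshold that just dropped to
600,000 pages, one update step trims the limit by only

	(limit - max(thresh, dirty)) >> 5 = (1000000 - 900000) / 32 ~= 3000 pages

so dirtiers keep being throttled against a limit close to the old threshold
instead of instantly finding themselves deep in dirty exceeded state. As soon
as the threshold rises above the limit again, the limit jumps up to it in one
step.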

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 +
 mm/page-writeback.c       |   41 ++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-06-19 22:56:18.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-19 22:59:29.000000000 +0800
@@ -88,6 +88,8 @@ static inline void laptop_sync_completio
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 
+extern unsigned long global_dirty_limit;
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern unsigned long dirty_background_bytes;
--- linux-next.orig/mm/page-writeback.c	2011-06-19 22:56:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-19 22:59:29.000000000 +0800
@@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+unsigned long global_dirty_limit;
 
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
@@ -510,6 +511,43 @@ static void bdi_update_write_bandwidth(s
 	bdi->avg_write_bandwidth = avg;
 }
 
+static void update_dirty_limit(unsigned long thresh,
+				 unsigned long dirty)
+{
+	unsigned long limit = global_dirty_limit;
+
+	if (limit < thresh) {
+		limit = thresh;
+		goto update;
+	}
+
+	if (limit > thresh &&
+	    limit > dirty) {
+		limit -= (limit - max(thresh, dirty)) >> 5;
+		goto update;
+	}
+	return;
+update:
+	global_dirty_limit = limit;
+}
+
+static void global_update_bandwidth(unsigned long thresh,
+				    unsigned long dirty,
+				    unsigned long now)
+{
+	static DEFINE_SPINLOCK(dirty_lock);
+
+	if (now - default_backing_dev_info.bw_time_stamp < MAX_PAUSE)
+		return;
+
+	spin_lock(&dirty_lock);
+	if (now - default_backing_dev_info.bw_time_stamp >= MAX_PAUSE) {
+		update_dirty_limit(thresh, dirty);
+		default_backing_dev_info.bw_time_stamp = now;
+	}
+	spin_unlock(&dirty_lock);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long dirty,
@@ -535,6 +573,9 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
+	if (thresh)
+		global_update_bandwidth(thresh, dirty, now);
+
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:




* [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-22  0:20   ` Andrew Morton
  2011-06-19 15:01 ` [PATCH 5/7] writeback: make writeback_control.nr_to_write straight Wu Fengguang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-dirty-limits --]
[-- Type: text/plain, Size: 2928 bytes --]

The max-pause limit helps to keep the sleep time inside
balance_dirty_pages() within 200ms. The 200ms max sleep means a per-task
rate limit of 8 pages/200ms = 160KB/s, which normally is enough to stop
dirtiers from pushing the dirty pages higher, unless there is a
sufficiently large number of slow dirtiers (e.g. 500 tasks doing 160KB/s
still sum up to 80MB/s, reaching the write bandwidth of a slow disk).

The pass-good limit helps to let go of the good bdi's in the presence of
a blocked bdi (e.g. an NFS server not responding) or a slow USB disk which
for some reason has built up a large number of initial dirty pages that
refuse to go away anytime soon.
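
For concreteness, the 160KB/s figure is 8 pages * 4KB = 32KB written per
200ms pause (assuming 4KB pages), and with a (smoothed) global dirty limit
of 1600MB the two regions work out to roughly (illustrative numbers only):

	(1600MB, 1700MB)   max-pause:  up to limit + limit/DIRTY_MAXPAUSE (+1/16)
	(1700MB, 1800MB)   pass-good:  up to limit + limit/DIRTY_PASSGOOD (+1/8)

Tasks in the first region break out of the loop once they have paused for
the 200ms maximum; in the second region, tasks dirtying a bdi that is still
below its own bdi_thresh are let go even though the global limit is exceeded.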

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |   21 +++++++++++++++++++++
 mm/page-writeback.c       |   13 +++++++++++++
 2 files changed, 34 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-06-19 22:59:29.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-19 22:59:47.000000000 +0800
@@ -7,6 +7,27 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+/*
+ * The 1/16 region above the global dirty limit will be put to maximum pauses:
+ *
+ *		(limit, limit + limit/DIRTY_MAXPAUSE)
+ *
+ * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
+ * to loops:
+ *
+ *		(limit + limit/DIRTY_MAXPAUSE, limit + limit/DIRTY_PASSGOOD)
+ *
+ * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
+ * time) for the dirty pages to drop, unless written enough pages.
+ *
+ * The global dirty threshold is normally equal to the global dirty limit,
+ * except when the system suddenly allocates a lot of anonymous memory and
+ * knocks down the global dirty threshold quickly, in which case the global
+ * dirty limit will follow down slowly to prevent livelocking all dirtier tasks.
+ */
+#define DIRTY_MAXPAUSE		16
+#define DIRTY_PASSGOOD		8
+
 struct backing_dev_info;
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:29.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-19 22:59:47.000000000 +0800
@@ -399,6 +399,11 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long hard_dirty_limit(unsigned long thresh)
+{
+	return max(thresh, global_dirty_limit);
+}
+
 /*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
@@ -704,6 +709,14 @@ static void balance_dirty_pages(struct a
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
 
+		dirty_thresh = hard_dirty_limit(dirty_thresh);
+		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE &&
+		    jiffies - start_time > MAX_PAUSE)
+			break;
+		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD &&
+		    bdi_dirty < bdi_thresh)
+			break;
+
 		/*
 		 * Increase the delay for each loop, up to our previous
 		 * default of taking a 100ms nap.




* [PATCH 5/7] writeback: make writeback_control.nr_to_write straight
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-06-19 15:01 ` [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-19 15:35   ` Christoph Hellwig
  2011-06-19 15:01 ` [PATCH 6/7] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
  2011-06-19 15:01 ` [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state Wu Fengguang
  6 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-pass-work-to-writeback_sb_inodes.patch --]
[-- Type: text/plain, Size: 18184 bytes --]

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remained in wbc->nr_to_write.
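
A small illustration of that IO pattern change (numbers are illustrative):
with the old shared wbc, one wb_writeback() pass started with
nr_to_write = 1024 pages; if the first inodes in b_io consumed 900 of them,
a large dirty file reached next was only asked to write the remaining 124
pages. With this patch, writeback_sb_inodes() resets wbc.nr_to_write to a
full writeback_chunk_size() for every inode, so that file gets a full 4MB
chunk (and a larger one once the chunk size is scaled up later in this
series).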

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  195 ++++++++++++++++-------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |   39 ++++-
 mm/backing-dev.c                 |    9 -
 mm/page-writeback.c              |   17 --
 6 files changed, 145 insertions(+), 123 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_written
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-06-19 22:16:11.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-19 22:47:43.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -472,7 +482,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_written++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -506,6 +515,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -513,18 +547,30 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes written.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	unsigned long start_time = jiffies;
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -539,7 +585,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -553,12 +599,18 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (!(inode->i_state & I_DIRTY))
+			wrote++;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
@@ -570,17 +622,25 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	unsigned long start_time = jiffies;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -590,33 +650,36 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
-			break;
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -656,43 +719,14 @@ static void wb_update_bandwidth(struct b
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
 	unsigned long wb_start = jiffies;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -722,27 +756,20 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_written = 0;
-
-		trace_wbc_writeback_start(&wbc, wb->bdi);
+		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
+			progress = __writeback_inodes_wb(wb, work);
+		trace_writeback_written(wb->bdi, work);
 
 		wb_update_bandwidth(wb, wb_start);
 
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
-
 		/*
 		 * Did we write something? Try for more
 		 *
@@ -751,9 +778,7 @@ static long wb_writeback(struct bdi_writ
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_written)
+		if (progress)
 			continue;
 		/*
 		 * No more inodes for IO, bail
@@ -766,8 +791,8 @@ static long wb_writeback(struct bdi_writ
 		 * we'll just busyloop.
 		 */
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -775,7 +800,7 @@ static long wb_writeback(struct bdi_writ
 	}
 	spin_unlock(&wb->list_lock);
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-06-19 22:22:03.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-19 22:47:43.000000000 +0800
@@ -45,12 +45,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_written;		/* # of inodes written (at least) */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -77,8 +74,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-06-19 22:16:11.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-19 22:47:43.000000000 +0800
@@ -270,14 +270,7 @@ int bdi_has_dirty_io(struct backing_dev_
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
+	writeback_inodes_wb(&bdi->wb, 1024);
 }
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-06-19 22:22:03.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-19 22:47:43.000000000 +0800
@@ -619,13 +619,6 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
@@ -689,17 +682,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE &&
--- linux-next.orig/include/trace/events/writeback.h	2011-06-19 22:16:11.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-06-19 22:47:39.000000000 +0800
@@ -62,6 +62,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -101,6 +104,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -114,7 +141,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -128,14 +154,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -145,7 +169,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
@@ -154,12 +177,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_queue_io,
--- linux-next.orig/fs/btrfs/extent_io.c	2011-06-19 22:16:11.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-06-19 22:47:39.000000000 +0800
@@ -2551,7 +2551,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2584,7 +2583,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,




* [PATCH 6/7] writeback: scale IO chunk size up to half device bandwidth
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
                   ` (4 preceding siblings ...)
  2011-06-19 15:01 ` [PATCH 5/7] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-19 15:01 ` [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state Wu Fengguang
  6 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Theodore Tso, Chris Mason, Peter Zijlstra, Wu Fengguang, LKML

[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 4158 bytes --]

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

So remove the MAX_WRITEBACK_PAGES constraint. The writeback chunk will
adapt to as many pages as the storage device can write within 500ms.

XFS is observed to do IO completions in a batch, and the batch size is
equal to the write chunk size. To avoid dirty pages suddenly dropping
out of balance_dirty_pages()'s dirty control scope and creating large
fluctuations, the chunk size is also limited to half the control scope.

The balance_dirty_pages() control scope is

	[(background_thresh + dirty_thresh) / 2, dirty_thresh]

which is by default [15%, 20%] of the dirtyable memory, and whose size is
dirty_thresh / DIRTY_FULL_SCOPE.

The adaptive write chunk size will be rounded to a 4MB boundary.
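
A rough worked example, assuming 4KB pages and illustrative numbers: a disk
with an avg_write_bandwidth of ~60MB/s (~15360 pages/s) and a 1GB
global_dirty_limit gives

	min(15360/2, 262144/DIRTY_SCOPE) = min(7680, 32768) = 7680 pages

which is rounded to 8192 pages = 32MB per chunk, i.e. roughly the amount
the device can write in half a second.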

http://bugzilla.kernel.org/show_bug.cgi?id=13930

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   23 ++++++++++-------------
 include/linux/writeback.h |   11 +++++++++++
 2 files changed, 21 insertions(+), 13 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2011-06-16 22:17:10.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-16 22:17:27.000000000 +0800
@@ -8,6 +8,10 @@
 #include <linux/fs.h>
 
 /*
+ * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
+ *
+ *		(thresh - thresh/DIRTY_FULL_SCOPE, thresh)
+ *
  * The 1/16 region above the global dirty limit will be put to maximum pauses:
  *
  *		(limit, limit + limit/DIRTY_MAXPAUSE)
@@ -25,9 +29,16 @@
  * knocks down the global dirty threshold quickly, in which case the global
  * dirty limit will follow down slowly to prevent livelocking all dirtier tasks.
  */
+#define DIRTY_SCOPE		8
+#define DIRTY_FULL_SCOPE	(DIRTY_SCOPE / 2)
 #define DIRTY_MAXPAUSE		16
 #define DIRTY_PASSGOOD		8
 
+/*
+ * 4MB minimal write chunk size
+ */
+#define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
+
 struct backing_dev_info;
 
 /*
--- linux-next.orig/fs/fs-writeback.c	2011-06-16 22:17:10.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-16 22:17:11.000000000 +0800
@@ -30,15 +30,6 @@
 #include "internal.h"
 
 /*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024L
-
-/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
@@ -515,7 +506,8 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
-static long writeback_chunk_size(struct wb_writeback_work *work)
+static long writeback_chunk_size(struct backing_dev_info *bdi,
+				 struct wb_writeback_work *work)
 {
 	long pages;
 
@@ -534,8 +526,13 @@ static long writeback_chunk_size(struct 
 	 */
 	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
 		pages = LONG_MAX;
-	else
-		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+	else {
+		pages = min(bdi->avg_write_bandwidth / 2,
+			    global_dirty_limit / DIRTY_SCOPE);
+		pages = min(pages, work->nr_pages);
+		pages = round_down(pages + MIN_WRITEBACK_PAGES,
+				   MIN_WRITEBACK_PAGES);
+	}
 
 	return pages;
 }
@@ -600,7 +597,7 @@ static long writeback_sb_inodes(struct s
 			continue;
 		}
 		__iget(inode);
-		write_chunk = writeback_chunk_size(work);
+		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 




* [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
                   ` (5 preceding siblings ...)
  2011-06-19 15:01 ` [PATCH 6/7] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
@ 2011-06-19 15:01 ` Wu Fengguang
  2011-06-20 20:09   ` Christoph Hellwig
  2011-06-20 21:38   ` Jan Kara
  6 siblings, 2 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-dirty-exceed-time.patch --]
[-- Type: text/plain, Size: 3705 bytes --]

When there is only one dirtier (or a few), dirty_exceeded is always
(or mostly) off. Converting it to a timestamp avoids this problem, and
helps to use a smaller write_chunk for smoother throttling.
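
A minimal sketch of the idea (the real code is in the diff below): the bdi
refreshes a timestamp every time the limits are found exceeded, and the
ratelimit path stays in the reduced 8-page mode for MAX_PAUSE after the
last such event:

	if (bdi_dirty > bdi_thresh || nr_dirty > dirty_thresh)
		bdi->dirty_exceed_time = jiffies;

	/* in balance_dirty_pages_ratelimited_nr() */
	if (jiffies - bdi->dirty_exceed_time <= MAX_PAUSE)
		ratelimit = 8;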

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---

Before the patch, the wait times in balance_dirty_pages() are ~200ms:

[ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
[ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
[ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms

After the patch, ~25ms:

[   90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
[   90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
[   90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
[   90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
[   90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
[   90.421787] write_bandwidth: comm=swapper pages=192 time=24ms

 include/linux/backing-dev.h |    2 +-
 mm/backing-dev.c            |    2 --
 mm/page-writeback.c         |   24 ++++++++++++------------
 3 files changed, 13 insertions(+), 15 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-19 22:59:53.000000000 +0800
@@ -483,6 +483,15 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * last time exceeded (limit - limit/DIRTY_BRAKE)
+ */
+static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
+				    unsigned long time_window)
+{
+	return jiffies - bdi->dirty_exceed_time <= time_window;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -621,7 +630,6 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
-	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
 
@@ -669,14 +677,9 @@ static void balance_dirty_pages(struct a
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
-				  (nr_dirty > dirty_thresh);
-
-		if (!dirty_exceeded)
+		if (bdi_dirty <= bdi_thresh && nr_dirty <= dirty_thresh)
 			break;
-
-		if (!bdi->dirty_exceeded)
-			bdi->dirty_exceeded = 1;
+		bdi->dirty_exceed_time = jiffies;
 
 		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
 				     start_time);
@@ -719,9 +722,6 @@ static void balance_dirty_pages(struct a
 			pause = HZ / 10;
 	}
 
-	if (!dirty_exceeded && bdi->dirty_exceeded)
-		bdi->dirty_exceeded = 0;
-
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -775,7 +775,7 @@ void balance_dirty_pages_ratelimited_nr(
 		return;
 
 	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
+	if (dirty_exceeded_recently(bdi, MAX_PAUSE))
 		ratelimit = 8;
 
 	/*
--- linux-next.orig/include/linux/backing-dev.h	2011-06-19 22:54:58.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-19 22:59:53.000000000 +0800
@@ -79,7 +79,7 @@ struct backing_dev_info {
 	unsigned long avg_write_bandwidth;
 
 	struct prop_local_percpu completions;
-	int dirty_exceeded;
+	unsigned long dirty_exceed_time;
 
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
--- linux-next.orig/mm/backing-dev.c	2011-06-19 22:59:49.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-19 22:59:53.000000000 +0800
@@ -670,8 +670,6 @@ int bdi_init(struct backing_dev_info *bd
 			goto err;
 	}
 
-	bdi->dirty_exceeded = 0;
-
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
 




* Re: [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth()
  2011-06-19 15:01 ` [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth() Wu Fengguang
@ 2011-06-19 15:31   ` Christoph Hellwig
  2011-06-19 15:35     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2011-06-19 15:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Sun, Jun 19, 2011 at 11:01:10PM +0800, Wu Fengguang wrote:
> The added parameters will be used by the next patch and more future patches.
> 
> It's separated out for easier review, and will be merged either with the
> next patch or with the previous write bandwidth patch in future posts.

I don't think adding parameters separately from the code using it is
useful.  Please fold this into the later patches.



* Re: [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth()
  2011-06-19 15:31   ` Christoph Hellwig
@ 2011-06-19 15:35     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Andrew Morton, LKML

On Sun, Jun 19, 2011 at 11:31:10PM +0800, Christoph Hellwig wrote:
> On Sun, Jun 19, 2011 at 11:01:10PM +0800, Wu Fengguang wrote:
> > The added parameters will be used by the next patch and more future patches.
> > 
> > It's separated out for easier review, and will be merged either with the
> > next patch or with the previous write bandwidth patch in future posts.
> 
> I don't think adding parameters separately from the code using it is
> useful.  Please fold this into the later patches.

OK. Folded into the next patch in local queue :)

Thanks,
Fengguang


* Re: [PATCH 5/7] writeback: make writeback_control.nr_to_write straight
  2011-06-19 15:01 ` [PATCH 5/7] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-06-19 15:35   ` Christoph Hellwig
  2011-06-19 16:14     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2011-06-19 15:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

I'd rather see this earlier in the series, before the write bandwidth
estimation ones, as it's groundwork independent of those changes.

> +		if (wrote) {
> +			if (jiffies - start_time > HZ / 10UL)
> +				break;
> +			if (work->nr_pages <= 0)
> +				break;
> +		}

This code probably wants some comments.

>  static void bdi_flush_io(struct backing_dev_info *bdi)
>  {
> -	struct writeback_control wbc = {
> -		.sync_mode		= WB_SYNC_NONE,
> -		.older_than_this	= NULL,
> -		.range_cyclic		= 1,
> -		.nr_to_write		= 1024,
> -	};
> -
> -	writeback_inodes_wb(&bdi->wb, &wbc);
> +	writeback_inodes_wb(&bdi->wb, 1024);
>  }

At this point you could probably also kill the bdi_flush_io wrapper, and
just call writeback_inodes_wb directly.  A comment on the 1024 pages to
write would be nice, if you remember it from poking the code.  I can't
find any good explanation for it offhand.



* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
@ 2011-06-19 15:36   ` Christoph Hellwig
  2011-06-19 15:55     ` Wu Fengguang
  2011-06-20 21:18   ` Jan Kara
  2011-06-22  0:04   ` Andrew Morton
  2 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2011-06-19 15:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Sun, Jun 19, 2011 at 11:01:11PM +0800, Wu Fengguang wrote:
> The start of a heavyweight application (e.g. KVM) may instantly knock
> down determine_dirtyable_memory() and hence the global/bdi dirty
> thresholds.
> 
> So introduce global_dirty_limit for tracking the global dirty threshold,
> with the following policies:
> 
> - follow it downwards slowly
> - follow it upwards in one shot
> 
> global_dirty_limit can effectively mask out the impact of a sudden drop of
> dirtyable memory. It will be used in the next patch for two new types of
> dirty limits.

This needs to be explained in more detail in comments near the actual
code.



* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-19 15:36   ` Christoph Hellwig
@ 2011-06-19 15:55     ` Wu Fengguang
  2011-06-21 23:59       ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 15:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Andrew Morton, LKML

On Sun, Jun 19, 2011 at 11:36:37PM +0800, Christoph Hellwig wrote:
> On Sun, Jun 19, 2011 at 11:01:11PM +0800, Wu Fengguang wrote:
> > The start of a heavyweight application (e.g. KVM) may instantly knock
> > down determine_dirtyable_memory() and hence the global/bdi dirty
> > thresholds.
> > 
> > So introduce global_dirty_limit for tracking the global dirty threshold,
> > with the following policies:
> > 
> > - follow it downwards slowly
> > - follow it upwards in one shot
> > 
> > global_dirty_limit can effectively mask out the impact of a sudden drop of
> > dirtyable memory. It will be used in the next patch for two new types of
> > dirty limits.
> 
> This needs to be explained in more detail in comments near the actual
> code.

Good point! This is the added comment for the update_dirty_limit() function.

/*
 * The global dirtyable memory and dirty threshold could be suddenly knocked
 * down by a large amount (e.g. on the startup of KVM). This may throw the
 * system into a deep dirty-exceeded state and throttle it to "death" for a
 * couple of seconds. The solution is to maintain global_dirty_limit, which
 * tracks slowly down to the knocked-down dirty threshold.
 */
static void update_dirty_limit(unsigned long thresh,
                               unsigned long dirty)

Thanks,
Fengguang


* Re: [PATCH 5/7] writeback: make writeback_control.nr_to_write straight
  2011-06-19 15:35   ` Christoph Hellwig
@ 2011-06-19 16:14     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-19 16:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Andrew Morton, LKML

On Sun, Jun 19, 2011 at 11:35:36PM +0800, Christoph Hellwig wrote:
> I'd rather see this earlier in the series, before the write bandwidth
> estimation ones, as it's groundwork independent of those changes.

OK, done.

> > +		if (wrote) {
> > +			if (jiffies - start_time > HZ / 10UL)
> > +				break;
> > +			if (work->nr_pages <= 0)
> > +				break;
> > +		}
> 
> This code probably wants some comments.

These duplicate the tests in writeback_sb_inodes(), which carry the comment

                /*
                 * bail out to wb_writeback() often enough to check
                 * background threshold and other termination conditions.
                 */        

I'll add a simple comment here:

/* refer to the same tests at the end of writeback_sb_inodes */

> >  static void bdi_flush_io(struct backing_dev_info *bdi)
> >  {
> > -	struct writeback_control wbc = {
> > -		.sync_mode		= WB_SYNC_NONE,
> > -		.older_than_this	= NULL,
> > -		.range_cyclic		= 1,
> > -		.nr_to_write		= 1024,
> > -	};
> > -
> > -	writeback_inodes_wb(&bdi->wb, &wbc);
> > +	writeback_inodes_wb(&bdi->wb, 1024);
> >  }
> 
> At this point you could probably also kill the bdi_flush_io wrapper, and
> just call writeback_inodes_wb directly.

Fair enough.

> A comment on the 1024 pages to
> write would be nice, if you remember it from poking the code.  I can't
> find any good explanation for it offhand.

I've commented it with

        Hopefully 1024 is large enough for efficient IO.

Thanks,
Fengguang
---
Subject: writeback: make writeback_control.nr_to_write straight 
Date: Wed May 04 19:54:37 CST 2011

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

It immediately cleans things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes() it can always start with a clean
zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remained in wbc->nr_to_write.

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  196 ++++++++++++++++-------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |   39 ++++-
 mm/backing-dev.c                 |   17 --
 mm/page-writeback.c              |   17 --
 6 files changed, 148 insertions(+), 129 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_written
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-06-19 23:57:41.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-20 00:08:00.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -472,7 +482,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_written++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -506,6 +515,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -513,18 +547,30 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes written.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	unsigned long start_time = jiffies;
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -539,7 +585,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -553,12 +599,18 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (!(inode->i_state & I_DIRTY))
+			wrote++;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
@@ -570,17 +622,25 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	unsigned long start_time = jiffies;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -590,33 +650,37 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
-			break;
+		/* refer to the same tests at the end of writeback_sb_inodes */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -646,42 +710,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -711,24 +746,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_written = 0;
-
-		trace_wbc_writeback_start(&wbc, wb->bdi);
+		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
+			progress = __writeback_inodes_wb(wb, work);
+		trace_writeback_written(wb->bdi, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -738,9 +766,7 @@ static long wb_writeback(struct bdi_writ
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_written)
+		if (progress)
 			continue;
 		/*
 		 * No more inodes for IO, bail
@@ -753,8 +779,8 @@ static long wb_writeback(struct bdi_writ
 		 * we'll just busyloop.
 		 */
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -762,7 +788,7 @@ static long wb_writeback(struct bdi_writ
 	}
 	spin_unlock(&wb->list_lock);
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-06-19 23:57:41.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-20 00:03:25.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_written;		/* # of inodes written (at least) */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -56,8 +53,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-06-19 23:57:41.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-20 00:13:48.000000000 +0800
@@ -260,18 +260,6 @@ int bdi_has_dirty_io(struct backing_dev_
 	return wb_has_dirty_io(&bdi->wb);
 }
 
-static void bdi_flush_io(struct backing_dev_info *bdi)
-{
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
-}
-
 /*
  * kupdated() used to do this. We cannot do it from the bdi_forker_thread()
  * or we risk deadlocking on ->s_umount. The longer term solution would be
@@ -457,9 +445,10 @@ static int bdi_forker_thread(void *ptr)
 			if (IS_ERR(task)) {
 				/*
 				 * If thread creation fails, force writeout of
-				 * the bdi from the thread.
+				 * the bdi from the thread. Hopefully 1024 is
+				 * large enough for efficient IO.
 				 */
-				bdi_flush_io(bdi);
+				writeback_inodes_wb(&bdi->wb, 1024);
 			} else {
 				/*
 				 * The spinlock makes sure we do not lose
--- linux-next.orig/mm/page-writeback.c	2011-06-19 23:58:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-20 00:03:25.000000000 +0800
@@ -492,13 +492,6 @@ static void balance_dirty_pages(struct a
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -560,17 +553,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux-next.orig/include/trace/events/writeback.h	2011-06-19 23:58:12.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-06-19 23:58:23.000000000 +0800
@@ -62,6 +62,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -101,6 +104,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -114,7 +141,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -128,14 +154,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -145,7 +169,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
@@ -154,12 +177,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_queue_io,
--- linux-next.orig/fs/btrfs/extent_io.c	2011-06-19 23:57:40.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-06-19 23:58:23.000000000 +0800
@@ -2551,7 +2551,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2584,7 +2583,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages()
  2011-06-19 15:01 ` [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
@ 2011-06-20  7:45   ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2011-06-20  7:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

Looks good to me.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-19 15:01 ` [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state Wu Fengguang
@ 2011-06-20 20:09   ` Christoph Hellwig
  2011-06-21 10:00     ` Steven Whitehouse
  2011-06-20 21:38   ` Jan Kara
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML, swhiteho

On Sun, Jun 19, 2011 at 11:01:15PM +0800, Wu Fengguang wrote:
> When there are only one (or several) dirtiers, dirty_exceeded is always
> (or mostly) off. Converting to timestamp avoids this problem. It helps
> to use smaller write_chunk for smoother throttling.

In current mainline gfs2 has grown a non-trivial reference to
backing_dev_info.dirty_exceeded, which needs to be dealt with.

> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> 
> Before patch, the wait time in balance_dirty_pages() are ~200ms:
> 
> [ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
> [ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
> [ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
> 
> After patch, ~25ms:
> 
> [   90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
> [   90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
> [   90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
> [   90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
> [   90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
> [   90.421787] write_bandwidth: comm=swapper pages=192 time=24ms
> 
>  include/linux/backing-dev.h |    2 +-
>  mm/backing-dev.c            |    2 --
>  mm/page-writeback.c         |   24 ++++++++++++------------
>  3 files changed, 13 insertions(+), 15 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:49.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:53.000000000 +0800
> @@ -483,6 +483,15 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * last time exceeded (limit - limit/DIRTY_BRAKE)
> + */
> +static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
> +				    unsigned long time_window)
> +{
> +	return jiffies - bdi->dirty_exceed_time <= time_window;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -621,7 +630,6 @@ static void balance_dirty_pages(struct a
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
>  	unsigned long pause = 1;
> -	bool dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
> @@ -669,14 +677,9 @@ static void balance_dirty_pages(struct a
>  		 * bdi or process from holding back light ones; The latter is
>  		 * the last resort safeguard.
>  		 */
> -		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> -				  (nr_dirty > dirty_thresh);
> -
> -		if (!dirty_exceeded)
> +		if (bdi_dirty <= bdi_thresh && nr_dirty <= dirty_thresh)
>  			break;
> -
> -		if (!bdi->dirty_exceeded)
> -			bdi->dirty_exceeded = 1;
> +		bdi->dirty_exceed_time = jiffies;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
>  				     start_time);
> @@ -719,9 +722,6 @@ static void balance_dirty_pages(struct a
>  			pause = HZ / 10;
>  	}
>  
> -	if (!dirty_exceeded && bdi->dirty_exceeded)
> -		bdi->dirty_exceeded = 0;
> -
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -775,7 +775,7 @@ void balance_dirty_pages_ratelimited_nr(
>  		return;
>  
>  	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> +	if (dirty_exceeded_recently(bdi, MAX_PAUSE))
>  		ratelimit = 8;
>  
>  	/*
> --- linux-next.orig/include/linux/backing-dev.h	2011-06-19 22:54:58.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2011-06-19 22:59:53.000000000 +0800
> @@ -79,7 +79,7 @@ struct backing_dev_info {
>  	unsigned long avg_write_bandwidth;
>  
>  	struct prop_local_percpu completions;
> -	int dirty_exceeded;
> +	unsigned long dirty_exceed_time;
>  
>  	unsigned int min_ratio;
>  	unsigned int max_ratio, max_prop_frac;
> --- linux-next.orig/mm/backing-dev.c	2011-06-19 22:59:49.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-06-19 22:59:53.000000000 +0800
> @@ -670,8 +670,6 @@ int bdi_init(struct backing_dev_info *bd
>  			goto err;
>  	}
>  
> -	bdi->dirty_exceeded = 0;
> -
>  	bdi->bw_time_stamp = jiffies;
>  	bdi->written_stamp = 0;
>  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
  2011-06-19 15:36   ` Christoph Hellwig
@ 2011-06-20 21:18   ` Jan Kara
  2011-06-21 14:24     ` Wu Fengguang
  2011-06-22  0:04   ` Andrew Morton
  2 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-06-20 21:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML, mgorman

On Sun 19-06-11 23:01:11, Wu Fengguang wrote:
> The start of a heavy weight application (ie. KVM) may instantly knock
> down determine_dirtyable_memory() and hence the global/bdi dirty
> thresholds.
> 
> So introduce global_dirty_limit for tracking the global dirty threshold
> with policies
> 
> - follow downwards slowly
> - follow up in one shot
> 
> global_dirty_limit can effectively mask out the impact of sudden drop of
> dirtyable memory. It will be used in the next patch for two new type of
> dirty limits.
Looking into the code, this patch is dependent on previously submitted
patches for estimation of BDI write bandwidth, isn't it?

> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/writeback.h |    2 +
>  mm/page-writeback.c       |   41 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> --- linux-next.orig/include/linux/writeback.h	2011-06-19 22:56:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-06-19 22:59:29.000000000 +0800
> @@ -88,6 +88,8 @@ static inline void laptop_sync_completio
>  #endif
>  void throttle_vm_writeout(gfp_t gfp_mask);
>  
> +extern unsigned long global_dirty_limit;
> +
>  /* These are exported to sysctl. */
>  extern int dirty_background_ratio;
>  extern unsigned long dirty_background_bytes;
> --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:56:18.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:29.000000000 +0800
> @@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
>  
>  /* End of sysctl-exported parameters */
>  
> +unsigned long global_dirty_limit;
>  
>  /*
>   * Scale the writeback cache size proportional to the relative writeout speeds.
> @@ -510,6 +511,43 @@ static void bdi_update_write_bandwidth(s
>  	bdi->avg_write_bandwidth = avg;
>  }
>  
> +static void update_dirty_limit(unsigned long thresh,
> +				 unsigned long dirty)
> +{
> +	unsigned long limit = global_dirty_limit;
> +
> +	if (limit < thresh) {
> +		limit = thresh;
> +		goto update;
> +	}
> +
> +	if (limit > thresh &&
> +	    limit > dirty) {
> +		limit -= (limit - max(thresh, dirty)) >> 5;
> +		goto update;
> +	}
Hmm, but strictly speaking this never really converges to the new limit,
right? And even in practice it takes 22 steps to converge within 50% and 73
steps to converge within 10% of the desired threshold. But before we get
into the discussion about how to update the dirty threshold, I'd like to
discuss what properties we require from dirty threshold updates.
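
For concreteness, here is a rough user-space sketch of the same step rule
(made-up example numbers, and assuming the dirty count stays below the
threshold so the target is simply "thresh"); it reproduces the 22 and 73
step counts above:

#include <stdio.h>

int main(void)
{
	unsigned long thresh = 100000, limit = 200000;	/* arbitrary example */
	unsigned long gap = limit - thresh;
	int step = 0, half = 0;

	while (limit - thresh > gap / 10) {
		limit -= (limit - thresh) >> 5;		/* the update rule */
		step++;
		if (!half && limit - thresh <= gap / 2) {
			printf("within 50%% after %d steps\n", step);
			half = 1;
		}
	}
	printf("within 10%% after %d steps\n", step);
	return 0;
}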

I understand it's kind of unexpected that when some process allocates
anonymous memory, we suddenly stall writers to flush out dirty data. OTOH
it is a nice behavior from memory management point of view because we
really try to keep amount of dirty pages in free memory at given
percentage which is nice for reclaim. When we choose to update dirty limit
in steps as you propose, we get some kind of compromise between behavior
for user and behavior for memory management.

But there are also other options - for example it would seem natural to me
to treat allocation of an anonymous page the same way as dirtying the page, thus
the process getting us over dirty threshold due to allocations will wait
until enough pages are written. Then we wouldn't need any smoothing of
memory available for caches because allocations would be naturally
throttled (OK, we still have memory taken by kernel allocations but these
are order of magnitude smaller problem I'd say). But I'm not sure how
acceptable this idea would be.

								Honza

> +	return;
> +update:
> +	global_dirty_limit = limit;
> +}
> +
> +static void global_update_bandwidth(unsigned long thresh,
> +				    unsigned long dirty,
> +				    unsigned long now)
> +{
> +	static DEFINE_SPINLOCK(dirty_lock);
> +
> +	if (now - default_backing_dev_info.bw_time_stamp < MAX_PAUSE)
> +		return;
> +
> +	spin_lock(&dirty_lock);
> +	if (now - default_backing_dev_info.bw_time_stamp >= MAX_PAUSE) {
> +		update_dirty_limit(thresh, dirty);
> +		default_backing_dev_info.bw_time_stamp = now;
> +	}
> +	spin_unlock(&dirty_lock);
> +}
> +
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
>  			    unsigned long dirty,
> @@ -535,6 +573,9 @@ void __bdi_update_bandwidth(struct backi
>  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>  		goto snapshot;
>  
> +	if (thresh)
> +		global_update_bandwidth(thresh, dirty, now);
> +
>  	bdi_update_write_bandwidth(bdi, elapsed, written);
>  
>  snapshot:
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-19 15:01 ` [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state Wu Fengguang
  2011-06-20 20:09   ` Christoph Hellwig
@ 2011-06-20 21:38   ` Jan Kara
  2011-06-21 15:07     ` Wu Fengguang
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-06-20 21:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Sun 19-06-11 23:01:15, Wu Fengguang wrote:
> When there are only one (or several) dirtiers, dirty_exceeded is always
> (or mostly) off. Converting to timestamp avoids this problem. It helps
> to use smaller write_chunk for smoother throttling.
  Hmm, what do you mean by "dirty_exceeded is always (or mostly) off"? I
agree that dirty_exceeded handling is broken in current kernel when there
are more dirtiers because of tasks having different dirty limits. Is this
the problem you are speaking about?

If yes, I think I have a cleaner fix for that - setting dirty_exceeded
when bdi_dirty enters the range where per-task dirty limits can be and
resetting it only when we leave that range. I can refresh the fix and
send it to you...
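
Roughly, the idea is (just a sketch, not the actual patch; min_task_thresh
below is only a placeholder for the lowest possible per-task limit):

	if (bdi_dirty > min_task_thresh)	/* inside the per-task limit range */
		bdi->dirty_exceeded = 1;
	else					/* dropped back below the whole range */
		bdi->dirty_exceeded = 0;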

								Honza

> Before patch, the wait time in balance_dirty_pages() are ~200ms:
> 
> [ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
> [ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
> [ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
> 
> After patch, ~25ms:
> 
> [   90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
> [   90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
> [   90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
> [   90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
> [   90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
> [   90.421787] write_bandwidth: comm=swapper pages=192 time=24ms
> 
>  include/linux/backing-dev.h |    2 +-
>  mm/backing-dev.c            |    2 --
>  mm/page-writeback.c         |   24 ++++++++++++------------
>  3 files changed, 13 insertions(+), 15 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:49.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:53.000000000 +0800
> @@ -483,6 +483,15 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * last time exceeded (limit - limit/DIRTY_BRAKE)
> + */
> +static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
> +				    unsigned long time_window)
> +{
> +	return jiffies - bdi->dirty_exceed_time <= time_window;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -621,7 +630,6 @@ static void balance_dirty_pages(struct a
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
>  	unsigned long pause = 1;
> -	bool dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
>  
> @@ -669,14 +677,9 @@ static void balance_dirty_pages(struct a
>  		 * bdi or process from holding back light ones; The latter is
>  		 * the last resort safeguard.
>  		 */
> -		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> -				  (nr_dirty > dirty_thresh);
> -
> -		if (!dirty_exceeded)
> +		if (bdi_dirty <= bdi_thresh && nr_dirty <= dirty_thresh)
>  			break;
> -
> -		if (!bdi->dirty_exceeded)
> -			bdi->dirty_exceeded = 1;
> +		bdi->dirty_exceed_time = jiffies;
>  
>  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
>  				     start_time);
> @@ -719,9 +722,6 @@ static void balance_dirty_pages(struct a
>  			pause = HZ / 10;
>  	}
>  
> -	if (!dirty_exceeded && bdi->dirty_exceeded)
> -		bdi->dirty_exceeded = 0;
> -
>  	if (writeback_in_progress(bdi))
>  		return;
>  
> @@ -775,7 +775,7 @@ void balance_dirty_pages_ratelimited_nr(
>  		return;
>  
>  	ratelimit = ratelimit_pages;
> -	if (mapping->backing_dev_info->dirty_exceeded)
> +	if (dirty_exceeded_recently(bdi, MAX_PAUSE))
>  		ratelimit = 8;
>  
>  	/*
> --- linux-next.orig/include/linux/backing-dev.h	2011-06-19 22:54:58.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2011-06-19 22:59:53.000000000 +0800
> @@ -79,7 +79,7 @@ struct backing_dev_info {
>  	unsigned long avg_write_bandwidth;
>  
>  	struct prop_local_percpu completions;
> -	int dirty_exceeded;
> +	unsigned long dirty_exceed_time;
>  
>  	unsigned int min_ratio;
>  	unsigned int max_ratio, max_prop_frac;
> --- linux-next.orig/mm/backing-dev.c	2011-06-19 22:59:49.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-06-19 22:59:53.000000000 +0800
> @@ -670,8 +670,6 @@ int bdi_init(struct backing_dev_info *bd
>  			goto err;
>  	}
>  
> -	bdi->dirty_exceeded = 0;
> -
>  	bdi->bw_time_stamp = jiffies;
>  	bdi->written_stamp = 0;
>  
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-20 20:09   ` Christoph Hellwig
@ 2011-06-21 10:00     ` Steven Whitehouse
  0 siblings, 0 replies; 29+ messages in thread
From: Steven Whitehouse @ 2011-06-21 10:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, linux-fsdevel, Jan Kara, Dave Chinner, Andrew Morton, LKML

Hi,

On Mon, 2011-06-20 at 16:09 -0400, Christoph Hellwig wrote:
> On Sun, Jun 19, 2011 at 11:01:15PM +0800, Wu Fengguang wrote:
> > When there are only one (or several) dirtiers, dirty_exceeded is always
> > (or mostly) off. Converting to timestamp avoids this problem. It helps
> > to use smaller write_chunk for smoother throttling.
> 
> In current mainline gfs2 has grown a non-trivial reference to
> backing_dev_info.dirty_exceeded, which needs to be dealt with.
> 
So let me try and explain what's going on there... the basic issue is
that writeback is done on a per-inode basis, but pages are accounted for
on a per-address space basis.

In GFS2, glocks referring to inodes and rgrps (resource groups) both
have an address space associated with them. These address spaces contain
the metadata that would normally be in the block device address space,
but have been separated so that we can sync and/or invalidate metadata
easily on a per-inode basis. Note that we have the additional
requirement to be able to track clean metadata, so that the existing
per-inode list of dirty metadata doesn't work for GFS2. Due to the
lifetime rules for the glocks, and the lack of an inode for rgrps, the
mapping->host for the glock address spaces has to point at the block
device inode.

Now in the normal inode case, that isn't a problem - writeback calls
->write_inode which can then write out the dirty metadata pages (if
any). The issue we've hit has been with rgrps and in particular if the
total dirty data associated with rgrps exceeds the per-bdi dirty limit.

In that case we found that writeback was spinning without making any
progress since it was trying to write back inodes (all clean by that
stage) and it didn't have any way to start writeback on rgrps. So the
simplest solution was to check the dirty exceeded flag during inode
writeback and, if set, try writing back more data than actually requested
via the ail lists. These lists contain all the dirty metadata, so they
include the rgrps too. Due to the way in which rgrps are used, it is
impossible to dirty one without also dirtying at least one inode.

In addition to that, the ordering of data blocks on the ail list is
often more optimal (especially for workloads with lots of small files)
and we get a performance improvement by doing writeback that way too.

Having said that, I know it's not ideal, and I'm open to any suggestions
for better solutions,

Steve.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-20 21:18   ` Jan Kara
@ 2011-06-21 14:24     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-21 14:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Andrew Morton,
	LKML, mgorman

On Tue, Jun 21, 2011 at 05:18:10AM +0800, Jan Kara wrote:
> On Sun 19-06-11 23:01:11, Wu Fengguang wrote:
> > The start of a heavy weight application (ie. KVM) may instantly knock
> > down determine_dirtyable_memory() and hence the global/bdi dirty
> > thresholds.
> > 
> > So introduce global_dirty_limit for tracking the global dirty threshold
> > with policies
> > 
> > - follow downwards slowly
> > - follow up in one shot
> > 
> > global_dirty_limit can effectively mask out the impact of sudden drop of
> > dirtyable memory. It will be used in the next patch for two new type of
> > dirty limits.
> Looking into the code, this patch is dependent on previously submitted
> patches for estimation of BDI write bandwidth, isn't it?

Right... I'll submit them as one series at next post :)

> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/writeback.h |    2 +
> >  mm/page-writeback.c       |   41 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 43 insertions(+)
> > 
> > --- linux-next.orig/include/linux/writeback.h	2011-06-19 22:56:18.000000000 +0800
> > +++ linux-next/include/linux/writeback.h	2011-06-19 22:59:29.000000000 +0800
> > @@ -88,6 +88,8 @@ static inline void laptop_sync_completio
> >  #endif
> >  void throttle_vm_writeout(gfp_t gfp_mask);
> >  
> > +extern unsigned long global_dirty_limit;
> > +
> >  /* These are exported to sysctl. */
> >  extern int dirty_background_ratio;
> >  extern unsigned long dirty_background_bytes;
> > --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:56:18.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:29.000000000 +0800
> > @@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
> >  
> >  /* End of sysctl-exported parameters */
> >  
> > +unsigned long global_dirty_limit;
> >  
> >  /*
> >   * Scale the writeback cache size proportional to the relative writeout speeds.
> > @@ -510,6 +511,43 @@ static void bdi_update_write_bandwidth(s
> >  	bdi->avg_write_bandwidth = avg;
> >  }
> >  
> > +static void update_dirty_limit(unsigned long thresh,
> > +				 unsigned long dirty)
> > +{
> > +	unsigned long limit = global_dirty_limit;
> > +
> > +	if (limit < thresh) {
> > +		limit = thresh;
> > +		goto update;
> > +	}
> > +
> > +	if (limit > thresh &&
> > +	    limit > dirty) {
> > +		limit -= (limit - max(thresh, dirty)) >> 5;
> > +		goto update;
> > +	}
> Hmm, but strictly speaking this never really converges to the new limit,
> right? And even in practice it takes 22 steps to converge within 50% and 73

Right, the ">> 5" may introduce 31 pages of steady state tracking error.
And 22/73 steps of 200ms means 4.4/14.6 seconds convergence time.

The tracking error is no more serious than the typical errors in
the dirty threshold checks, where there are per-cpu counter errors
and an 8-page error per CPU even after the dirty limit is exceeded.

The convergence time is not a sensitive parameter due to the reasons
detailed below.

> steps to converge within 10% of the desired threshold. But before we get
> into the discussion about how to update the dirty threshold, I'd like to
> discuss what properties we require from dirty threshold updates.

Good point. What we need is something optimized for the common case.
The common case is

(1) <= 500 dirtier tasks
(2) >= 80MB/s write bandwidth

Under conditions (1) and (2), it does not really matter what tracking
error or convergence speed global_dirty_limit has after the dirtyable
memory is suddenly knocked down, because the immediately lowered
dirty_thresh and bdi_thresh are enough to throttle each dirty task at
160KB/s and hence bring down the dirty pages fast enough.
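
Spelling out the arithmetic behind those numbers (assuming the usual 4KB
page size, the 8-page ratelimit that applies once the limit is exceeded,
and the 200ms max pause):

	8 pages * 4KB / 200ms = 160KB/s per throttled task
	500 tasks * 160KB/s   = 80MB/s, i.e. the write bandwidth in (2)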

So this patch (and the seemingly slow tracking speed) is not a big
change to the current behavior at all, especially for the heavy
dirtiers. It does help the light dirtiers to get out of the throttle
loop within 200ms (unless it's delayed by IO).

> I understand it's kind of unexpected that when some process allocates
> anonymous memory, we suddenly stall writers to flush out dirty data. OTOH
> it is a nice behavior from memory management point of view because we
> really try to keep amount of dirty pages in free memory at given
> percentage which is nice for reclaim. When we choose to update dirty limit
> in steps as you propose, we get some kind of compromise between behavior
> for user and behavior for memory management.

We know for sure that suddenly stalling writers will hurt responsiveness,
but we are probably never sure at exactly which dirty threshold page
reclaim starts to hurt. So I opt to regard the "rule of thumb" dirty
threshold as a soft suggestion at the times we know for sure that obeying
it strictly would likely hurt.

> But there are also other options - for example it would seem natural to me
> to treat allocation of an anonymous page the same way as dirtying the page, thus
> the process getting us over dirty threshold due to allocations will wait
> until enough pages are written. Then we wouldn't need any smoothing of
> memory available for caches because allocations would be naturally
> throttled (OK, we still have memory taken by kernel allocations but these
> are order of magnitude smaller problem I'd say). But I'm not sure how
> acceptable this idea would be.

Oh no... It's already a headache how the dirty pages may slow down the
page reclaim, so no way for it to further slow down page allocations! ;)

Thanks,
Fengguang

> > +	return;
> > +update:
> > +	global_dirty_limit = limit;
> > +}
> > +
> > +static void global_update_bandwidth(unsigned long thresh,
> > +				    unsigned long dirty,
> > +				    unsigned long now)
> > +{
> > +	static DEFINE_SPINLOCK(dirty_lock);
> > +
> > +	if (now - default_backing_dev_info.bw_time_stamp < MAX_PAUSE)
> > +		return;
> > +
> > +	spin_lock(&dirty_lock);
> > +	if (now - default_backing_dev_info.bw_time_stamp >= MAX_PAUSE) {
> > +		update_dirty_limit(thresh, dirty);
> > +		default_backing_dev_info.bw_time_stamp = now;
> > +	}
> > +	spin_unlock(&dirty_lock);
> > +}
> > +
> >  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> >  			    unsigned long thresh,
> >  			    unsigned long dirty,
> > @@ -535,6 +573,9 @@ void __bdi_update_bandwidth(struct backi
> >  	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> >  		goto snapshot;
> >  
> > +	if (thresh)
> > +		global_update_bandwidth(thresh, dirty, now);
> > +
> >  	bdi_update_write_bandwidth(bdi, elapsed, written);
> >  
> >  snapshot:
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-20 21:38   ` Jan Kara
@ 2011-06-21 15:07     ` Wu Fengguang
  2011-06-21 21:14       ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-21 15:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Andrew Morton, LKML

On Tue, Jun 21, 2011 at 05:38:18AM +0800, Jan Kara wrote:
> On Sun 19-06-11 23:01:15, Wu Fengguang wrote:
> > When there are only one (or several) dirtiers, dirty_exceeded is always
> > (or mostly) off. Converting to timestamp avoids this problem. It helps
> > to use smaller write_chunk for smoother throttling.
>   Hmm, what do you mean by "dirty_exceeded is always (or mostly) off"? I
> agree that dirty_exceeded handling is broken in current kernel when there
> are more dirtiers because of tasks having different dirty limits. Is this
> the problem you are speaking about?

For the 1-dirty case, it will quit on either of the two conditions

(1) no longer dirty exceeded, or
(2) write 1.5 times what it has dirtied

Note that (2) implies (1) because there is only 1 dirtier: if it took
4MB of newly dirtied pages to push it into the dirty exceeded state, it
will surely drop out of that state after cleaning 6MB.
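
In page counts (assuming 4KB pages), that is roughly:

	dirtied before the check:     ~1024 pages = 4MB
	written per call (1.5x that): ~1536 pages = 6MB

which matches the pages=1536 figures in the traces of the patch
description.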

So the 1 dirtier will mostly see dirty_exceeded=0 inside
balance_dirty_pages_ratelimited_nr().

However, that does not look like a big problem. Since there is only 1 dirtier,
it will at most get dirty exceeded by 4MB, which is fairly acceptable.

> If yes, I think I have a cleaner fix for that - setting dirty_exceeded
> when bdi_dirty enters the range where per-task dirty limits can be and
> resetting it only when we leave that range. I can refresh the fix and
> send it to you...

Yes, there are the per-task dynamic thresholds. However, given that the
current scheme does not look too bad, maybe we can just do nothing for now
and skip this patch?

Thanks,
Fengguang

> > Before patch, the wait time in balance_dirty_pages() are ~200ms:
> > 
> > [ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
> > [ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
> > [ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
> > 
> > After patch, ~25ms:
> > 
> > [   90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
> > [   90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
> > [   90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
> > [   90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
> > [   90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
> > [   90.421787] write_bandwidth: comm=swapper pages=192 time=24ms
> > 
> >  include/linux/backing-dev.h |    2 +-
> >  mm/backing-dev.c            |    2 --
> >  mm/page-writeback.c         |   24 ++++++++++++------------
> >  3 files changed, 13 insertions(+), 15 deletions(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:49.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:53.000000000 +0800
> > @@ -483,6 +483,15 @@ unsigned long bdi_dirty_limit(struct bac
> >  	return bdi_dirty;
> >  }
> >  
> > +/*
> > + * last time exceeded (limit - limit/DIRTY_BRAKE)
> > + */
> > +static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
> > +				    unsigned long time_window)
> > +{
> > +	return jiffies - bdi->dirty_exceed_time <= time_window;
> > +}
> > +
> >  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> >  				       unsigned long elapsed,
> >  				       unsigned long written)
> > @@ -621,7 +630,6 @@ static void balance_dirty_pages(struct a
> >  	unsigned long bdi_thresh;
> >  	unsigned long pages_written = 0;
> >  	unsigned long pause = 1;
> > -	bool dirty_exceeded = false;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  	unsigned long start_time = jiffies;
> >  
> > @@ -669,14 +677,9 @@ static void balance_dirty_pages(struct a
> >  		 * bdi or process from holding back light ones; The latter is
> >  		 * the last resort safeguard.
> >  		 */
> > -		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> > -				  (nr_dirty > dirty_thresh);
> > -
> > -		if (!dirty_exceeded)
> > +		if (bdi_dirty <= bdi_thresh && nr_dirty <= dirty_thresh)
> >  			break;
> > -
> > -		if (!bdi->dirty_exceeded)
> > -			bdi->dirty_exceeded = 1;
> > +		bdi->dirty_exceed_time = jiffies;
> >  
> >  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
> >  				     start_time);
> > @@ -719,9 +722,6 @@ static void balance_dirty_pages(struct a
> >  			pause = HZ / 10;
> >  	}
> >  
> > -	if (!dirty_exceeded && bdi->dirty_exceeded)
> > -		bdi->dirty_exceeded = 0;
> > -
> >  	if (writeback_in_progress(bdi))
> >  		return;
> >  
> > @@ -775,7 +775,7 @@ void balance_dirty_pages_ratelimited_nr(
> >  		return;
> >  
> >  	ratelimit = ratelimit_pages;
> > -	if (mapping->backing_dev_info->dirty_exceeded)
> > +	if (dirty_exceeded_recently(bdi, MAX_PAUSE))
> >  		ratelimit = 8;
> >  
> >  	/*
> > --- linux-next.orig/include/linux/backing-dev.h	2011-06-19 22:54:58.000000000 +0800
> > +++ linux-next/include/linux/backing-dev.h	2011-06-19 22:59:53.000000000 +0800
> > @@ -79,7 +79,7 @@ struct backing_dev_info {
> >  	unsigned long avg_write_bandwidth;
> >  
> >  	struct prop_local_percpu completions;
> > -	int dirty_exceeded;
> > +	unsigned long dirty_exceed_time;
> >  
> >  	unsigned int min_ratio;
> >  	unsigned int max_ratio, max_prop_frac;
> > --- linux-next.orig/mm/backing-dev.c	2011-06-19 22:59:49.000000000 +0800
> > +++ linux-next/mm/backing-dev.c	2011-06-19 22:59:53.000000000 +0800
> > @@ -670,8 +670,6 @@ int bdi_init(struct backing_dev_info *bd
> >  			goto err;
> >  	}
> >  
> > -	bdi->dirty_exceeded = 0;
> > -
> >  	bdi->bw_time_stamp = jiffies;
> >  	bdi->written_stamp = 0;
> >  
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-21 15:07     ` Wu Fengguang
@ 2011-06-21 21:14       ` Jan Kara
  2011-06-22 14:37         ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-06-21 21:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Tue 21-06-11 23:07:52, Wu Fengguang wrote:
> On Tue, Jun 21, 2011 at 05:38:18AM +0800, Jan Kara wrote:
> > On Sun 19-06-11 23:01:15, Wu Fengguang wrote:
> > > When there are only one (or several) dirtiers, dirty_exceeded is always
> > > (or mostly) off. Converting to timestamp avoids this problem. It helps
> > > to use smaller write_chunk for smoother throttling.
> >   Hmm, what do you mean by "dirty_exceeded is always (or mostly) off"? I
> > agree that dirty_exceeded handling is broken in current kernel when there
> > are more dirtiers because of tasks having different dirty limits. Is this
> > the problem you are speaking about?
> 
> For the 1-dirty case, it will quit on either of the two conditions
> 
> (1) no longer dirty exceeded, or
> (2) write 1.5 times what it has dirtied
> 
> Note that (2) implies (1) because there is only 1 dirtier: if it took
> 4MB of newly dirtied pages to push it into the dirty exceeded state, it
> will surely drop out of that state after cleaning 6MB.
> 
> So the 1 dirtier will mostly see dirty_exceeded=0 inside
> balance_dirty_pages_ratelimited_nr().
> 
> However, that does not look like a big problem. Since there is only 1 dirtier,
> it will at most get dirty exceeded by 4MB, which is fairly acceptable.
  Yes, 1 dirtier case works OK.

> > If yes, I think I have a cleaner fix for that - setting dirty_exceeded
> > when bdi_dirty enters the range where per-task dirty limits can be and
> > resetting it only when we leave that range. I can refresh the fix and
> > send it to you...
> 
> Yes, there are the per-task dynamic thresholds. However, given that the
> current scheme does not look too bad, maybe we can just do nothing for now
> and skip this patch?
  Well, the current scheme has problems already when there are two
dirtiers. Assume process P1 has threshold T1 and process P2 has threshold
T2, T1 < T2. P1 first hits the threshold, sets dirty_exceeded and does some
writeback. Meanwhile P2 still dirties pages; eventually it enters
balance_dirty_pages(), sees that threshold T2 is not exceeded, and clears
dirty_exceeded even though threshold T1 is still exceeded. After dirtying
another 4 MB, P1 enters balance_dirty_pages() again and sets dirty_exceeded
again.

This way, dirty_exceeded is ping-ponged between 0 and 1 basically in a
random way. If we are unlucky enough, we can get beyond the dirty limit by
4 MB for each dirtier instead of the 32 KB originally designed...
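
(With 4KB pages: the 8-page ratelimit that applies while dirty_exceeded is
set lets at most 8 * 4KB = 32 KB be dirtied between checks, versus roughly
1024 pages = 4 MB when it is not set.)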

I already have a fix for this. Do you want me to base it on top of -mm tree
or your latest writeback series?

								Honza

> > > Before patch, the wait time in balance_dirty_pages() are ~200ms:
> > > 
> > > [ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
> > > [ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
> > > [ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
> > > 
> > > After patch, ~25ms:
> > > 
> > > [   90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
> > > [   90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
> > > [   90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
> > > [   90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
> > > [   90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
> > > [   90.421787] write_bandwidth: comm=swapper pages=192 time=24ms
> > > 
> > >  include/linux/backing-dev.h |    2 +-
> > >  mm/backing-dev.c            |    2 --
> > >  mm/page-writeback.c         |   24 ++++++++++++------------
> > >  3 files changed, 13 insertions(+), 15 deletions(-)
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:49.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:53.000000000 +0800
> > > @@ -483,6 +483,15 @@ unsigned long bdi_dirty_limit(struct bac
> > >  	return bdi_dirty;
> > >  }
> > >  
> > > +/*
> > > + * last time exceeded (limit - limit/DIRTY_BRAKE)
> > > + */
> > > +static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
> > > +				    unsigned long time_window)
> > > +{
> > > +	return jiffies - bdi->dirty_exceed_time <= time_window;
> > > +}
> > > +
> > >  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> > >  				       unsigned long elapsed,
> > >  				       unsigned long written)
> > > @@ -621,7 +630,6 @@ static void balance_dirty_pages(struct a
> > >  	unsigned long bdi_thresh;
> > >  	unsigned long pages_written = 0;
> > >  	unsigned long pause = 1;
> > > -	bool dirty_exceeded = false;
> > >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > >  	unsigned long start_time = jiffies;
> > >  
> > > @@ -669,14 +677,9 @@ static void balance_dirty_pages(struct a
> > >  		 * bdi or process from holding back light ones; The latter is
> > >  		 * the last resort safeguard.
> > >  		 */
> > > -		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> > > -				  (nr_dirty > dirty_thresh);
> > > -
> > > -		if (!dirty_exceeded)
> > > +		if (bdi_dirty <= bdi_thresh && nr_dirty <= dirty_thresh)
> > >  			break;
> > > -
> > > -		if (!bdi->dirty_exceeded)
> > > -			bdi->dirty_exceeded = 1;
> > > +		bdi->dirty_exceed_time = jiffies;
> > >  
> > >  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
> > >  				     start_time);
> > > @@ -719,9 +722,6 @@ static void balance_dirty_pages(struct a
> > >  			pause = HZ / 10;
> > >  	}
> > >  
> > > -	if (!dirty_exceeded && bdi->dirty_exceeded)
> > > -		bdi->dirty_exceeded = 0;
> > > -
> > >  	if (writeback_in_progress(bdi))
> > >  		return;
> > >  
> > > @@ -775,7 +775,7 @@ void balance_dirty_pages_ratelimited_nr(
> > >  		return;
> > >  
> > >  	ratelimit = ratelimit_pages;
> > > -	if (mapping->backing_dev_info->dirty_exceeded)
> > > +	if (dirty_exceeded_recently(bdi, MAX_PAUSE))
> > >  		ratelimit = 8;
> > >  
> > >  	/*
> > > --- linux-next.orig/include/linux/backing-dev.h	2011-06-19 22:54:58.000000000 +0800
> > > +++ linux-next/include/linux/backing-dev.h	2011-06-19 22:59:53.000000000 +0800
> > > @@ -79,7 +79,7 @@ struct backing_dev_info {
> > >  	unsigned long avg_write_bandwidth;
> > >  
> > >  	struct prop_local_percpu completions;
> > > -	int dirty_exceeded;
> > > +	unsigned long dirty_exceed_time;
> > >  
> > >  	unsigned int min_ratio;
> > >  	unsigned int max_ratio, max_prop_frac;
> > > --- linux-next.orig/mm/backing-dev.c	2011-06-19 22:59:49.000000000 +0800
> > > +++ linux-next/mm/backing-dev.c	2011-06-19 22:59:53.000000000 +0800
> > > @@ -670,8 +670,6 @@ int bdi_init(struct backing_dev_info *bd
> > >  			goto err;
> > >  	}
> > >  
> > > -	bdi->dirty_exceeded = 0;
> > > -
> > >  	bdi->bw_time_stamp = jiffies;
> > >  	bdi->written_stamp = 0;
> > >  
> > > 
> > > 
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-19 15:55     ` Wu Fengguang
@ 2011-06-21 23:59       ` Andrew Morton
  2011-06-22 14:11         ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2011-06-21 23:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-fsdevel, Jan Kara, Dave Chinner, LKML

On Sun, 19 Jun 2011 23:55:47 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Sun, Jun 19, 2011 at 11:36:37PM +0800, Christoph Hellwig wrote:
> > On Sun, Jun 19, 2011 at 11:01:11PM +0800, Wu Fengguang wrote:
> > > The start of a heavy weight application (ie. KVM) may instantly knock
> > > down determine_dirtyable_memory() and hence the global/bdi dirty
> > > thresholds.
> > > 
> > > So introduce global_dirty_limit for tracking the global dirty threshold
> > > with policies
> > > 
> > > - follow downwards slowly
> > > - follow up in one shot
> > > 
> > > global_dirty_limit can effectively mask out the impact of sudden drop of
> > > dirtyable memory. It will be used in the next patch for two new type of
> > > dirty limits.
> > 
> > This needs to be explained in more detail in comments near the actual
> > code.
> 
> Good point! This is the added comment for the update_dirty_limit() function.
> 
> /*
>  * The global dirtyable memory and dirty threshold could be suddenly knocked
>  * down by a large amount (eg. on the startup of KVM). This may throw the
>  * system into deep dirty exceeded state and throttled to "death" for a couple
>  * of seconds. The solution is to maintain global_dirty_limit for tracking
>  * slowly down to the knocked down dirty threshold.
>  */
> static void update_dirty_limit(unsigned long thresh,
>                                unsigned long dirty)
> 

Neither this nor the changelog explains things well.

Looking at the code, KVM starts, allocates memory,
global_page_state(NR_FREE_PAGES) decreases by N and
global_reclaimable_pages() increases by N.  Nothing changed.

So what's going on here?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
  2011-06-19 15:36   ` Christoph Hellwig
  2011-06-20 21:18   ` Jan Kara
@ 2011-06-22  0:04   ` Andrew Morton
  2011-06-22 14:24     ` Wu Fengguang
  2 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2011-06-22  0:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig, LKML

On Sun, 19 Jun 2011 23:01:11 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> The start of a heavy weight application (ie. KVM) may instantly knock
> down determine_dirtyable_memory() and hence the global/bdi dirty
> thresholds.
> 
> So introduce global_dirty_limit for tracking the global dirty threshold
> with policies
> 
> - follow downwards slowly
> - follow up in one shot
> 
> global_dirty_limit can effectively mask out the impact of sudden drop of
> dirtyable memory. It will be used in the next patch for two new type of
> dirty limits.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  include/linux/writeback.h |    2 +
>  mm/page-writeback.c       |   41 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+)
> 
> --- linux-next.orig/include/linux/writeback.h	2011-06-19 22:56:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-06-19 22:59:29.000000000 +0800
> @@ -88,6 +88,8 @@ static inline void laptop_sync_completio
>  #endif
>  void throttle_vm_writeout(gfp_t gfp_mask);
>  
> +extern unsigned long global_dirty_limit;
> +
>  /* These are exported to sysctl. */
>  extern int dirty_background_ratio;
>  extern unsigned long dirty_background_bytes;
> --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:56:18.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:29.000000000 +0800
> @@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
>  
>  /* End of sysctl-exported parameters */
>  
> +unsigned long global_dirty_limit;
>  
>  /*
>   * Scale the writeback cache size proportional to the relative writeout speeds.
> @@ -510,6 +511,43 @@ static void bdi_update_write_bandwidth(s
>  	bdi->avg_write_bandwidth = avg;
>  }
>  
> +static void update_dirty_limit(unsigned long thresh,
> +				 unsigned long dirty)
> +{
> +	unsigned long limit = global_dirty_limit;
> +
> +	if (limit < thresh) {
> +		limit = thresh;
> +		goto update;
> +	}
> +
> +	if (limit > thresh &&
> +	    limit > dirty) {
> +		limit -= (limit - max(thresh, dirty)) >> 5;
> +		goto update;
> +	}
> +	return;
> +update:
> +	global_dirty_limit = limit;
> +}

Are
you
using
a
30
column
monitor
over
there?


This function is just crazy.  It compares various things, applies
limits, churns them all together with magic constants and does it all
in a refreshingly documentation-free manner.

How the heck is anyone supposed to understand what you were thinking
when you typed it in?

Please, write for an audience.

> +static void global_update_bandwidth(unsigned long thresh,
> +				    unsigned long dirty,
> +				    unsigned long now)
> +{
> +	static DEFINE_SPINLOCK(dirty_lock);
> +
> +	if (now - default_backing_dev_info.bw_time_stamp < MAX_PAUSE)
> +		return;
> +
> +	spin_lock(&dirty_lock);
> +	if (now - default_backing_dev_info.bw_time_stamp >= MAX_PAUSE) {
> +		update_dirty_limit(thresh, dirty);
> +		default_backing_dev_info.bw_time_stamp = now;
> +	}
> +	spin_unlock(&dirty_lock);
> +}

Why is it playing with default_backing_dev_info?  That's only there to
support filesystems which were too old-and-slack to implement
backing-devs properly and it really shouldn't exist at all.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits
  2011-06-19 15:01 ` [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
@ 2011-06-22  0:20   ` Andrew Morton
  2011-06-23 13:18     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2011-06-22  0:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig, LKML

On Sun, 19 Jun 2011 23:01:12 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> The max-pause limit helps to keep the sleep time inside
> balance_dirty_pages() within 200ms. The 200ms max sleep means per task
> rate limit of 8pages/200ms=160KB/s, which normally is enough to stop
> dirtiers from continue pushing the dirty pages high, unless there are
> a sufficient large number of slow dirtiers (ie. 500 tasks doing 160KB/s
> will still sum up to 80MB/s, reaching the write bandwidth of a slow disk).
> 
> The pass-good limit helps to let go of the good bdi's in the presence of
> a blocked bdi (ie. NFS server not responding) or slow USB disk which for
> some reason build up a large number of initial dirty pages that refuse
> to go away anytime soon.

The hard-wired numbers and hard-wired assumptions about device speeds
shouldn't be here at all.  They will be sub-optimal (and sometimes
extremely so) for all cases.  They will become wronger over time.  Or
less wrong, depending upon which way they were originally wrong.

> +		dirty_thresh = hard_dirty_limit(dirty_thresh);
> +		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE &&
> +		    jiffies - start_time > MAX_PAUSE)
> +			break;
> +		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD &&
> +		    bdi_dirty < bdi_thresh)
> +			break;

It appears that despite their similarity, DIRTY_MAXPAUSE is a
dimensionless value whereas the units of MAX_PAUSE is jiffies.  Perhaps
more care in naming choices would clarify things like this.

The first comparison might be clearer if it used time_after().

Both statements need comments explaining what they do and *why they do
it*.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-21 23:59       ` Andrew Morton
@ 2011-06-22 14:11         ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-22 14:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, linux-fsdevel, Jan Kara, Dave Chinner, LKML

On Wed, Jun 22, 2011 at 07:59:05AM +0800, Andrew Morton wrote:
> On Sun, 19 Jun 2011 23:55:47 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > On Sun, Jun 19, 2011 at 11:36:37PM +0800, Christoph Hellwig wrote:
> > > On Sun, Jun 19, 2011 at 11:01:11PM +0800, Wu Fengguang wrote:
> > > > The start of a heavy weight application (ie. KVM) may instantly knock
> > > > down determine_dirtyable_memory() and hence the global/bdi dirty
> > > > thresholds.
> > > > 
> > > > So introduce global_dirty_limit for tracking the global dirty threshold
> > > > with policies
> > > > 
> > > > - follow downwards slowly
> > > > - follow up in one shot
> > > > 
> > > > global_dirty_limit can effectively mask out the impact of sudden drop of
> > > > dirtyable memory. It will be used in the next patch for two new type of
> > > > dirty limits.
> > > 
> > > This needs to be explained in more detail in comments near the actual
> > > code.
> > 
> > Good point! This is the added comment for the update_dirty_limit() function.
> > 
> > /*
> >  * The global dirtyable memory and dirty threshold could be suddenly knocked
> >  * down by a large amount (eg. on the startup of KVM). This may throw the
> >  * system into deep dirty exceeded state and throttled to "death" for a couple
> >  * of seconds. The solution is to maintain global_dirty_limit for tracking
> >  * slowly down to the knocked down dirty threshold.
> >  */
> > static void update_dirty_limit(unsigned long thresh,
> >                                unsigned long dirty)
> > 
> 
> Neither this nor the changelog explain things well. 

Yeah, sorry... it's vague and partially flawed, as you discovered below.

> Looking at the code, KVM starts, allocates memory,
> global_page_state(NR_FREE_PAGES) decreases by N and
> global_reclaimable_pages() increases by N.  Nothing changed.

...that holds only if swap is enabled. As my test boxes typically don't
have swap enabled, I literally see the sudden drop of the dirty threshold...
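
To illustrate with made-up numbers (and a rough sketch of the helpers
from memory, not a verbatim copy):

	/* determine_dirtyable_memory(), roughly (modulo highmem): */
	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
	return x + 1;
	/*
	 * where global_reclaimable_pages() counts the file LRU pages, plus
	 * the anon LRU pages only when nr_swap_pages > 0
	 */

With swap, a 12GB KVM allocation on a 16GB box merely moves ~12GB from the
free term into the anon part of the reclaimable term, so dirtyable memory
and hence dirty_thresh stay put -- the "nothing changed" case you describe.
Without swap, the anon pages count in neither term, so dirtyable memory
drops by ~12GB in one step and dirty_thresh (at dirty_ratio=20) falls from
~3GB to a few hundred MB, possibly far below the pages already dirtied.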

> So what's going on here?

Here is the patch with a more detailed changelog and comments. Hope they
are clearer this time.

Thanks,
Fengguang
---
Subject: writeback: introduce smoothed global dirty limit
Date: Wed Mar 02 15:54:09 CST 2011

The start of a heavyweight application (eg. KVM) may instantly knock
down determine_dirtyable_memory() if swap is not enabled or is full.
global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold
with policies

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of a sudden drop in
dirtyable memory. It will be used in the next patch for two new types of
dirty limits. Note that the new dirty limits will not avoid throttling the
light dirtiers, but they can limit their sleep time to 200ms.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 -
 include/linux/writeback.h |    5 ++
 mm/page-writeback.c       |   72 +++++++++++++++++++++++++++++++++++-
 3 files changed, 76 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2011-06-20 00:23:59.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-22 20:43:13.000000000 +0800
@@ -84,6 +84,8 @@ static inline void laptop_sync_completio
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 
+extern unsigned long global_dirty_limit;
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern unsigned long dirty_background_bytes;
@@ -119,6 +121,9 @@ unsigned long bdi_dirty_limit(struct bac
 			       unsigned long dirty);
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long thresh,
+			    unsigned long dirty,
+			    unsigned long bdi_dirty,
 			    unsigned long start_time);
 
 void page_writeback_init(void);
--- linux-next.orig/mm/page-writeback.c	2011-06-20 00:23:59.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-22 22:06:29.000000000 +0800
@@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+unsigned long global_dirty_limit;
 
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
@@ -510,7 +511,66 @@ static void bdi_update_write_bandwidth(s
 	bdi->avg_write_bandwidth = avg;
 }
 
+/*
+ * The global dirtyable memory and dirty threshold could be suddenly knocked
+ * down by a large amount (eg. on the startup of KVM in a swapless system).
+ * This may throw the system into deep dirty exceeded state and throttle
+ * heavy/light dirtiers alike. To retain good responsiveness, maintain
+ * global_dirty_limit for tracking slowly down to the knocked down dirty
+ * threshold.
+ */
+static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
+{
+	unsigned long limit = global_dirty_limit;
+
+	/*
+	 * Follow up in one step.
+	 */
+	if (limit < thresh) {
+		limit = thresh;
+		goto update;
+	}
+
+	/*
+	 * Follow down slowly. Use the higher one as the target, because thresh
+	 * may drop below dirty. This is exactly the reason to introduce
+	 * global_dirty_limit which is guaranteed to lie above the dirty pages.
+	 */
+	thresh = max(thresh, dirty);
+	if (limit > thresh) {
+		limit -= (limit - thresh) >> 5;
+		goto update;
+	}
+	return;
+update:
+	global_dirty_limit = limit;
+}
+
+static void global_update_bandwidth(unsigned long thresh,
+				    unsigned long dirty,
+				    unsigned long now)
+{
+	static DEFINE_SPINLOCK(dirty_lock);
+	static unsigned long update_time;
+
+	/*
+	 * Do a lockless check first to optimize away locking for most time.
+	 */
+	if (now - update_time < MAX_PAUSE)
+		return;
+
+	spin_lock(&dirty_lock);
+	if (now - update_time >= MAX_PAUSE) {
+		update_dirty_limit(thresh, dirty);
+		update_time = now;
+	}
+	spin_unlock(&dirty_lock);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long thresh,
+			    unsigned long dirty,
+			    unsigned long bdi_dirty,
 			    unsigned long start_time)
 {
 	unsigned long now = jiffies;
@@ -532,6 +592,9 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
+	if (thresh)
+		global_update_bandwidth(thresh, dirty, now);
+
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
@@ -540,12 +603,16 @@ snapshot:
 }
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
+				 unsigned long thresh,
+				 unsigned long dirty,
+				 unsigned long bdi_dirty,
 				 unsigned long start_time)
 {
 	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
 		return;
 	if (spin_trylock(&bdi->wb.list_lock)) {
-		__bdi_update_bandwidth(bdi, start_time);
+		__bdi_update_bandwidth(bdi, thresh, dirty, bdi_dirty,
+				       start_time);
 		spin_unlock(&bdi->wb.list_lock);
 	}
 }
@@ -625,7 +692,8 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-06-20 00:23:59.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-22 20:43:12.000000000 +0800
@@ -699,7 +699,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, start_time);
 }
 
 /*

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/7] writeback: introduce smoothed global dirty limit
  2011-06-22  0:04   ` Andrew Morton
@ 2011-06-22 14:24     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-22 14:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig, LKML

On Wed, Jun 22, 2011 at 08:04:44AM +0800, Andrew Morton wrote:
> On Sun, 19 Jun 2011 23:01:11 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > The start of a heavy weight application (ie. KVM) may instantly knock
> > down determine_dirtyable_memory() and hence the global/bdi dirty
> > thresholds.
> > 
> > So introduce global_dirty_limit for tracking the global dirty threshold
> > with policies
> > 
> > - follow downwards slowly
> > - follow up in one shot
> > 
> > global_dirty_limit can effectively mask out the impact of sudden drop of
> > dirtyable memory. It will be used in the next patch for two new type of
> > dirty limits.
> > 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  include/linux/writeback.h |    2 +
> >  mm/page-writeback.c       |   41 ++++++++++++++++++++++++++++++++++++
> >  2 files changed, 43 insertions(+)
> > 
> > --- linux-next.orig/include/linux/writeback.h	2011-06-19 22:56:18.000000000 +0800
> > +++ linux-next/include/linux/writeback.h	2011-06-19 22:59:29.000000000 +0800
> > @@ -88,6 +88,8 @@ static inline void laptop_sync_completio
> >  #endif
> >  void throttle_vm_writeout(gfp_t gfp_mask);
> >  
> > +extern unsigned long global_dirty_limit;
> > +
> >  /* These are exported to sysctl. */
> >  extern int dirty_background_ratio;
> >  extern unsigned long dirty_background_bytes;
> > --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:56:18.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:29.000000000 +0800
> > @@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
> >  
> >  /* End of sysctl-exported parameters */
> >  
> > +unsigned long global_dirty_limit;
> >  
> >  /*
> >   * Scale the writeback cache size proportional to the relative writeout speeds.
> > @@ -510,6 +511,43 @@ static void bdi_update_write_bandwidth(s
> >  	bdi->avg_write_bandwidth = avg;
> >  }
> >  
> > +static void update_dirty_limit(unsigned long thresh,
> > +				 unsigned long dirty)
> > +{
> > +	unsigned long limit = global_dirty_limit;
> > +
> > +	if (limit < thresh) {
> > +		limit = thresh;
> > +		goto update;
> > +	}
> > +
> > +	if (limit > thresh &&
> > +	    limit > dirty) {
> > +		limit -= (limit - max(thresh, dirty)) >> 5;
> > +		goto update;
> > +	}
> > +	return;
> > +update:
> > +	global_dirty_limit = limit;
> > +}
> 
> Are
> you
> using
> a
> 30
> column
> monitor
> over
> there?

Err nope... fixed.

> 
> This function is just crazy.  It compares various things, applies
> limits, churns them all together with magic constants and does it all
> in a refreshingly documentation-free manner.
> 
> How the heck is anyone supposed to understand what you were thinking
> when you typed it in?
> 
> Please, write for an audience.

Right, it's kind of playing black magic... hope the reposted patch
makes it clearer.
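
For the record, the idea behind the 1/32 step, with illustrative numbers
(not from a real trace): each update closes 1/32 of the remaining gap
between global_dirty_limit and max(thresh, dirty), and updates happen at
most once per MAX_PAUSE (200ms), only while some task is inside
balance_dirty_pages(). So if the limit starts at 3000MB and the target
drops to 2000MB, the first step is ~31MB, about 63% of the gap is gone
after ~32 updates (>= 6.4 seconds), and the limit largely converges within
a couple of tens of seconds -- slow enough not to suddenly throttle the
light dirtiers, fast enough to follow the real threshold down.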

> > +static void global_update_bandwidth(unsigned long thresh,
> > +				    unsigned long dirty,
> > +				    unsigned long now)
> > +{
> > +	static DEFINE_SPINLOCK(dirty_lock);
> > +
> > +	if (now - default_backing_dev_info.bw_time_stamp < MAX_PAUSE)
> > +		return;
> > +
> > +	spin_lock(&dirty_lock);
> > +	if (now - default_backing_dev_info.bw_time_stamp >= MAX_PAUSE) {
> > +		update_dirty_limit(thresh, dirty);
> > +		default_backing_dev_info.bw_time_stamp = now;
> > +	}
> > +	spin_unlock(&dirty_lock);
> > +}
> 
> Why is it playing with default_backing_dev_info?  That's only there to
> support filesystems which were too old-and-slack to implement
> backing-devs properly and it really shouldn't exist at all.

Good point!  So I replaced default_backing_dev_info.bw_time_stamp with
a static local variable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state
  2011-06-21 21:14       ` Jan Kara
@ 2011-06-22 14:37         ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-22 14:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Andrew Morton, LKML

On Wed, Jun 22, 2011 at 05:14:20AM +0800, Jan Kara wrote:
> On Tue 21-06-11 23:07:52, Wu Fengguang wrote:
> > On Tue, Jun 21, 2011 at 05:38:18AM +0800, Jan Kara wrote:
> > > On Sun 19-06-11 23:01:15, Wu Fengguang wrote:
> > > > When there are only one (or several) dirtiers, dirty_exceeded is always
> > > > (or mostly) off. Converting to timestamp avoids this problem. It helps
> > > > to use smaller write_chunk for smoother throttling.
> > >   Hmm, what do you mean by "dirty_exceeded is always (or mostly) off"? I
> > > agree that dirty_exceeded handling is broken in current kernel when there
> > > are more dirtiers because of tasks having different dirty limits. Is this
> > > the problem you are speaking about?
> > 
> > For the 1-dirty case, it will quit on either of the two conditions
> > 
> > (1) no longer dirty exceeded, or
> > (2) write 1.5 times what it has dirtied
> > 
> > Note that (2) implies (1) because there is only 1 dirtier: if it takes
> > 4MB pages to make it dirty exceeded, it will sure drop out of the dirty
> > exceeded state after cleaned 6MB.
> > 
> > So the 1 dirtier will mostly see dirty_exceeded=0 inside
> > balance_dirty_pages_ratelimited_nr().
> > 
> > However that looks not a big problem. Since there is only 1 dirtier,
> > it will at most get dirty exceeded by 4MB, which is fairly acceptable.
>   Yes, 1 dirtier case works OK.
> 
> > > If yes, I think I have a cleaner fix for that - setting dirty_exceeded
> > > when bdi_dirty enters the range where per-task dirty limits can be and
> > > resetting it only when we leave that range. I can refresh the fix and
> > > send it to you...
> > 
> > Yes there is the per-task dynamic thresholds. However given that the
> > current scheme looks not too bad, maybe we can just do nothing for now
> > and skip this patch?
>   Well, the current scheme has problems already when there are two
> dirtiers. Assume process P1 has threshold T1 and process P2 has threshold
> T2, T1 < T2. P1 first hits the threshold, sets dirty_exceeded and does some
> writeback. Meanwhile P2 still dirties pages, eventually it enters
> balance_dirty_pages() sees threshold T2 is not exceeded and clears
> dirty_exceeded even though threshold T1 still is exceeded. After dirtying
> another 4 MB, P1 enters balance_dirty_pages() again and sets dirty_exceeded
> again.

That's right. Good explanations!

> This way, dirty_exceeded is ping-ponged between 0 and 1 basically in a
> random way. If we are unlucky enough, we can get beyond dirty limit by 4 MB
> for each dirtier instead of 32 KB originally designed...

The worst case is approaching the lightest dirtier's threshold and
randomly throttling it.
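
(IIRC the mechanics behind the 4MB figure: once dirty_exceeded gets
cleared, balance_dirty_pages_ratelimited_nr() falls back from the 8-page
(32KB) ratelimit to the full ratelimit_pages, which is capped at 4MB worth
of pages, so a light dirtier may overshoot by up to 4MB between checks.)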

> I already have a fix for this. Do you want me to base it on top of -mm tree
> or your latest writeback series?

OK, please do. Either base is fine.

Thanks,
Fengguang

> > > > Before patch, the wait time in balance_dirty_pages() are ~200ms:
> > > > 
> > > > [ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
> > > > [ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
> > > > [ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
> > > > 
> > > > After patch, ~25ms:
> > > > 
> > > > [   90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
> > > > [   90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
> > > > [   90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
> > > > [   90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
> > > > [   90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
> > > > [   90.421787] write_bandwidth: comm=swapper pages=192 time=24ms
> > > > 
> > > >  include/linux/backing-dev.h |    2 +-
> > > >  mm/backing-dev.c            |    2 --
> > > >  mm/page-writeback.c         |   24 ++++++++++++------------
> > > >  3 files changed, 13 insertions(+), 15 deletions(-)
> > > > 
> > > > --- linux-next.orig/mm/page-writeback.c	2011-06-19 22:59:49.000000000 +0800
> > > > +++ linux-next/mm/page-writeback.c	2011-06-19 22:59:53.000000000 +0800
> > > > @@ -483,6 +483,15 @@ unsigned long bdi_dirty_limit(struct bac
> > > >  	return bdi_dirty;
> > > >  }
> > > >  
> > > > +/*
> > > > + * last time exceeded (limit - limit/DIRTY_BRAKE)
> > > > + */
> > > > +static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
> > > > +				    unsigned long time_window)
> > > > +{
> > > > +	return jiffies - bdi->dirty_exceed_time <= time_window;
> > > > +}
> > > > +
> > > >  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> > > >  				       unsigned long elapsed,
> > > >  				       unsigned long written)
> > > > @@ -621,7 +630,6 @@ static void balance_dirty_pages(struct a
> > > >  	unsigned long bdi_thresh;
> > > >  	unsigned long pages_written = 0;
> > > >  	unsigned long pause = 1;
> > > > -	bool dirty_exceeded = false;
> > > >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > > >  	unsigned long start_time = jiffies;
> > > >  
> > > > @@ -669,14 +677,9 @@ static void balance_dirty_pages(struct a
> > > >  		 * bdi or process from holding back light ones; The latter is
> > > >  		 * the last resort safeguard.
> > > >  		 */
> > > > -		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> > > > -				  (nr_dirty > dirty_thresh);
> > > > -
> > > > -		if (!dirty_exceeded)
> > > > +		if (bdi_dirty <= bdi_thresh && nr_dirty <= dirty_thresh)
> > > >  			break;
> > > > -
> > > > -		if (!bdi->dirty_exceeded)
> > > > -			bdi->dirty_exceeded = 1;
> > > > +		bdi->dirty_exceed_time = jiffies;
> > > >  
> > > >  		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
> > > >  				     start_time);
> > > > @@ -719,9 +722,6 @@ static void balance_dirty_pages(struct a
> > > >  			pause = HZ / 10;
> > > >  	}
> > > >  
> > > > -	if (!dirty_exceeded && bdi->dirty_exceeded)
> > > > -		bdi->dirty_exceeded = 0;
> > > > -
> > > >  	if (writeback_in_progress(bdi))
> > > >  		return;
> > > >  
> > > > @@ -775,7 +775,7 @@ void balance_dirty_pages_ratelimited_nr(
> > > >  		return;
> > > >  
> > > >  	ratelimit = ratelimit_pages;
> > > > -	if (mapping->backing_dev_info->dirty_exceeded)
> > > > +	if (dirty_exceeded_recently(bdi, MAX_PAUSE))
> > > >  		ratelimit = 8;
> > > >  
> > > >  	/*
> > > > --- linux-next.orig/include/linux/backing-dev.h	2011-06-19 22:54:58.000000000 +0800
> > > > +++ linux-next/include/linux/backing-dev.h	2011-06-19 22:59:53.000000000 +0800
> > > > @@ -79,7 +79,7 @@ struct backing_dev_info {
> > > >  	unsigned long avg_write_bandwidth;
> > > >  
> > > >  	struct prop_local_percpu completions;
> > > > -	int dirty_exceeded;
> > > > +	unsigned long dirty_exceed_time;
> > > >  
> > > >  	unsigned int min_ratio;
> > > >  	unsigned int max_ratio, max_prop_frac;
> > > > --- linux-next.orig/mm/backing-dev.c	2011-06-19 22:59:49.000000000 +0800
> > > > +++ linux-next/mm/backing-dev.c	2011-06-19 22:59:53.000000000 +0800
> > > > @@ -670,8 +670,6 @@ int bdi_init(struct backing_dev_info *bd
> > > >  			goto err;
> > > >  	}
> > > >  
> > > > -	bdi->dirty_exceeded = 0;
> > > > -
> > > >  	bdi->bw_time_stamp = jiffies;
> > > >  	bdi->written_stamp = 0;
> > > >  
> > > > 
> > > > 
> > > -- 
> > > Jan Kara <jack@suse.cz>
> > > SUSE Labs, CR
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits
  2011-06-22  0:20   ` Andrew Morton
@ 2011-06-23 13:18     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-23 13:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig, LKML

Andrew,

On Wed, Jun 22, 2011 at 08:20:43AM +0800, Andrew Morton wrote:
> On Sun, 19 Jun 2011 23:01:12 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > The max-pause limit helps to keep the sleep time inside
> > balance_dirty_pages() within 200ms. The 200ms max sleep means per task
> > rate limit of 8pages/200ms=160KB/s, which normally is enough to stop
> > dirtiers from continue pushing the dirty pages high, unless there are
> > a sufficient large number of slow dirtiers (ie. 500 tasks doing 160KB/s
> > will still sum up to 80MB/s, reaching the write bandwidth of a slow disk).
> > 
> > The pass-good limit helps to let go of the good bdi's in the presence of
> > a blocked bdi (ie. NFS server not responding) or slow USB disk which for
> > some reason build up a large number of initial dirty pages that refuse
> > to go away anytime soon.
> 
> The hard-wired numbers and hard-wired assumptions about device speeds
> shouldn't be here at all.  They will be sub-optimal (and sometimes
> extremely so) for all cases.  They will become wronger over time.  Or
> less wrong, depending upon which way they were originally wrong.

Yeah, the changelog contains many specific numbers as examples...

The 200ms is mainly based on the maximum humanly acceptable perceptible
delay. It may be too large in some cases, so the value will be decreased
adaptively in a patch to be submitted later.

DIRTY_MAXPAUSE and DIRTY_PASSGOOD yield values proportional to
dirty_thresh, so they can reasonably be regarded as adaptive.
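
For illustration (round numbers only): with dirty_thresh = 1000MB,
DIRTY_MAXPAUSE_AREA=16 puts the max-pause region at (1000MB, 1062MB] and
DIRTY_PASSGOOD_AREA=8 puts the pass-good region at (1062MB, 1125MB]; both
regions are ~62MB wide here, and they grow and shrink with the threshold
rather than being fixed page counts.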

> > +		dirty_thresh = hard_dirty_limit(dirty_thresh);
> > +		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE &&
> > +		    jiffies - start_time > MAX_PAUSE)
> > +			break;
> > +		if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD &&
> > +		    bdi_dirty < bdi_thresh)
> > +			break;
> 
> It appears that despite their similarity, DIRTY_MAXPAUSE is a
> dimensionless value whereas the units of MAX_PAUSE is jiffies.  Perhaps
> more care in naming choices would clarify things like this.

OK. I'll add _AREA suffixes to the DIRTY_MAXPAUSE macros to tell the
difference.

> The first comparison might be clearer if it used time_after().

Yeah, changed to time_after().

> Both statements need comments explaining what they do and *why they do it*.

Here is the patch with more comments :)

Thanks,
Fengguang
---
Subject: writeback: introduce max-pause and pass-good dirty limits
Date: Sun Jun 19 22:18:42 CST 2011

The max-pause limit helps to keep the sleep time inside
balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means a
per-task rate limit of 8 pages/200ms = 160KB/s when dirty exceeded, which
normally is enough to stop dirtiers from continuing to push the dirty
pages high, unless there is a sufficiently large number of slow dirtiers
(eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the
write bandwidth of a slow disk and hence accumulating more and more dirty
pages).

The pass-good limit helps to let go of the good bdi's in the presence of
a blocked bdi (eg. an NFS server not responding) or a slow USB disk which
for some reason builds up a large number of initial dirty pages that
refuse to go away anytime soon.

For example, given two bdi's A and B and the initial state

	bdi_thresh_A = dirty_thresh / 2
	bdi_thresh_B = dirty_thresh / 2
	bdi_dirty_A  = dirty_thresh / 2
	bdi_dirty_B  = dirty_thresh / 2

Then A get blocked, after a dozen seconds

	bdi_thresh_A = 0
	bdi_thresh_B = dirty_thresh
	bdi_dirty_A  = dirty_thresh / 2
	bdi_dirty_B  = dirty_thresh / 2

The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
will be effectively throttled by condition (nr_dirty < dirty_thresh).
This has two problems:
(1) we lose the protections for light dirtiers
(2) balance_dirty_pages() effectively becomes IO-less because the
    (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
    for IO, but balance_dirty_pages() loses an important way to break
    out of the loop which leads to more spread out throttle delays.

DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is,
DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
example, while this patch uses the more conservative value 8 so as not to
surprise people with more dirty pages than expected.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |   21 +++++++++++++++++++++
 mm/page-writeback.c       |   28 ++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-23 10:31:40.000000000 +0800
@@ -7,6 +7,27 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+/*
+ * The 1/16 region above the global dirty limit will be put to maximum pauses:
+ *
+ *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
+ *
+ * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
+ * to loops:
+ *
+ *	(limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
+ *
+ * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
+ * time) for the dirty pages to drop, unless written enough pages.
+ *
+ * The global dirty threshold is normally equal to the global dirty limit,
+ * except when the system suddenly allocates a lot of anonymous memory and
+ * knocks down the global dirty threshold quickly, in which case the global
+ * dirty limit will follow down slowly to prevent livelocking all dirtier tasks.
+ */
+#define DIRTY_MAXPAUSE_AREA		16
+#define DIRTY_PASSGOOD_AREA		8
+
 struct backing_dev_info;
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-23 10:59:47.000000000 +0800
@@ -399,6 +399,11 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long hard_dirty_limit(unsigned long thresh)
+{
+	return max(thresh, global_dirty_limit);
+}
+
 /*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
@@ -716,6 +721,29 @@ static void balance_dirty_pages(struct a
 		io_schedule_timeout(pause);
 		trace_balance_dirty_wait(bdi);
 
+		dirty_thresh = hard_dirty_limit(dirty_thresh);
+		/*
+		 * max-pause area. If dirty exceeded but still within this
+		 * area, no need to sleep for more than 200ms: (a) 8 pages per
+		 * 200ms is typically more than enough to curb heavy dirtiers;
+		 * (b) the pause time limit makes the dirtiers more responsive.
+		 */
+		if (nr_dirty < dirty_thresh +
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
+		    time_after(jiffies, start_time + MAX_PAUSE))
+			break;
+		/*
+		 * pass-good area. When some bdi gets blocked (eg. NFS server
+		 * not responding), or write bandwidth dropped dramatically due
+		 * to concurrent reads, or dirty threshold suddenly dropped and
+		 * the dirty pages cannot be brought down anytime soon (eg. on
+		 * slow USB stick), at least let go of the good bdi's.
+		 */
+		if (nr_dirty < dirty_thresh +
+			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
+		    bdi_dirty < bdi_thresh)
+			break;
+
 		/*
 		 * Increase the delay for each loop, up to our previous
 		 * default of taking a 100ms nap.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2011-06-23 13:18 UTC | newest]

Thread overview: 29+ messages
2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
2011-06-19 15:01 ` [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2011-06-20  7:45   ` Christoph Hellwig
2011-06-19 15:01 ` [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth() Wu Fengguang
2011-06-19 15:31   ` Christoph Hellwig
2011-06-19 15:35     ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
2011-06-19 15:36   ` Christoph Hellwig
2011-06-19 15:55     ` Wu Fengguang
2011-06-21 23:59       ` Andrew Morton
2011-06-22 14:11         ` Wu Fengguang
2011-06-20 21:18   ` Jan Kara
2011-06-21 14:24     ` Wu Fengguang
2011-06-22  0:04   ` Andrew Morton
2011-06-22 14:24     ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
2011-06-22  0:20   ` Andrew Morton
2011-06-23 13:18     ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 5/7] writeback: make writeback_control.nr_to_write straight Wu Fengguang
2011-06-19 15:35   ` Christoph Hellwig
2011-06-19 16:14     ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 6/7] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
2011-06-19 15:01 ` [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state Wu Fengguang
2011-06-20 20:09   ` Christoph Hellwig
2011-06-21 10:00     ` Steven Whitehouse
2011-06-20 21:38   ` Jan Kara
2011-06-21 15:07     ` Wu Fengguang
2011-06-21 21:14       ` Jan Kara
2011-06-22 14:37         ` Wu Fengguang
