* [PATCH 0/9] write bandwidth estimation and writeback fixes v2
@ 2011-06-29 14:52 Wu Fengguang
  2011-06-29 14:52 ` [PATCH 1/9] writeback: make writeback_control.nr_to_write straight Wu Fengguang
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

Hi,

These patches are mainly separated out from the original IO-less writeback
patchset. They are generally useful as standalone improvements.

The v1 patches were posted in two threads:

	http://lkml.org/lkml/2011/6/12/69
	http://lkml.org/lkml/2011/6/19/97

In addition to the changes posted in the above threads, I also extended the
global_dirty_state trace with more fields, and improved some of the changelogs.

	[PATCH 1/9] writeback: make writeback_control.nr_to_write straight
	[PATCH 2/9] writeback: account per-bdi accumulated written pages
	[PATCH 3/9] writeback: bdi write bandwidth estimation
	[PATCH 4/9] writeback: show bdi write bandwidth in debugfs
	[PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages()
	[PATCH 6/9] writeback: introduce smoothed global dirty limit
	[PATCH 7/9] writeback: introduce max-pause and pass-good dirty limits
	[PATCH 8/9] writeback: scale IO chunk size up to half device bandwidth
	[PATCH 9/9] writeback: trace global_dirty_state

They are git pullable from

git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git fs-writeback

Thanks,
Fengguang




* [PATCH 1/9] writeback: make writeback_control.nr_to_write straight
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-30 16:24   ` Jan Kara
  2011-06-29 14:52 ` [PATCH 2/9] writeback: account per-bdi accumulated written pages Wu Fengguang
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-pass-work-to-writeback_sb_inodes.patch --]
[-- Type: text/plain, Size: 18692 bytes --]

Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.

struct writeback_control is basically designed to control writeback of a
single file, but we keep abusing it for writing multiple files in
writeback_sb_inodes() and its callers.

This immediately cleans things up: for example, wbc.nr_to_write vs.
work->nr_pages suddenly starts to make sense, and instead of saving and
restoring pages_skipped in writeback_sb_inodes() it can always start from a
clean zero value.

It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
quota remains in wbc->nr_to_write.

Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 
 fs/fs-writeback.c                |  196 ++++++++++++++++-------------
 include/linux/writeback.h        |    6 
 include/trace/events/writeback.h |   39 ++++-
 mm/backing-dev.c                 |   17 --
 mm/page-writeback.c              |   17 --
 6 files changed, 148 insertions(+), 129 deletions(-)

change set:
- move writeback_control from wb_writeback() down to writeback_sb_inodes()
- change return value of writeback_sb_inodes()/__writeback_inodes_wb()
  to the number of pages and/or inodes written
- move writeback_control.older_than_this to struct wb_writeback_work
- remove writeback_control.inodes_written
- remove wbc_writeback_* trace events for now

--- linux-next.orig/fs/fs-writeback.c	2011-06-19 23:57:41.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-20 00:08:00.000000000 +0800
@@ -30,11 +30,21 @@
 #include "internal.h"
 
 /*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024L
+
+/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
+	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -472,7 +482,6 @@ writeback_single_inode(struct inode *ino
 			 * No need to add it back to the LRU.
 			 */
 			list_del_init(&inode->i_wb_list);
-			wbc->inodes_written++;
 		}
 	}
 	inode_sync_complete(inode);
@@ -506,6 +515,31 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
+static long writeback_chunk_size(struct wb_writeback_work *work)
+{
+	long pages;
+
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_sb_inodes()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
+		pages = LONG_MAX;
+	else
+		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+
+	return pages;
+}
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -513,18 +547,30 @@ static bool pin_sb_for_writeback(struct 
  * inodes. Otherwise write only ones which go sequentially
  * in reverse order.
  *
- * Return 1, if the caller writeback routine should be
- * interrupted. Otherwise return 0.
+ * Return the number of pages and/or inodes written.
  */
-static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
-		struct writeback_control *wbc, bool only_this_sb)
+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
 {
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	unsigned long start_time = jiffies;
+	long write_chunk;
+	long wrote = 0;  /* count both pages and inodes */
+
 	while (!list_empty(&wb->b_io)) {
-		long pages_skipped;
 		struct inode *inode = wb_inode(wb->b_io.prev);
 
 		if (inode->i_sb != sb) {
-			if (only_this_sb) {
+			if (work->sb) {
 				/*
 				 * We only want to write back data for this
 				 * superblock, move all inodes not belonging
@@ -539,7 +585,7 @@ static int writeback_sb_inodes(struct su
 			 * Bounce back to the caller to unpin this and
 			 * pin the next superblock.
 			 */
-			return 0;
+			break;
 		}
 
 		/*
@@ -553,12 +599,18 @@ static int writeback_sb_inodes(struct su
 			requeue_io(inode, wb);
 			continue;
 		}
-
 		__iget(inode);
+		write_chunk = writeback_chunk_size(work);
+		wbc.nr_to_write = write_chunk;
+		wbc.pages_skipped = 0;
+
+		writeback_single_inode(inode, wb, &wbc);
 
-		pages_skipped = wbc->pages_skipped;
-		writeback_single_inode(inode, wb, wbc);
-		if (wbc->pages_skipped != pages_skipped) {
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
+		if (!(inode->i_state & I_DIRTY))
+			wrote++;
+		if (wbc.pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
 			 * buffers.  Skip this inode for now.
@@ -570,17 +622,25 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&wb->list_lock);
-		if (wbc->nr_to_write <= 0)
-			return 1;
+		/*
+		 * bail out to wb_writeback() often enough to check
+		 * background threshold and other termination conditions.
+		 */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
-	/* b_io is empty */
-	return 1;
+	return wrote;
 }
 
-static void __writeback_inodes_wb(struct bdi_writeback *wb,
-				  struct writeback_control *wbc)
+static long __writeback_inodes_wb(struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
-	int ret = 0;
+	unsigned long start_time = jiffies;
+	long wrote = 0;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -590,33 +650,37 @@ static void __writeback_inodes_wb(struct
 			requeue_io(inode, wb);
 			continue;
 		}
-		ret = writeback_sb_inodes(sb, wb, wbc, false);
+		wrote += writeback_sb_inodes(sb, wb, work);
 		drop_super(sb);
 
-		if (ret)
-			break;
+		/* refer to the same tests at the end of writeback_sb_inodes */
+		if (wrote) {
+			if (jiffies - start_time > HZ / 10UL)
+				break;
+			if (work->nr_pages <= 0)
+				break;
+		}
 	}
 	/* Leave any unwritten inodes on b_io */
+	return wrote;
 }
 
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc)
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages)
 {
+	struct wb_writeback_work work = {
+		.nr_pages	= nr_pages,
+		.sync_mode	= WB_SYNC_NONE,
+		.range_cyclic	= 1,
+	};
+
 	spin_lock(&wb->list_lock);
 	if (list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
-	__writeback_inodes_wb(wb, wbc);
+		queue_io(wb, NULL);
+	__writeback_inodes_wb(wb, &work);
 	spin_unlock(&wb->list_lock);
-}
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+	return nr_pages - work.nr_pages;
+}
 
 static inline bool over_bground_thresh(void)
 {
@@ -646,42 +710,13 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
-	struct writeback_control wbc = {
-		.sync_mode		= work->sync_mode,
-		.tagged_writepages	= work->tagged_writepages,
-		.older_than_this	= NULL,
-		.for_kupdate		= work->for_kupdate,
-		.for_background		= work->for_background,
-		.range_cyclic		= work->range_cyclic,
-	};
+	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	long wrote = 0;
-	long write_chunk = MAX_WRITEBACK_PAGES;
 	struct inode *inode;
-
-	if (!wbc.range_cyclic) {
-		wbc.range_start = 0;
-		wbc.range_end = LLONG_MAX;
-	}
-
-	/*
-	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
-	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
-	 * here avoids calling into writeback_inodes_wb() more than once.
-	 *
-	 * The intended call sequence for WB_SYNC_ALL writeback is:
-	 *
-	 *      wb_writeback()
-	 *          writeback_sb_inodes()       <== called only once
-	 *              write_cache_pages()     <== called once for each inode
-	 *                   (quickly) tag currently dirty pages
-	 *                   (maybe slowly) sync all tagged pages
-	 */
-	if (wbc.sync_mode == WB_SYNC_ALL || wbc.tagged_writepages)
-		write_chunk = LONG_MAX;
+	long progress;
 
 	oldest_jif = jiffies;
-	wbc.older_than_this = &oldest_jif;
+	work->older_than_this = &oldest_jif;
 
 	spin_lock(&wb->list_lock);
 	for (;;) {
@@ -711,24 +746,17 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_kupdate) {
 			oldest_jif = jiffies -
 				msecs_to_jiffies(dirty_expire_interval * 10);
-			wbc.older_than_this = &oldest_jif;
+			work->older_than_this = &oldest_jif;
 		}
 
-		wbc.nr_to_write = write_chunk;
-		wbc.pages_skipped = 0;
-		wbc.inodes_written = 0;
-
-		trace_wbc_writeback_start(&wbc, wb->bdi);
+		trace_writeback_start(wb->bdi, work);
 		if (list_empty(&wb->b_io))
-			queue_io(wb, wbc.older_than_this);
+			queue_io(wb, work->older_than_this);
 		if (work->sb)
-			writeback_sb_inodes(work->sb, wb, &wbc, true);
+			progress = writeback_sb_inodes(work->sb, wb, work);
 		else
-			__writeback_inodes_wb(wb, &wbc);
-		trace_wbc_writeback_written(&wbc, wb->bdi);
-
-		work->nr_pages -= write_chunk - wbc.nr_to_write;
-		wrote += write_chunk - wbc.nr_to_write;
+			progress = __writeback_inodes_wb(wb, work);
+		trace_writeback_written(wb->bdi, work);
 
 		/*
 		 * Did we write something? Try for more
@@ -738,9 +766,7 @@ static long wb_writeback(struct bdi_writ
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (wbc.nr_to_write < write_chunk)
-			continue;
-		if (wbc.inodes_written)
+		if (progress)
 			continue;
 		/*
 		 * No more inodes for IO, bail
@@ -753,8 +779,8 @@ static long wb_writeback(struct bdi_writ
 		 * we'll just busyloop.
 		 */
 		if (!list_empty(&wb->b_more_io))  {
+			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
-			trace_wbc_writeback_wait(&wbc, wb->bdi);
 			spin_lock(&inode->i_lock);
 			inode_wait_for_writeback(inode, wb);
 			spin_unlock(&inode->i_lock);
@@ -762,7 +788,7 @@ static long wb_writeback(struct bdi_writ
 	}
 	spin_unlock(&wb->list_lock);
 
-	return wrote;
+	return nr_pages - work->nr_pages;
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-06-19 23:57:41.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-20 00:03:25.000000000 +0800
@@ -24,12 +24,9 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
-	long inodes_written;		/* # of inodes written (at least) */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
@@ -56,8 +53,7 @@ void writeback_inodes_sb_nr(struct super
 int writeback_inodes_sb_if_idle(struct super_block *);
 int writeback_inodes_sb_nr_if_idle(struct super_block *, unsigned long nr);
 void sync_inodes_sb(struct super_block *);
-void writeback_inodes_wb(struct bdi_writeback *wb,
-		struct writeback_control *wbc);
+long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages);
 
--- linux-next.orig/mm/backing-dev.c	2011-06-19 23:57:41.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-20 00:13:48.000000000 +0800
@@ -260,18 +260,6 @@ int bdi_has_dirty_io(struct backing_dev_
 	return wb_has_dirty_io(&bdi->wb);
 }
 
-static void bdi_flush_io(struct backing_dev_info *bdi)
-{
-	struct writeback_control wbc = {
-		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
-		.range_cyclic		= 1,
-		.nr_to_write		= 1024,
-	};
-
-	writeback_inodes_wb(&bdi->wb, &wbc);
-}
-
 /*
  * kupdated() used to do this. We cannot do it from the bdi_forker_thread()
  * or we risk deadlocking on ->s_umount. The longer term solution would be
@@ -457,9 +445,10 @@ static int bdi_forker_thread(void *ptr)
 			if (IS_ERR(task)) {
 				/*
 				 * If thread creation fails, force writeout of
-				 * the bdi from the thread.
+				 * the bdi from the thread. Hopefully 1024 is
+				 * large enough for efficient IO.
 				 */
-				bdi_flush_io(bdi);
+				writeback_inodes_wb(&bdi->wb, 1024);
 			} else {
 				/*
 				 * The spinlock makes sure we do not lose
--- linux-next.orig/mm/page-writeback.c	2011-06-19 23:58:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-20 00:03:25.000000000 +0800
@@ -492,13 +492,6 @@ static void balance_dirty_pages(struct a
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
@@ -560,17 +553,17 @@ static void balance_dirty_pages(struct a
 		 * threshold otherwise wait until the disk writes catch
 		 * up.
 		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
+		trace_balance_dirty_start(bdi);
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
+			pages_written += writeback_inodes_wb(&bdi->wb,
+							     write_chunk);
+			trace_balance_dirty_written(bdi, pages_written);
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		trace_balance_dirty_wait(bdi);
 
 		/*
 		 * Increase the delay for each loop, up to our previous
--- linux-next.orig/include/trace/events/writeback.h	2011-06-19 23:58:12.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-06-19 23:58:23.000000000 +0800
@@ -62,6 +62,9 @@ DEFINE_EVENT(writeback_work_class, name,
 DEFINE_WRITEBACK_WORK_EVENT(writeback_nothread);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
@@ -101,6 +104,30 @@ DEFINE_WRITEBACK_EVENT(writeback_bdi_reg
 DEFINE_WRITEBACK_EVENT(writeback_bdi_unregister);
 DEFINE_WRITEBACK_EVENT(writeback_thread_start);
 DEFINE_WRITEBACK_EVENT(writeback_thread_stop);
+DEFINE_WRITEBACK_EVENT(balance_dirty_start);
+DEFINE_WRITEBACK_EVENT(balance_dirty_wait);
+
+TRACE_EVENT(balance_dirty_written,
+
+	TP_PROTO(struct backing_dev_info *bdi, int written),
+
+	TP_ARGS(bdi, written),
+
+	TP_STRUCT__entry(
+		__array(char,	name, 32)
+		__field(int,	written)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->written = written;
+	),
+
+	TP_printk("bdi %s written %d",
+		  __entry->name,
+		  __entry->written
+	)
+);
 
 DECLARE_EVENT_CLASS(wbc_class,
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -114,7 +141,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -128,14 +154,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -145,7 +169,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
@@ -154,12 +177,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 DEFINE_EVENT(wbc_class, name, \
 	TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi), \
 	TP_ARGS(wbc, bdi))
-DEFINE_WBC_EVENT(wbc_writeback_start);
-DEFINE_WBC_EVENT(wbc_writeback_written);
-DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_queue_io,
--- linux-next.orig/fs/btrfs/extent_io.c	2011-06-19 23:57:40.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2011-06-19 23:58:23.000000000 +0800
@@ -2551,7 +2551,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2584,7 +2583,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,




* [PATCH 2/9] writeback: account per-bdi accumulated written pages
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
  2011-06-29 14:52 ` [PATCH 1/9] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Michael Rubin, Wu Fengguang, LKML

[-- Attachment #1: writeback-bdi-written.patch --]
[-- Type: text/plain, Size: 2376 bytes --]

From: Jan Kara <jack@suse.cz>

Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.

Peter Zijlstra <a.p.zijlstra@chello.nl>:
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().

CC: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |   10 ++++++++--
 mm/page-writeback.c         |    1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h	2011-06-12 13:29:02.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-12 20:48:47.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
--- linux-next.orig/mm/backing-dev.c	2011-06-12 13:29:02.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:49:51.000000000 +0800
@@ -97,6 +97,7 @@ static int bdi_debug_stats_show(struct s
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
 		   "BackgroundThresh: %8lu kB\n"
+		   "BdiWritten:       %8lu kB\n"
 		   "b_dirty:          %8lu\n"
 		   "b_io:             %8lu\n"
 		   "b_more_io:        %8lu\n"
@@ -104,8 +105,13 @@ static int bdi_debug_stats_show(struct s
 		   "state:            %8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
+		   K(bdi_thresh),
+		   K(dirty_thresh),
+		   K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   nr_dirty,
+		   nr_io,
+		   nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
--- linux-next.orig/mm/page-writeback.c	2011-06-12 13:29:06.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-12 20:48:47.000000000 +0800
@@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table
  */
 static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
 {
+	__inc_bdi_stat(bdi, BDI_WRITTEN);
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
 }




* [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
  2011-06-29 14:52 ` [PATCH 1/9] writeback: make writeback_control.nr_to_write straight Wu Fengguang
  2011-06-29 14:52 ` [PATCH 2/9] writeback: account per-bdi accumulated written pages Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-30 19:56   ` Jan Kara
                     ` (4 more replies)
  2011-06-29 14:52 ` [PATCH 4/9] writeback: show bdi write bandwidth in debugfs Wu Fengguang
                   ` (5 subsequent siblings)
  8 siblings, 5 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Li Shaohua, Peter Zijlstra, Wu Fengguang, LKML

[-- Attachment #1: writeback-write-bandwidth.patch --]
[-- Type: text/plain, Size: 9612 bytes --]

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when the disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth reflects how fast the device can write out when
_fully utilized_, and won't drop to 0 when it goes idle. The value remains
constant while the disk is idle. At busy write time, fluctuations aside, it
will also remain high unless knocked down by concurrent reads that compete
with async writes for disk time and bandwidth.

The estimation is not done purely in the flusher because there is no
guarantee that write_cache_pages() will return in time to update the bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however it may be a little biased in the long term.

The overhead is low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
a meaningful instantaneous bandwidth at all, due to large fluctuations.

NFS commits can be as large as seconds worth of data. One XFS
completion may be as large as half a second worth of data if we are going
to increase the write chunk to half a second worth of data. In ext4,
fluctuations with a period of around 5 seconds are observed. And there
is another pattern of irregular periods of up to 20 seconds on SSD tests.

That's why we not only do the estimation at 200ms intervals, but also
average the estimates over a period of 3 seconds and then apply another
level of smoothing in avg_write_bandwidth.
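
To illustrate, here is a minimal user-space model of the two-level smoothing
described above (HZ, the averaging period and the shift values mirror the
patch below; the write rate fed into it is a made-up number, and this sketch
is not the kernel code itself):

#include <stdio.h>

#define HZ	1000UL			/* ticks per second in this model */
#define PERIOD	4096UL			/* ~ roundup_pow_of_two(3 * HZ) */

static unsigned long write_bandwidth     = 25600;  /* pages/s, ~100MB/s @ 4KB pages */
static unsigned long avg_write_bandwidth = 25600;

/* fold one (elapsed ticks, pages written) sample into the running estimate */
static void update_write_bandwidth(unsigned long elapsed, unsigned long written)
{
	unsigned long long bw = (unsigned long long)written * HZ;
	unsigned long old = write_bandwidth;
	unsigned long avg = avg_write_bandwidth;
	unsigned long cur;

	if (elapsed > PERIOD / 2) {	/* clamp unusually long intervals */
		bw /= elapsed;
		elapsed = PERIOD / 2;
		bw *= elapsed;
	}
	bw += (unsigned long long)old * (PERIOD - elapsed);
	cur = bw / PERIOD;

	/* second level: move avg only while cur keeps pulling it the same way */
	if (avg > old && old > cur)
		avg -= (avg - old) >> 3;
	if (avg < old && old < cur)
		avg += (old - avg) >> 3;

	write_bandwidth = cur;
	avg_write_bandwidth = avg;
}

int main(void)
{
	int i;

	/* feed 200ms samples of 3000 written pages (~60MB/s with 4KB pages) */
	for (i = 0; i < 100; i++)
		update_write_bandwidth(HZ / 5, 3000);
	printf("estimate %lu pages/s, smoothed %lu pages/s\n",
	       write_bandwidth, avg_write_bandwidth);
	return 0;
}

Fed a steady ~60MB/s like this, the raw estimate converges to roughly 15000
pages/s within a few tens of seconds, while the smoothed value trails it and
damps any short spikes.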

CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   13 +++++
 include/linux/backing-dev.h |    5 ++
 include/linux/writeback.h   |    3 +
 mm/backing-dev.c            |   12 +++++
 mm/page-writeback.c         |   81 ++++++++++++++++++++++++++++++++++
 5 files changed, 114 insertions(+)

To get an idea of the adaptation speed and fluctuation range, here are
some real examples (check the red dots and the yellow line):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext3-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.22:02/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/btrfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.22:07/balance_dirty_pages-bandwidth.png

old version:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/ext3-1dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-51/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-1dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-10/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-bandwidth.png

--- linux-next.orig/fs/fs-writeback.c	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-29 16:23:05.000000000 +0800
@@ -693,6 +693,16 @@ static inline bool over_bground_thresh(v
 }
 
 /*
+ * Called under wb->list_lock. If there are multiple wb per bdi,
+ * only the flusher working on the first wb should do it.
+ */
+static void wb_update_bandwidth(struct bdi_writeback *wb,
+				unsigned long start_time)
+{
+	__bdi_update_bandwidth(wb->bdi, start_time);
+}
+
+/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -710,6 +720,7 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
+	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
 	struct inode *inode;
@@ -758,6 +769,8 @@ static long wb_writeback(struct bdi_writ
 			progress = __writeback_inodes_wb(wb, work);
 		trace_writeback_written(wb->bdi, work);
 
+		wb_update_bandwidth(wb, wb_start);
+
 		/*
 		 * Did we write something? Try for more
 		 *
--- linux-next.orig/include/linux/backing-dev.h	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-06-29 16:23:05.000000000 +0800
@@ -73,6 +73,11 @@ struct backing_dev_info {
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
+	unsigned long bw_time_stamp;
+	unsigned long written_stamp;
+	unsigned long write_bandwidth;
+	unsigned long avg_write_bandwidth;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/include/linux/writeback.h	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-29 16:23:05.000000000 +0800
@@ -118,6 +118,9 @@ void global_dirty_limits(unsigned long *
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
 
+void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long start_time);
+
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied);
--- linux-next.orig/mm/backing-dev.c	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-29 16:23:06.000000000 +0800
@@ -638,6 +638,11 @@ static void bdi_wb_init(struct bdi_write
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
+/*
+ * Initial write bandwidth: 100 MB/s
+ */
+#define INIT_BW		(100 << (20 - PAGE_SHIFT))
+
 int bdi_init(struct backing_dev_info *bdi)
 {
 	int i, err;
@@ -660,6 +665,13 @@ int bdi_init(struct backing_dev_info *bd
 	}
 
 	bdi->dirty_exceeded = 0;
+
+	bdi->bw_time_stamp = jiffies;
+	bdi->written_stamp = 0;
+
+	bdi->write_bandwidth = INIT_BW;
+	bdi->avg_write_bandwidth = INIT_BW;
+
 	err = prop_local_init_percpu(&bdi->completions);
 
 	if (err) {
--- linux-next.orig/mm/page-writeback.c	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-29 16:23:44.000000000 +0800
@@ -37,6 +37,11 @@
 #include <trace/events/writeback.h>
 
 /*
+ * Sleep at most 200ms at a time in balance_dirty_pages().
+ */
+#define MAX_PAUSE	max(HZ/5, 1)
+
+/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
@@ -472,6 +477,79 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				       unsigned long elapsed,
+				       unsigned long written)
+{
+	const unsigned long period = roundup_pow_of_two(3 * HZ);
+	unsigned long avg = bdi->avg_write_bandwidth;
+	unsigned long old = bdi->write_bandwidth;
+	unsigned long cur;
+	u64 bw;
+
+	bw = written - bdi->written_stamp;
+	bw *= HZ;
+	if (unlikely(elapsed > period / 2)) {
+		do_div(bw, elapsed);
+		elapsed = period / 2;
+		bw *= elapsed;
+	}
+	bw += (u64)bdi->write_bandwidth * (period - elapsed);
+	cur = bw >> ilog2(period);
+
+	/*
+	 * one more level of smoothing, for filtering out sudden spikes
+	 */
+	if (avg > old && old > cur)
+		avg -= (avg - old) >> 3;
+
+	if (avg < old && old < cur)
+		avg += (old - avg) >> 3;
+
+	bdi->write_bandwidth = cur;
+	bdi->avg_write_bandwidth = avg;
+}
+
+void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long start_time)
+{
+	unsigned long now = jiffies;
+	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long written;
+
+	/*
+	 * rate-limit, only update once every 200ms.
+	 */
+	if (elapsed < MAX_PAUSE)
+		return;
+
+	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+
+	/*
+	 * Skip quiet periods when disk bandwidth is under-utilized.
+	 * (at least 1s idle time between two flusher runs)
+	 */
+	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
+		goto snapshot;
+
+	bdi_update_write_bandwidth(bdi, elapsed, written);
+
+snapshot:
+	bdi->written_stamp = written;
+	bdi->bw_time_stamp = now;
+}
+
+static void bdi_update_bandwidth(struct backing_dev_info *bdi,
+				 unsigned long start_time)
+{
+	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
+		return;
+	if (spin_trylock(&bdi->wb.list_lock)) {
+		__bdi_update_bandwidth(bdi, start_time);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -491,6 +569,7 @@ static void balance_dirty_pages(struct a
 	unsigned long pause = 1;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long start_time = jiffies;
 
 	for (;;) {
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
@@ -545,6 +624,8 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
+		bdi_update_bandwidth(bdi, start_time);
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been




* [PATCH 4/9] writeback: show bdi write bandwidth in debugfs
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-29 14:52 ` [PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Theodore Tso, Peter Zijlstra, Wu Fengguang, LKML

[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2005 bytes --]

Add a "BdiWriteBandwidth" entry and indent others in /debug/bdi/*/stats.

Also increase the numeric field width to 10, to keep the possibly
huge BdiWritten number aligned, at least on desktop systems.

Impact: this could break user space tools if they are dumb enough to
depend on the number of white spaces.

CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

--- linux-next.orig/mm/backing-dev.c	2011-06-12 20:57:03.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-06-12 20:58:28.000000000 +0800
@@ -92,23 +92,25 @@ static int bdi_debug_stats_show(struct s
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
-		   "BdiWriteback:     %8lu kB\n"
-		   "BdiReclaimable:   %8lu kB\n"
-		   "BdiDirtyThresh:   %8lu kB\n"
-		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n"
-		   "BdiWritten:       %8lu kB\n"
-		   "b_dirty:          %8lu\n"
-		   "b_io:             %8lu\n"
-		   "b_more_io:        %8lu\n"
-		   "bdi_list:         %8u\n"
-		   "state:            %8lx\n",
+		   "BdiWriteback:       %10lu kB\n"
+		   "BdiReclaimable:     %10lu kB\n"
+		   "BdiDirtyThresh:     %10lu kB\n"
+		   "DirtyThresh:        %10lu kB\n"
+		   "BackgroundThresh:   %10lu kB\n"
+		   "BdiWritten:         %10lu kB\n"
+		   "BdiWriteBandwidth:  %10lu kBps\n"
+		   "b_dirty:            %10lu\n"
+		   "b_io:               %10lu\n"
+		   "b_more_io:          %10lu\n"
+		   "bdi_list:           %10u\n"
+		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
 		   K(bdi_thresh),
 		   K(dirty_thresh),
 		   K(background_thresh),
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   (unsigned long) K(bdi->write_bandwidth),
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,




* [PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages()
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-06-29 14:52 ` [PATCH 4/9] writeback: show bdi write bandwidth in debugfs Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-30 17:26   ` Jan Kara
  2011-06-29 14:52 ` [PATCH 6/9] writeback: introduce smoothed global dirty limit Wu Fengguang
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 2684 bytes --]

Introduce

	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS

in order to simplify many tests in the following patches.

balance_dirty_pages() will eventually care only about the dirty sums
besides nr_writeback.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-06-20 00:16:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-20 00:17:06.000000000 +0800
@@ -560,8 +560,9 @@ static void bdi_update_bandwidth(struct 
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long write_chunk)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+	unsigned long nr_reclaimable, bdi_nr_reclaimable;
+	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
+	unsigned long bdi_dirty;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
@@ -574,7 +575,7 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -583,8 +584,7 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -602,10 +602,12 @@ static void balance_dirty_pages(struct a
 		 */
 		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
 			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_nr_reclaimable +
+				    bdi_stat_sum(bdi, BDI_WRITEBACK);
 		} else {
 			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+			bdi_dirty = bdi_nr_reclaimable +
+				    bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
 		/*
@@ -614,9 +616,8 @@ static void balance_dirty_pages(struct a
 		 * bdi or process from holding back light ones; The latter is
 		 * the last resort safeguard.
 		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
+				  (nr_dirty > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;




* [PATCH 6/9] writeback: introduce smoothed global dirty limit
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
                   ` (4 preceding siblings ...)
  2011-06-29 14:52 ` [PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-07-01 15:20   ` Andrea Righi
  2011-06-29 14:52 ` [PATCH 7/9] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-dirty-thresh-limit.patch --]
[-- Type: text/plain, Size: 6255 bytes --]

The start of a heavyweight application (e.g. KVM) may instantly knock
down determine_dirtyable_memory() if swap is not enabled or is full.
global_dirty_limits() and bdi_dirty_limit() will in turn compute global/bdi
dirty thresholds that are _much_ lower than the global/bdi dirty pages.

balance_dirty_pages() will then heavily throttle all dirtiers including
the light ones, until the dirty pages drop below the new dirty thresholds.
During this _deep_ dirty-exceeded state, the system may appear rather
unresponsive to the users.

About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
threshold to heavy dirtiers than light ones, and the dirty pages will
be throttled around the heavy dirtiers' dirty threshold and reasonably
below the light dirtiers' dirty threshold. In this state, only the heavy
dirtiers will be throttled and the dirty pages are carefully controlled
to not exceed the light dirtiers' dirty threshold. However if the
threshold itself suddenly drops below the number of dirty pages, the
light dirtiers will get heavily throttled.

So introduce global_dirty_limit for tracking the global dirty threshold
with policies

- follow downwards slowly
- follow up in one shot

global_dirty_limit can effectively mask out the impact of a sudden drop in
dirtyable memory. It will be used in the next patch for two new types of
dirty limits. Note that the new dirty limits will not avoid throttling the
light dirtiers, but they can limit their sleep time to 200ms.
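
As an illustration, here is a minimal user-space sketch of that tracking
policy (the >>5 step mirrors update_dirty_limit() in the patch below; the
threshold and dirty page numbers are made up):

#include <stdio.h>

static unsigned long dirty_limit;	/* tracks the dirty threshold */

/* follow a rising threshold in one shot, a falling one slowly */
static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
{
	unsigned long limit = dirty_limit;

	if (limit < thresh) {			/* follow up in one shot */
		dirty_limit = thresh;
		return;
	}

	if (thresh < dirty)			/* never track below the dirty pages */
		thresh = dirty;
	if (limit > thresh)			/* follow down slowly */
		dirty_limit = limit - ((limit - thresh) >> 5);
}

int main(void)
{
	int i;

	update_dirty_limit(100000, 20000);	/* threshold ramps up: jump to it */
	printf("limit = %lu\n", dirty_limit);	/* 100000 */

	/*
	 * A KVM-like startup knocks the threshold down to 30000 pages while
	 * 60000 pages are already dirty: decay slowly towards max(30000, 60000).
	 */
	for (i = 0; i < 10; i++)
		update_dirty_limit(30000, 60000);
	printf("limit = %lu\n", dirty_limit);	/* drifting towards 60000 */
	return 0;
}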

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 -
 include/linux/writeback.h |    5 ++
 mm/page-writeback.c       |   72 +++++++++++++++++++++++++++++++++++-
 3 files changed, 76 insertions(+), 3 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2011-06-20 00:23:59.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-22 20:43:13.000000000 +0800
@@ -84,6 +84,8 @@ static inline void laptop_sync_completio
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 
+extern unsigned long global_dirty_limit;
+
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
 extern unsigned long dirty_background_bytes;
@@ -119,6 +121,9 @@ unsigned long bdi_dirty_limit(struct bac
 			       unsigned long dirty);
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long thresh,
+			    unsigned long dirty,
+			    unsigned long bdi_dirty,
 			    unsigned long start_time);
 
 void page_writeback_init(void);
--- linux-next.orig/mm/page-writeback.c	2011-06-20 00:23:59.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-22 22:06:29.000000000 +0800
@@ -116,6 +116,7 @@ EXPORT_SYMBOL(laptop_mode);
 
 /* End of sysctl-exported parameters */
 
+unsigned long global_dirty_limit;
 
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
@@ -510,7 +511,66 @@ static void bdi_update_write_bandwidth(s
 	bdi->avg_write_bandwidth = avg;
 }
 
+/*
+ * The global dirtyable memory and dirty threshold could be suddenly knocked
+ * down by a large amount (eg. on the startup of KVM in a swapless system).
+ * This may throw the system into deep dirty exceeded state and throttle
+ * heavy/light dirtiers alike. To retain good responsiveness, maintain
+ * global_dirty_limit for tracking slowly down to the knocked down dirty
+ * threshold.
+ */
+static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
+{
+	unsigned long limit = global_dirty_limit;
+
+	/*
+	 * Follow up in one step.
+	 */
+	if (limit < thresh) {
+		limit = thresh;
+		goto update;
+	}
+
+	/*
+	 * Follow down slowly. Use the higher one as the target, because thresh
+	 * may drop below dirty. This is exactly the reason to introduce
+	 * global_dirty_limit which is guaranteed to lie above the dirty pages.
+	 */
+	thresh = max(thresh, dirty);
+	if (limit > thresh) {
+		limit -= (limit - thresh) >> 5;
+		goto update;
+	}
+	return;
+update:
+	global_dirty_limit = limit;
+}
+
+static void global_update_bandwidth(unsigned long thresh,
+				    unsigned long dirty,
+				    unsigned long now)
+{
+	static DEFINE_SPINLOCK(dirty_lock);
+	static unsigned long update_time;
+
+	/*
+	 * Do a lockless check first to optimize away locking for most time.
+	 */
+	if (now - update_time < MAX_PAUSE)
+		return;
+
+	spin_lock(&dirty_lock);
+	if (now - update_time >= MAX_PAUSE) {
+		update_dirty_limit(thresh, dirty);
+		update_time = now;
+	}
+	spin_unlock(&dirty_lock);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long thresh,
+			    unsigned long dirty,
+			    unsigned long bdi_dirty,
 			    unsigned long start_time)
 {
 	unsigned long now = jiffies;
@@ -532,6 +592,9 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
 
+	if (thresh)
+		global_update_bandwidth(thresh, dirty, now);
+
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 
 snapshot:
@@ -540,12 +603,16 @@ snapshot:
 }
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
+				 unsigned long thresh,
+				 unsigned long dirty,
+				 unsigned long bdi_dirty,
 				 unsigned long start_time)
 {
 	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
 		return;
 	if (spin_trylock(&bdi->wb.list_lock)) {
-		__bdi_update_bandwidth(bdi, start_time);
+		__bdi_update_bandwidth(bdi, thresh, dirty, bdi_dirty,
+				       start_time);
 		spin_unlock(&bdi->wb.list_lock);
 	}
 }
@@ -625,7 +692,8 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-06-20 00:23:59.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-22 20:43:12.000000000 +0800
@@ -699,7 +699,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, start_time);
 }
 
 /*




* [PATCH 7/9] writeback: introduce max-pause and pass-good dirty limits
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
                   ` (5 preceding siblings ...)
  2011-06-29 14:52 ` [PATCH 6/9] writeback: introduce smoothed global dirty limit Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-29 14:52 ` [PATCH 8/9] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
  2011-06-29 14:52 ` [PATCH 9/9] writeback: trace global_dirty_state Wu Fengguang
  8 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-dirty-limits --]
[-- Type: text/plain, Size: 5130 bytes --]

The max-pause limit helps to keep the sleep time inside
balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means a
per-task rate limit of 8 pages/200ms = 160KB/s when dirty exceeded, which
is normally enough to stop dirtiers from continuing to push the dirty
pages high, unless there is a sufficiently large number of slow dirtiers
(eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the
write bandwidth of a slow disk and hence accumulating more and more dirty
pages).

The pass-good limit helps to let go of the good bdi's in the presence of
a blocked bdi (e.g. an NFS server not responding) or a slow USB disk which
for some reason builds up a large number of initial dirty pages that refuse
to go away any time soon.

For example, given two bdi's A and B and the initial state

	bdi_thresh_A = dirty_thresh / 2
	bdi_thresh_B = dirty_thresh / 2
	bdi_dirty_A  = dirty_thresh / 2
	bdi_dirty_B  = dirty_thresh / 2

Then A gets blocked; after a dozen seconds

	bdi_thresh_A = 0
	bdi_thresh_B = dirty_thresh
	bdi_dirty_A  = dirty_thresh / 2
	bdi_dirty_B  = dirty_thresh / 2

The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
will effectively be throttled by the condition (nr_dirty < dirty_thresh).
This has two problems:
(1) we lose the protection for light dirtiers;
(2) balance_dirty_pages() effectively becomes IO-less because the
    (bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
    for IO, but balance_dirty_pages() loses an important way to break
    out of the loop, which leads to more spread-out throttle delays.

DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is
that DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
example, while this patch uses the more conservative value 8 so as not to
surprise people with more dirty pages than expected.

The max-pause limit won't noticeably impact the speed at which dirty pages
are knocked down when there is a sudden drop of the global/bdi dirty
thresholds, because the heavy dirtiers will be throttled below 160KB/s,
which is slow enough. It does help to avoid long dirty throttle delays and
especially makes light dirtiers more responsive.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |   21 +++++++++++++++++++++
 mm/page-writeback.c       |   28 ++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-23 10:31:40.000000000 +0800
@@ -7,6 +7,27 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+/*
+ * The 1/16 region above the global dirty limit will be put to maximum pauses:
+ *
+ *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
+ *
+ * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
+ * to loops:
+ *
+ *	(limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
+ *
+ * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
+ * time) for the dirty pages to drop, unless written enough pages.
+ *
+ * The global dirty threshold is normally equal to the global dirty limit,
+ * except when the system suddenly allocates a lot of anonymous memory and
+ * knocks down the global dirty threshold quickly, in which case the global
+ * dirty limit will follow down slowly to prevent livelocking all dirtier tasks.
+ */
+#define DIRTY_MAXPAUSE_AREA		16
+#define DIRTY_PASSGOOD_AREA		8
+
 struct backing_dev_info;
 
 /*
--- linux-next.orig/mm/page-writeback.c	2011-06-23 10:31:40.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-23 10:59:47.000000000 +0800
@@ -399,6 +399,11 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long hard_dirty_limit(unsigned long thresh)
+{
+	return max(thresh, global_dirty_limit);
+}
+
 /*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
@@ -716,6 +721,29 @@ static void balance_dirty_pages(struct a
 		io_schedule_timeout(pause);
 		trace_balance_dirty_wait(bdi);
 
+		dirty_thresh = hard_dirty_limit(dirty_thresh);
+		/*
+		 * max-pause area. If dirty exceeded but still within this
+		 * area, no need to sleep for more than 200ms: (a) 8 pages per
+		 * 200ms is typically more than enough to curb heavy dirtiers;
+		 * (b) the pause time limit makes the dirtiers more responsive.
+		 */
+		if (nr_dirty < dirty_thresh +
+			       dirty_thresh / DIRTY_MAXPAUSE_AREA &&
+		    time_after(jiffies, start_time + MAX_PAUSE))
+			break;
+		/*
+		 * pass-good area. When some bdi gets blocked (eg. NFS server
+		 * not responding), or write bandwidth dropped dramatically due
+		 * to concurrent reads, or dirty threshold suddenly dropped and
+		 * the dirty pages cannot be brought down anytime soon (eg. on
+		 * slow USB stick), at least let go of the good bdi's.
+		 */
+		if (nr_dirty < dirty_thresh +
+			       dirty_thresh / DIRTY_PASSGOOD_AREA &&
+		    bdi_dirty < bdi_thresh)
+			break;
+
 		/*
 		 * Increase the delay for each loop, up to our previous
 		 * default of taking a 100ms nap.




* [PATCH 8/9] writeback: scale IO chunk size up to half device bandwidth
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
                   ` (6 preceding siblings ...)
  2011-06-29 14:52 ` [PATCH 7/9] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-06-29 14:52 ` [PATCH 9/9] writeback: trace global_dirty_state Wu Fengguang
  8 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Theodore Tso, Chris Mason, Peter Zijlstra, Wu Fengguang, LKML

[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 4185 bytes --]

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.					    -- Theodore Ts'o

So remove the MAX_WRITEBACK_PAGES constraint. The writeback chunk size
will scale up to as much as the storage device can write within 500ms.

XFS is observed to do IO completions in batches, and the batch size is
equal to the write chunk size. To avoid dirty pages suddenly dropping out
of balance_dirty_pages()'s dirty control scope and creating large
fluctuations, the chunk size is also limited to half the control scope.

The balance_dirty_pages() control scope is

	[(background_thresh + dirty_thresh) / 2, dirty_thresh]

which is by default [15%, 20%] of global dirty pages, whose range size
is dirty_thresh / DIRTY_FULL_SCOPE.

The adaptive write chunk size will be rounded to the nearest 4MB
boundary.

http://bugzilla.kernel.org/show_bug.cgi?id=13930
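
As a rough worked example of the chunk size computation above (assuming
4KB pages; the 60MB/s device bandwidth and 1GB dirty limit below are
made-up numbers), a minimal user-space sketch:

#include <stdio.h>

#define PAGE_SIZE_KB		4
#define MIN_WRITEBACK_PAGES	(4096 / PAGE_SIZE_KB)	/* 4MB worth of pages */
#define DIRTY_SCOPE		8

static long round_down(long x, long y) { return x - (x % y); }

int main(void)
{
	long avg_write_bandwidth = 60 * 1024 / PAGE_SIZE_KB;	/* ~60MB/s in pages/s */
	long global_dirty_limit  = 1024 * 1024 / PAGE_SIZE_KB;	/* ~1GB in pages */
	long nr_pages = 1L << 30;				/* plenty of work queued */
	long pages;

	/* at most half a second of device bandwidth, at most 1/8 of the
	 * global dirty limit, and no more than the remaining work */
	pages = avg_write_bandwidth / 2;
	if (pages > global_dirty_limit / DIRTY_SCOPE)
		pages = global_dirty_limit / DIRTY_SCOPE;
	if (pages > nr_pages)
		pages = nr_pages;
	/* rounded to a 4MB boundary, as in writeback_chunk_size() in the patch below */
	pages = round_down(pages + MIN_WRITEBACK_PAGES, MIN_WRITEBACK_PAGES);

	printf("write chunk: %ld pages (%ld MB)\n",
	       pages, pages * PAGE_SIZE_KB / 1024);
	return 0;
}

With these numbers it prints an 8192-page (32MB) chunk: half a second of a
60MB/s device, rounded up to the next 4MB boundary.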

CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   23 ++++++++++-------------
 include/linux/writeback.h |   11 +++++++++++
 2 files changed, 21 insertions(+), 13 deletions(-)

--- linux-next.orig/include/linux/writeback.h	2011-06-22 22:15:36.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-06-23 10:08:24.000000000 +0800
@@ -8,6 +8,10 @@
 #include <linux/fs.h>
 
 /*
+ * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
+ *
+ *	(thresh - thresh/DIRTY_FULL_SCOPE, thresh)
+ *
  * The 1/16 region above the global dirty limit will be put to maximum pauses:
  *
  *	(limit, limit + limit/DIRTY_MAXPAUSE_AREA)
@@ -25,9 +29,16 @@
  * knocks down the global dirty threshold quickly, in which case the global
  * dirty limit will follow down slowly to prevent livelocking all dirtier tasks.
  */
+#define DIRTY_SCOPE		8
+#define DIRTY_FULL_SCOPE	(DIRTY_SCOPE / 2)
 #define DIRTY_MAXPAUSE_AREA		16
 #define DIRTY_PASSGOOD_AREA		8
 
+/*
+ * 4MB minimal write chunk size
+ */
+#define MIN_WRITEBACK_PAGES	(4096UL >> (PAGE_CACHE_SHIFT - 10))
+
 struct backing_dev_info;
 
 /*
--- linux-next.orig/fs/fs-writeback.c	2011-06-22 20:43:12.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-06-23 10:08:01.000000000 +0800
@@ -30,15 +30,6 @@
 #include "internal.h"
 
 /*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024L
-
-/*
  * Passed into wb_writeback(), essentially a subset of writeback_control
  */
 struct wb_writeback_work {
@@ -515,7 +506,8 @@ static bool pin_sb_for_writeback(struct 
 	return false;
 }
 
-static long writeback_chunk_size(struct wb_writeback_work *work)
+static long writeback_chunk_size(struct backing_dev_info *bdi,
+				 struct wb_writeback_work *work)
 {
 	long pages;
 
@@ -534,8 +526,13 @@ static long writeback_chunk_size(struct 
 	 */
 	if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
 		pages = LONG_MAX;
-	else
-		pages = min(MAX_WRITEBACK_PAGES, work->nr_pages);
+	else {
+		pages = min(bdi->avg_write_bandwidth / 2,
+			    global_dirty_limit / DIRTY_SCOPE);
+		pages = min(pages, work->nr_pages);
+		pages = round_down(pages + MIN_WRITEBACK_PAGES,
+				   MIN_WRITEBACK_PAGES);
+	}
 
 	return pages;
 }
@@ -600,7 +597,7 @@ static long writeback_sb_inodes(struct s
 			continue;
 		}
 		__iget(inode);
-		write_chunk = writeback_chunk_size(work);
+		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 9/9] writeback: trace global_dirty_state
  2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
                   ` (7 preceding siblings ...)
  2011-06-29 14:52 ` [PATCH 8/9] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
@ 2011-06-29 14:52 ` Wu Fengguang
  2011-07-01 15:18   ` Christoph Hellwig
  8 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-06-29 14:52 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

[-- Attachment #1: writeback-trace-global-dirty-states.patch --]
[-- Type: text/plain, Size: 2467 bytes --]

Add trace event global_dirty_state for showing the global dirty page
counts and thresholds at each global_dirty_limits() invocation.  This
will cover the callers throttle_vm_writeout(), over_bground_thresh()
and each balance_dirty_pages() loop.
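
For reference, the event can then be inspected through the usual tracing
debugfs interface (a sketch, assuming debugfs is mounted at
/sys/kernel/debug; the values in the sample line are made up):

	# echo 1 > /sys/kernel/debug/tracing/events/writeback/global_dirty_state/enable
	# cat /sys/kernel/debug/tracing/trace_pipe
	... global_dirty_state: dirty=3112 writeback=512 unstable=0 bg_thresh=24292 thresh=48584 limit=48584 dirtied=123456 written=120344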

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/trace/events/writeback.h |   46 +++++++++++++++++++++++++++++
 mm/page-writeback.c              |    1 
 2 files changed, 47 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-06-29 22:29:34.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-06-29 22:29:34.000000000 +0800
@@ -442,6 +442,7 @@ void global_dirty_limits(unsigned long *
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+	trace_global_dirty_state(background, dirty);
 }
 
 /**
--- linux-next.orig/include/trace/events/writeback.h	2011-06-29 22:29:34.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-06-29 22:33:30.000000000 +0800
@@ -204,6 +204,52 @@ TRACE_EVENT(writeback_queue_io,
 		__entry->moved)
 );
 
+TRACE_EVENT(global_dirty_state,
+
+	TP_PROTO(unsigned long background_thresh,
+		 unsigned long dirty_thresh
+	),
+
+	TP_ARGS(background_thresh,
+		dirty_thresh
+	),
+
+	TP_STRUCT__entry(
+		__field(unsigned long,	nr_dirty)
+		__field(unsigned long,	nr_writeback)
+		__field(unsigned long,	nr_unstable)
+		__field(unsigned long,	background_thresh)
+		__field(unsigned long,	dirty_thresh)
+		__field(unsigned long,	dirty_limit)
+		__field(unsigned long,	nr_dirtied)
+		__field(unsigned long,	nr_written)
+	),
+
+	TP_fast_assign(
+		__entry->nr_dirty	= global_page_state(NR_FILE_DIRTY);
+		__entry->nr_writeback	= global_page_state(NR_WRITEBACK);
+		__entry->nr_unstable	= global_page_state(NR_UNSTABLE_NFS);
+		__entry->nr_dirtied	= global_page_state(NR_DIRTIED);
+		__entry->nr_written	= global_page_state(NR_WRITTEN);
+		__entry->background_thresh = background_thresh;
+		__entry->dirty_thresh	= dirty_thresh;
+		__entry->dirty_limit = global_dirty_limit;
+	),
+
+	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+		  "bg_thresh=%lu thresh=%lu limit=%lu "
+		  "dirtied=%lu written=%lu",
+		  __entry->nr_dirty,
+		  __entry->nr_writeback,
+		  __entry->nr_unstable,
+		  __entry->background_thresh,
+		  __entry->dirty_thresh,
+		  __entry->dirty_limit,
+		  __entry->nr_dirtied,
+		  __entry->nr_written
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/9] writeback: make writeback_control.nr_to_write straight
  2011-06-29 14:52 ` [PATCH 1/9] writeback: make writeback_control.nr_to_write straight Wu Fengguang
@ 2011-06-30 16:24   ` Jan Kara
  2011-07-01 12:03     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-06-30 16:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Wed 29-06-11 22:52:46, Wu Fengguang wrote:
> Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
> and initialize the struct writeback_control there.
> 
> struct writeback_control is basically designed to control writeback of a
> single file, but we keep abuse it for writing multiple files in
> writeback_sb_inodes() and its callers.
> 
> It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
> work->nr_pages starts to make sense, and instead of saving and restoring
> pages_skipped in writeback_sb_inodes it can always start with a clean
> zero value.
> 
> It also makes a neat IO pattern change: large dirty files are now
> written in the full 4MB writeback chunk size, rather than whatever
> remained quota in wbc->nr_to_write.
> 
> Acked-by: Jan Kara <jack@suse.cz>
> Proposed-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  Just one minor nit:

> @@ -570,17 +622,25 @@ static int writeback_sb_inodes(struct su
>  		iput(inode);
>  		cond_resched();
>  		spin_lock(&wb->list_lock);
> -		if (wbc->nr_to_write <= 0)
> -			return 1;
> +		/*
> +		 * bail out to wb_writeback() often enough to check
> +		 * background threshold and other termination conditions.
> +		 */
> +		if (wrote) {
> +			if (jiffies - start_time > HZ / 10UL)
> +				break;
  I guess this comparison should use time_before() macro - or maybe even
time_is_before_jiffies().

> +			if (work->nr_pages <= 0)
> +				break;
> +		}
>  	}
> -	/* b_io is empty */
> -	return 1;
> +	return wrote;
>  }
>  
> -static void __writeback_inodes_wb(struct bdi_writeback *wb,
> -				  struct writeback_control *wbc)
> +static long __writeback_inodes_wb(struct bdi_writeback *wb,
> +				  struct wb_writeback_work *work)
>  {
> -	int ret = 0;
> +	unsigned long start_time = jiffies;
> +	long wrote = 0;
>  
>  	while (!list_empty(&wb->b_io)) {
>  		struct inode *inode = wb_inode(wb->b_io.prev);
> @@ -590,33 +650,37 @@ static void __writeback_inodes_wb(struct
>  			requeue_io(inode, wb);
>  			continue;
>  		}
> -		ret = writeback_sb_inodes(sb, wb, wbc, false);
> +		wrote += writeback_sb_inodes(sb, wb, work);
>  		drop_super(sb);
>  
> -		if (ret)
> -			break;
> +		/* refer to the same tests at the end of writeback_sb_inodes */
> +		if (wrote) {
> +			if (jiffies - start_time > HZ / 10UL)
> +				break;
  And the same here.

> +			if (work->nr_pages <= 0)
> +				break;
> +		}
>  	}
>  	/* Leave any unwritten inodes on b_io */
> +	return wrote;
>  }
>  

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages()
  2011-06-29 14:52 ` [PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
@ 2011-06-30 17:26   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2011-06-30 17:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Wed 29-06-11 22:52:50, Wu Fengguang wrote:
> Introduce
> 
> 	nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS
> 
> in order to simplify many tests in the following patches.
> 
> balance_dirty_pages() will eventually care only about the dirty sums
> besides nr_writeback.
  Looks OK.
Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-06-20 00:16:58.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-20 00:17:06.000000000 +0800
> @@ -560,8 +560,9 @@ static void bdi_update_bandwidth(struct 
>  static void balance_dirty_pages(struct address_space *mapping,
>  				unsigned long write_chunk)
>  {
> -	long nr_reclaimable, bdi_nr_reclaimable;
> -	long nr_writeback, bdi_nr_writeback;
> +	unsigned long nr_reclaimable, bdi_nr_reclaimable;
> +	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
> +	unsigned long bdi_dirty;
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> @@ -574,7 +575,7 @@ static void balance_dirty_pages(struct a
>  	for (;;) {
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
>  
>  		global_dirty_limits(&background_thresh, &dirty_thresh);
>  
> @@ -583,8 +584,7 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_reclaimable + nr_writeback <=
> -				(background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -602,10 +602,12 @@ static void balance_dirty_pages(struct a
>  		 */
>  		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
>  			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> +			bdi_dirty = bdi_nr_reclaimable +
> +				    bdi_stat_sum(bdi, BDI_WRITEBACK);
>  		} else {
>  			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +			bdi_dirty = bdi_nr_reclaimable +
> +				    bdi_stat(bdi, BDI_WRITEBACK);
>  		}
>  
>  		/*
> @@ -614,9 +616,8 @@ static void balance_dirty_pages(struct a
>  		 * bdi or process from holding back light ones; The latter is
>  		 * the last resort safeguard.
>  		 */
> -		dirty_exceeded =
> -			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
> -			|| (nr_reclaimable + nr_writeback > dirty_thresh);
> +		dirty_exceeded = (bdi_dirty > bdi_thresh) ||
> +				  (nr_dirty > dirty_thresh);
>  
>  		if (!dirty_exceeded)
>  			break;
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2011-06-30 19:56   ` Jan Kara
  2011-07-01 14:58     ` Wu Fengguang
  2011-07-01 15:20   ` Andrea Righi
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-06-30 19:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li Shaohua, Peter Zijlstra, LKML

  Hello,

On Wed 29-06-11 22:52:48, Wu Fengguang wrote:
> The estimation value will start from 100MB/s and adapt to the real
> bandwidth in seconds.
> 
> It tries to update the bandwidth only when disk is fully utilized.
> Any inactive period of more than one second will be skipped.
> 
> The estimated bandwidth will be reflecting how fast the device can
> writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> The value will remain constant at disk idle time. At busy write time, if
> not considering fluctuations, it will also remain high unless be knocked
> down by possible concurrent reads that compete for the disk time and
> bandwidth with async writes.
> 
> The estimation is not done purely in the flusher because there is no
> guarantee for write_cache_pages() to return timely to update bandwidth.
> 
> The bdi->avg_write_bandwidth smoothing is very effective for filtering
> out sudden spikes, however may be a little biased in long term.
> 
> The overheads are low because the bdi bandwidth update only occurs at
> 200ms intervals.
> 
> The 200ms update interval is suitable, becuase it's not possible to get
> the real bandwidth for the instance at all, due to large fluctuations.
> 
> The NFS commits can be as large as seconds worth of data. One XFS
> completion may be as large as half second worth of data if we are going
> to increase the write chunk to half second worth of data. In ext4,
> fluctuations with time period of around 5 seconds is observed. And there
> is another pattern of irregular periods of up to 20 seconds on SSD tests.
> 
> That's why we are not only doing the estimation at 200ms intervals, but
> also averaging them over a period of 3 seconds and then go further to do
> another level of smoothing in avg_write_bandwidth.
  I was thinking about your formulas and also observing how it behaves when
writeback happens while the disk is loaded with other load as well (e.g.
grep -r of a large tree or cp from another partition).

I agree that some kind of averaging is needed. But how we average depends
on what we need the number for. My thoughts were that there is no such
thing as *the* write bandwidth, since that depends on the background load
on the disk and also the type of writes we do (sequential, random), as you noted
as well. What writeback needs to estimate in fact is "how fast can we write
this bulk of data?". Since we should ultimately size dirty limits and other
writeback tunables so that the bulk of data can be written in order of
seconds (but maybe much slower because of background load) I agree with
your first choice that we should measure written pages in a window of
several seconds - so your choice of 3 seconds is OK - but this number
should have a reasoning attached to it in a comment (something like my
elaboration above ;)

Now to your second level of smoothing - is it really useful? We already
average over several second window so that should really eliminate big
spikes coming from bumpy IO completion from storage (NFS might be a
separate question but we talked about that already). Your second level
of smoothing just spreads the window even further but if you find that
necessary, why not make the window larger in the first place? Also my
observations of avg-bandwidth and bandwidth when there's some background
read load (e.g. from grep) show that in this case both bandwidth and
avg-bandwidth fluctuate +-10%. Finally, as I reasoned above, we are
interested in how much we can write in coming say 2 seconds so I don't see
a big point smoothing the value too much (I'd even say it's undesirable).
What do you think?

Some code related comments are below:

> --- linux-next.orig/fs/fs-writeback.c	2011-06-23 10:31:40.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-06-29 16:23:05.000000000 +0800
> @@ -693,6 +693,16 @@ static inline bool over_bground_thresh(v
>  }
>  
>  /*
> + * Called under wb->list_lock. If there are multiple wb per bdi,
> + * only the flusher working on the first wb should do it.
> + */
> +static void wb_update_bandwidth(struct bdi_writeback *wb,
> +				unsigned long start_time)
> +{
> +	__bdi_update_bandwidth(wb->bdi, start_time);
> +}
> +
> +/*
>   * Explicit flushing or periodic writeback of "old" data.
>   *
>   * Define "old": the first time one of an inode's pages is dirtied, we mark the
> @@ -710,6 +720,7 @@ static inline bool over_bground_thresh(v
>  static long wb_writeback(struct bdi_writeback *wb,
>  			 struct wb_writeback_work *work)
>  {
> +	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
>  	struct inode *inode;
> @@ -758,6 +769,8 @@ static long wb_writeback(struct bdi_writ
>  			progress = __writeback_inodes_wb(wb, work);
>  		trace_writeback_written(wb->bdi, work);
>  
> +		wb_update_bandwidth(wb, wb_start);
> +
>  		/*
>  		 * Did we write something? Try for more
>  		 *
> --- linux-next.orig/include/linux/backing-dev.h	2011-06-23 10:31:40.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2011-06-29 16:23:05.000000000 +0800
> @@ -73,6 +73,11 @@ struct backing_dev_info {
>  
>  	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
>  
> +	unsigned long bw_time_stamp;
> +	unsigned long written_stamp;
> +	unsigned long write_bandwidth;
> +	unsigned long avg_write_bandwidth;
> +
  Some comments on what these are would be useful.

>  	struct prop_local_percpu completions;
>  	int dirty_exceeded;
>  
> --- linux-next.orig/mm/page-writeback.c	2011-06-23 10:31:40.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-06-29 16:23:44.000000000 +0800
> @@ -37,6 +37,11 @@
>  #include <trace/events/writeback.h>
>  
>  /*
> + * Sleep at most 200ms at a time in balance_dirty_pages().
> + */
> +#define MAX_PAUSE	max(HZ/5, 1)
  You use this for ratelimiting updates of bdi bandwidth. So we should
really have a separate (and properly commented) constant for it.

> +static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> +				       unsigned long elapsed,
> +				       unsigned long written)
> +{
> +	const unsigned long period = roundup_pow_of_two(3 * HZ);
> +	unsigned long avg = bdi->avg_write_bandwidth;
> +	unsigned long old = bdi->write_bandwidth;
> +	unsigned long cur;
> +	u64 bw;
> +
> +	bw = written - bdi->written_stamp;
> +	bw *= HZ;
> +	if (unlikely(elapsed > period / 2)) {
> +		do_div(bw, elapsed);
> +		elapsed = period / 2;
> +		bw *= elapsed;
> +	}
  What is the rationale behind this condition? Why can't we just use
written/elapsed when we have a long enough period? Have you observed some
problems with it (I estimate bandwidth in my patches like:
static void update_bdi_throughput(struct backing_dev_info *bdi,
                                unsigned long written, unsigned long time)
{
       unsigned int deltams = jiffies_to_msecs(time - bdi->start_jiffies);

       written -= bdi->start_written;
       if (deltams > WINDOW_MS) {
               /* Add 1 to avoid 0 result */
               bdi->pages_per_s = 1 + ((u64)written) * MSEC_PER_SEC / deltams;
               return;
       }
       bdi->pages_per_s = 1 +
               (((u64)bdi->pages_per_s) * (WINDOW_MS - deltams) +
                ((u64)written) * MSEC_PER_SEC) / WINDOW_MS;
}

and I didn't observe any problems with it. Plus you need some comment
explaining how you compute the throughput. I have something like:
/*
 * When the throughput is computed, we consider an imaginary WINDOW_MS
 * milliseconds long window. In this window, we know that it took 'deltams'
 * milliseconds to write 'written' pages and for the rest of the window we
 * assume number of pages corresponding to the throughput we previously
 * computed to have been written. Thus we obtain total number of pages
 * written in the imaginary window and from it new throughput.
 */

> +	bw += (u64)bdi->write_bandwidth * (period - elapsed);
> +	cur = bw >> ilog2(period);
> +
> +	/*
> +	 * one more level of smoothing, for filtering out sudden spikes
> +	 */
> +	if (avg > old && old > cur)
> +		avg -= (avg - old) >> 3;
> +
> +	if (avg < old && old < cur)
> +		avg += (old - avg) >> 3;
> +
> +	bdi->write_bandwidth = cur;
> +	bdi->avg_write_bandwidth = avg;
> +}
> +
> +void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> +			    unsigned long start_time)
> +{
> +	unsigned long now = jiffies;
> +	unsigned long elapsed = now - bdi->bw_time_stamp;
> +	unsigned long written;
> +
> +	/*
> +	 * rate-limit, only update once every 200ms.
> +	 */
> +	if (elapsed < MAX_PAUSE)
> +		return;
> +
> +	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> +
> +	/*
> +	 * Skip quiet periods when disk bandwidth is under-utilized.
> +	 * (at least 1s idle time between two flusher runs)
> +	 */
> +	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> +		goto snapshot;
  Can't we just explicitly stamp the written_stamp and bw_time_stamp at
the beginning of wb_writeback and be done with it? We wouldn't have to
pass the time stamps, keep them in wb_writeback() and balance_dirty_pages()
etc.

> +
> +	bdi_update_write_bandwidth(bdi, elapsed, written);
> +
> +snapshot:
> +	bdi->written_stamp = written;
> +	bdi->bw_time_stamp = now;
> +}
> +
> +static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> +				 unsigned long start_time)
> +{
> +	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
> +		return;
> +	if (spin_trylock(&bdi->wb.list_lock)) {
  I'm not sure this is a good idea - list_lock could be quite busy, which
could significantly delay bandwidth updates. I'd just take it
unconditionally.

> +		__bdi_update_bandwidth(bdi, start_time);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
>  /*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
> @@ -491,6 +569,7 @@ static void balance_dirty_pages(struct a
>  	unsigned long pause = 1;
>  	bool dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	unsigned long start_time = jiffies;
>  
>  	for (;;) {
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> @@ -545,6 +624,8 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> +		bdi_update_bandwidth(bdi, start_time);
> +
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
>  		 * filesystems (i.e. NFS) in which data may have been
> 

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/9] writeback: make writeback_control.nr_to_write straight
  2011-06-30 16:24   ` Jan Kara
@ 2011-07-01 12:03     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-01 12:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Andrew Morton, LKML

>   Just one minor nit:
> 
> > @@ -570,17 +622,25 @@ static int writeback_sb_inodes(struct su
> >  		iput(inode);
> >  		cond_resched();
> >  		spin_lock(&wb->list_lock);
> > -		if (wbc->nr_to_write <= 0)
> > -			return 1;
> > +		/*
> > +		 * bail out to wb_writeback() often enough to check
> > +		 * background threshold and other termination conditions.
> > +		 */
> > +		if (wrote) {
> > +			if (jiffies - start_time > HZ / 10UL)
> > +				break;
>   I guess this comparison should use time_before() macro - or maybe even
> time_is_before_jiffies().

Fair enough. Changed to:

        if (time_is_before_jiffies(start_time + HZ / 10UL))

> > +			if (work->nr_pages <= 0)
> > +				break;
> > +		}
> >  	}
> > -	/* b_io is empty */
> > -	return 1;
> > +	return wrote;
> >  }
> >  
> > -static void __writeback_inodes_wb(struct bdi_writeback *wb,
> > -				  struct writeback_control *wbc)
> > +static long __writeback_inodes_wb(struct bdi_writeback *wb,
> > +				  struct wb_writeback_work *work)
> >  {
> > -	int ret = 0;
> > +	unsigned long start_time = jiffies;
> > +	long wrote = 0;
> >  
> >  	while (!list_empty(&wb->b_io)) {
> >  		struct inode *inode = wb_inode(wb->b_io.prev);
> > @@ -590,33 +650,37 @@ static void __writeback_inodes_wb(struct
> >  			requeue_io(inode, wb);
> >  			continue;
> >  		}
> > -		ret = writeback_sb_inodes(sb, wb, wbc, false);
> > +		wrote += writeback_sb_inodes(sb, wb, work);
> >  		drop_super(sb);
> >  
> > -		if (ret)
> > -			break;
> > +		/* refer to the same tests at the end of writeback_sb_inodes */
> > +		if (wrote) {
> > +			if (jiffies - start_time > HZ / 10UL)
> > +				break;
>   And the same here.

Changed together. Thanks for the review!

Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-30 19:56   ` Jan Kara
@ 2011-07-01 14:58     ` Wu Fengguang
  2011-07-04  3:05       ` Wu Fengguang
  2011-07-13 23:30       ` Jan Kara
  0 siblings, 2 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-01 14:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Li, Shaohua, Peter Zijlstra, LKML

Hi Jan,

On Fri, Jul 01, 2011 at 03:56:09AM +0800, Jan Kara wrote:
>   Hello,
> 
> On Wed 29-06-11 22:52:48, Wu Fengguang wrote:
> > The estimation value will start from 100MB/s and adapt to the real
> > bandwidth in seconds.
> > 
> > It tries to update the bandwidth only when disk is fully utilized.
> > Any inactive period of more than one second will be skipped.
> > 
> > The estimated bandwidth will be reflecting how fast the device can
> > writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> > The value will remain constant at disk idle time. At busy write time, if
> > not considering fluctuations, it will also remain high unless be knocked
> > down by possible concurrent reads that compete for the disk time and
> > bandwidth with async writes.
> > 
> > The estimation is not done purely in the flusher because there is no
> > guarantee for write_cache_pages() to return timely to update bandwidth.
> > 
> > The bdi->avg_write_bandwidth smoothing is very effective for filtering
> > out sudden spikes, however may be a little biased in long term.
> > 
> > The overheads are low because the bdi bandwidth update only occurs at
> > 200ms intervals.
> > 
> > The 200ms update interval is suitable, becuase it's not possible to get
> > the real bandwidth for the instance at all, due to large fluctuations.
> > 
> > The NFS commits can be as large as seconds worth of data. One XFS
> > completion may be as large as half second worth of data if we are going
> > to increase the write chunk to half second worth of data. In ext4,
> > fluctuations with time period of around 5 seconds is observed. And there
> > is another pattern of irregular periods of up to 20 seconds on SSD tests.
> > 
> > That's why we are not only doing the estimation at 200ms intervals, but
> > also averaging them over a period of 3 seconds and then go further to do
> > another level of smoothing in avg_write_bandwidth.
>   I was thinking about your formulas and also observing how it behaves when
> writeback happens while the disk is loaded with other load as well (e.g.
> grep -r of a large tree or cp from another partition).
> 
> I agree that some kind of averaging is needed. But how we average depends
> on what do we need the number for. My thoughts were that there is not such
> a thing as *the* write bandwidth since that depends on the background load
> on the disk and also type of writes we do (sequential, random) as you noted
> as well. What writeback needs to estimate in fact is "how fast can we write
> this bulk of data?". Since we should ultimately size dirty limits and other
> writeback tunables so that the bulk of data can be written in order of
> seconds (but maybe much slower because of background load) I agree with
> your first choice that we should measure written pages in a window of
> several seconds - so your choice of 3 seconds is OK - but this number
> should have a reasoning attached to it in a comment (something like my
> elaborate above ;)

Agree totally, and thanks for the reasoning in great detail ;)

> Now to your second level of smoothing - is it really useful? We already

It's useful for filtering out sudden disturbances. Oh, I forgot to show
the SSD case, which sees sudden drops in throughput:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png

It's also very effective for XFS (see the below graph).

> average over several second window so that should really eliminate big
> spikes comming from bumpy IO completion from storage (NFS might be a
> separate question but we talked about that already). Your second level
> of smoothing just spreads the window even further but if you find that
> necessary, why not make the window larger in the first place? Also my

Because they are two different types of smoothing. I employed the same
smoothing as avg_write_bandwidth for bdi_dirty, where I wrote this
comment to illustrate its unique usefulness:

+       /*
+        * This ...
+        * ... is most effective on XFS,
+        * whose pattern is
+        *                                                             .
+        *      [.] dirty       [-] avg                       .       .
+        *                                                   .       .
+        *              .         .         .         .     .       .
+        *      ---------------------------------------    .       .
+        *            .         .         .         .     .       .
+        *           .         .         .         .     .       .
+        *          .         .         .         .     .       .
+        *         .         .         .         .     .       .
+        *        .         .         .         .
+        *       .         .         .         .      (fluctuated)
+        *      .         .         .         .
+        *     .         .         .         .
+        *
+        * @avg will remain flat at the cost of being biased towards high. In
+        * practice the error tends to be much smaller: thanks to more coarse
+        * grained fluctuations, @avg becomes the real average number for the
+        * last two rising lines of @dirty.
+        */
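
Concretely (made-up numbers): with avg=1000 and old=1100, a new cur=1200
satisfies "avg < old && old < cur", so avg += (1100 - 1000) >> 3 and becomes
1012; if cur instead dips to 400 for one interval, neither condition holds
and avg stays at 1000. So a short-lived spike or dip in ->write_bandwidth
never propagates into ->avg_write_bandwidth.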

> observations of avg-bandwidth and bandwidth when there's some background
> read load (e.g. from grep) shows that in this case both bandwidth and
> avg-bandwidth fluctuate +-10%. 

You can more easily see their fluctuation ranges in my graphs :)
For example,

(compare the YELLOW line to the RED dots)

NFS
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-bandwidth.png

XFS
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/balance_dirty_pages-bandwidth.png

ext4
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png

> Finally, as I reasoned above, we are
> interested in how much we can write in coming say 2 seconds so I don't see
> a big point smoothing the value too much (I'd even say it's undesirable).
> What do you think?

Yeah, we typically get a 20% ->write_bandwidth fluctuation range in the
above graphs (except for NFS), which seems reasonably good for guiding
the write chunk size.

However, I would still like to favor a more stable value, and
->avg_write_bandwidth looks appealing with its much smaller 3%-10% ranges
in steady state, because any fluctuations in the estimated write bandwidth
will directly become fluctuations in the write chunk sizes over time,
as well as in the application's throttled write speed in the future IO-less
balance_dirty_pages().
 
> Some code related comments are below:
> 
> > --- linux-next.orig/fs/fs-writeback.c	2011-06-23 10:31:40.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2011-06-29 16:23:05.000000000 +0800
> > @@ -693,6 +693,16 @@ static inline bool over_bground_thresh(v
> >  }
> >  
> >  /*
> > + * Called under wb->list_lock. If there are multiple wb per bdi,
> > + * only the flusher working on the first wb should do it.
> > + */
> > +static void wb_update_bandwidth(struct bdi_writeback *wb,
> > +				unsigned long start_time)
> > +{
> > +	__bdi_update_bandwidth(wb->bdi, start_time);
> > +}
> > +
> > +/*
> >   * Explicit flushing or periodic writeback of "old" data.
> >   *
> >   * Define "old": the first time one of an inode's pages is dirtied, we mark the
> > @@ -710,6 +720,7 @@ static inline bool over_bground_thresh(v
> >  static long wb_writeback(struct bdi_writeback *wb,
> >  			 struct wb_writeback_work *work)
> >  {
> > +	unsigned long wb_start = jiffies;
> >  	long nr_pages = work->nr_pages;
> >  	unsigned long oldest_jif;
> >  	struct inode *inode;
> > @@ -758,6 +769,8 @@ static long wb_writeback(struct bdi_writ
> >  			progress = __writeback_inodes_wb(wb, work);
> >  		trace_writeback_written(wb->bdi, work);
> >  
> > +		wb_update_bandwidth(wb, wb_start);
> > +
> >  		/*
> >  		 * Did we write something? Try for more
> >  		 *
> > --- linux-next.orig/include/linux/backing-dev.h	2011-06-23 10:31:40.000000000 +0800
> > +++ linux-next/include/linux/backing-dev.h	2011-06-29 16:23:05.000000000 +0800
> > @@ -73,6 +73,11 @@ struct backing_dev_info {
> >  
> >  	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
> >  
> > +	unsigned long bw_time_stamp;
> > +	unsigned long written_stamp;
> > +	unsigned long write_bandwidth;
> > +	unsigned long avg_write_bandwidth;
> > +
>   Some comments on what these are would be useful.

Sure:

        unsigned long bw_time_stamp;    /* last time write bw is updated */
        unsigned long written_stamp;    /* pages written at bw_time_stamp */
        unsigned long write_bandwidth;  /* the estimated write bandwidth */
        unsigned long avg_write_bandwidth; /* further smoothed write bw */

> >  	struct prop_local_percpu completions;
> >  	int dirty_exceeded;
> >  
> > --- linux-next.orig/mm/page-writeback.c	2011-06-23 10:31:40.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-06-29 16:23:44.000000000 +0800
> > @@ -37,6 +37,11 @@
> >  #include <trace/events/writeback.h>
> >  
> >  /*
> > + * Sleep at most 200ms at a time in balance_dirty_pages().
> > + */
> > +#define MAX_PAUSE	max(HZ/5, 1)
>   You use this for ratelimiting updates of bdi bandwidth. So we should
> really have a separate (and properly commented) constant for it.

Yeah, how about this?

/*
 * Estimate write bandwidth at 200ms intervals.
 */
#define BANDWIDTH_INTERVAL      max(HZ/5, 1)

> > +static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> > +				       unsigned long elapsed,
> > +				       unsigned long written)
> > +{
> > +	const unsigned long period = roundup_pow_of_two(3 * HZ);
> > +	unsigned long avg = bdi->avg_write_bandwidth;
> > +	unsigned long old = bdi->write_bandwidth;
> > +	unsigned long cur;
> > +	u64 bw;
> > +
> > +	bw = written - bdi->written_stamp;
> > +	bw *= HZ;
> > +	if (unlikely(elapsed > period / 2)) {
> > +		do_div(bw, elapsed);
> > +		elapsed = period / 2;
> > +		bw *= elapsed;
> > +	}
>   What is the rationale behind this condition? Why cannot we just use 
> written/elapsed when we have long enough period? Have you observed some

Yeah, in the normal case the formula is

(written * HZ / elapsed) * elapsed + bdi->write_bandwidth * (period - elapsed)
------------------------------------------------------------------------------
                                 period

and when the elapsed time is longer than half the period, the formula gives
equal weights to the new and the smoothed old bandwidth:

(written * HZ / elapsed) * (period/2) + bdi->write_bandwidth * (period/2)
------------------------------------------------------------------------------
                                 period

which normally happens when background writeout is going on without
any tasks throttled (if throttled, they'll kick the bandwidth
updates), especially in the case of the initial write chunk for slow
devices.

For example, ->write_bandwidth is initialized to 100MB/s, resulting in
an initial 50MB write chunk size (in [PATCH 8/9]), which will take 5s
to complete when writing a large file to a 10MB/s USB stick.

In that case, we do need to quickly adapt to the real bandwidth. So
I'll take your approach. Thanks for pointing this out.
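
To put rough numbers on it (illustrative only, assuming each sample in the
elapsed > period/2 branch measures the full 10MB/s): the equal-weight update
would move the estimate like

	100 -> 55 -> 32.5 -> 21.3 -> 15.6 -> 12.8 -> 11.4 MB/s

i.e. quite a few updates to get near the real bandwidth, whereas computing
written/elapsed directly over a long enough window gets there right after
the first full window.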

> problems with it (I estimate bandwidth in my patches like:
> static void update_bdi_throughput(struct backing_dev_info *bdi,
>                                 unsigned long written, unsigned long time)
> {
>        unsigned int deltams = jiffies_to_msecs(time - bdi->start_jiffies);
> 
>        written -= bdi->start_written;
>        if (deltams > WINDOW_MS) {
>                /* Add 1 to avoid 0 result */
>                bdi->pages_per_s = 1 + ((u64)written) * MSEC_PER_SEC / deltams;
>                return;
>        }
>        bdi->pages_per_s = 1 +
>                (((u64)bdi->pages_per_s) * (WINDOW_MS - deltams) +
>                 ((u64)written) * MSEC_PER_SEC) / WINDOW_MS;
> }
> 
> and I didn't observe any problems with it. Plus you need some comment
> explaining how you compute the throughput. I have something like:
> /*
>  * When the throughput is computed, we consider an imaginary WINDOW_MS
>  * miliseconds long window. In this window, we know that it took 'deltams'
>  * miliseconds to write 'written' pages and for the rest of the window we
>  * assume number of pages corresponding to the throughput we previously
>  * computed to have been written. Thus we obtain total number of pages
>  * written in the imaginary window and from it new throughput.
>  */

Oh, I thought the code was clear enough because it's the standard
running average technique... or shall we write down the formula before
simplification?

        /*
         * bw = written * HZ / elapsed
         *
         *                   bw * elapsed + write_bandwidth * (period - elapsed)
         * write_bandwidth = ---------------------------------------------------
         *                                          period
         */

> > +	bw += (u64)bdi->write_bandwidth * (period - elapsed);
> > +	cur = bw >> ilog2(period);
> > +
> > +	/*
> > +	 * one more level of smoothing, for filtering out sudden spikes
> > +	 */
> > +	if (avg > old && old > cur)
> > +		avg -= (avg - old) >> 3;
> > +
> > +	if (avg < old && old < cur)
> > +		avg += (old - avg) >> 3;
> > +
> > +	bdi->write_bandwidth = cur;
> > +	bdi->avg_write_bandwidth = avg;
> > +}
> > +
> > +void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> > +			    unsigned long start_time)
> > +{
> > +	unsigned long now = jiffies;
> > +	unsigned long elapsed = now - bdi->bw_time_stamp;
> > +	unsigned long written;
> > +
> > +	/*
> > +	 * rate-limit, only update once every 200ms.
> > +	 */
> > +	if (elapsed < MAX_PAUSE)
> > +		return;
> > +
> > +	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> > +
> > +	/*
> > +	 * Skip quiet periods when disk bandwidth is under-utilized.
> > +	 * (at least 1s idle time between two flusher runs)
> > +	 */
> > +	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> > +		goto snapshot;
>   Cannot we just explicitely stamp the written_stamp and bw_time_stamp at
> the beginning of wb_writeback and be done with it? We wouldn't have to
> pass the time stamps, keep them in wb_writeback() and balance_dirty_pages()
> etc.

I'm afraid that stamping may unnecessarily disturb/invalidate valid
accounting periods. For example, if the flusher keeps working on tiny
work items, it will effectively disable bandwidth updates.

> > +
> > +	bdi_update_write_bandwidth(bdi, elapsed, written);
> > +
> > +snapshot:
> > +	bdi->written_stamp = written;
> > +	bdi->bw_time_stamp = now;
> > +}
> > +
> > +static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> > +				 unsigned long start_time)
> > +{
> > +	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
> > +		return;
> > +	if (spin_trylock(&bdi->wb.list_lock)) {
>   I'm not sure this is a good idea - list_lock could be quite busy which
> would possibly quite delay updates of bandwidth. I'd just unconditionally
> take it.

OK, I'll do it. It's good to start out correct. We always have the
option to optimize if it proves to be a hot spot.
 
Thanks,
Fengguang

> > +		__bdi_update_bandwidth(bdi, start_time);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> >  /*
> >   * balance_dirty_pages() must be called by processes which are generating dirty
> >   * data.  It looks at the number of dirty pages in the machine and will force
> > @@ -491,6 +569,7 @@ static void balance_dirty_pages(struct a
> >  	unsigned long pause = 1;
> >  	bool dirty_exceeded = false;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	unsigned long start_time = jiffies;
> >  
> >  	for (;;) {
> >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > @@ -545,6 +624,8 @@ static void balance_dirty_pages(struct a
> >  		if (!bdi->dirty_exceeded)
> >  			bdi->dirty_exceeded = 1;
> >  
> > +		bdi_update_bandwidth(bdi, start_time);
> > +
> >  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> >  		 * Unstable writes are a feature of certain networked
> >  		 * filesystems (i.e. NFS) in which data may have been
> > 
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 9/9] writeback: trace global_dirty_state
  2011-06-29 14:52 ` [PATCH 9/9] writeback: trace global_dirty_state Wu Fengguang
@ 2011-07-01 15:18   ` Christoph Hellwig
  2011-07-01 15:45     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2011-07-01 15:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Wed, Jun 29, 2011 at 10:52:54PM +0800, Wu Fengguang wrote:
> Add trace event balance_dirty_state for showing the global dirty page
> counts and thresholds at each global_dirty_limits() invocation.  This
> will cover the callers throttle_vm_writeout(), over_bground_thresh()
> and each balance_dirty_pages() loop.

Given that this is a rather uncontroversial diagnostics patch, I'd
move it to the front of the queue.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
  2011-06-30 19:56   ` Jan Kara
@ 2011-07-01 15:20   ` Andrea Righi
  2011-07-08 11:53     ` Wu Fengguang
  2011-07-01 18:32   ` Vivek Goyal
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Andrea Righi @ 2011-07-01 15:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li Shaohua, Peter Zijlstra, LKML

On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> The estimation value will start from 100MB/s and adapt to the real
> bandwidth in seconds.
> 
> It tries to update the bandwidth only when disk is fully utilized.
> Any inactive period of more than one second will be skipped.
> 
> The estimated bandwidth will be reflecting how fast the device can
> writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> The value will remain constant at disk idle time. At busy write time, if
> not considering fluctuations, it will also remain high unless be knocked
> down by possible concurrent reads that compete for the disk time and
> bandwidth with async writes.
> 
> The estimation is not done purely in the flusher because there is no
> guarantee for write_cache_pages() to return timely to update bandwidth.
> 
> The bdi->avg_write_bandwidth smoothing is very effective for filtering
> out sudden spikes, however may be a little biased in long term.
> 
> The overheads are low because the bdi bandwidth update only occurs at
> 200ms intervals.
> 
> The 200ms update interval is suitable, becuase it's not possible to get
> the real bandwidth for the instance at all, due to large fluctuations.
> 
> The NFS commits can be as large as seconds worth of data. One XFS
> completion may be as large as half second worth of data if we are going
> to increase the write chunk to half second worth of data. In ext4,
> fluctuations with time period of around 5 seconds is observed. And there
> is another pattern of irregular periods of up to 20 seconds on SSD tests.
> 
> That's why we are not only doing the estimation at 200ms intervals, but
> also averaging them over a period of 3 seconds and then go further to do
> another level of smoothing in avg_write_bandwidth.
> 
> CC: Li Shaohua <shaohua.li@intel.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---

...

Another small nitpick about the time comparisons.

> +
> +static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> +				 unsigned long start_time)
> +{
> +	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)

	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + MAX_PAUSE +
						MAX_PAUSE / 10))

> +		return;
> +	if (spin_trylock(&bdi->wb.list_lock)) {
> +		__bdi_update_bandwidth(bdi, start_time);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
>  /*
>   * balance_dirty_pages() must be called by processes which are generating dirty
>   * data.  It looks at the number of dirty pages in the machine and will force
> @@ -491,6 +569,7 @@ static void balance_dirty_pages(struct a
>  	unsigned long pause = 1;
>  	bool dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	unsigned long start_time = jiffies;
>  
>  	for (;;) {
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> @@ -545,6 +624,8 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> +		bdi_update_bandwidth(bdi, start_time);
> +
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
>  		 * filesystems (i.e. NFS) in which data may have been
> 
> 

-Andrea

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/9] writeback: introduce smoothed global dirty limit
  2011-06-29 14:52 ` [PATCH 6/9] writeback: introduce smoothed global dirty limit Wu Fengguang
@ 2011-07-01 15:20   ` Andrea Righi
  2011-07-08 11:51     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Righi @ 2011-07-01 15:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

On Wed, Jun 29, 2011 at 10:52:51PM +0800, Wu Fengguang wrote:
> The start of a heavy weight application (ie. KVM) may instantly knock
> down determine_dirtyable_memory() if the swap is not enabled or full.
> global_dirty_limits() and bdi_dirty_limit() will in turn get global/bdi
> dirty thresholds that are _much_ lower than the global/bdi dirty pages.
> 
> balance_dirty_pages() will then heavily throttle all dirtiers including
> the light ones, until the dirty pages drop below the new dirty thresholds.
> During this _deep_ dirty-exceeded state, the system may appear rather
> unresponsive to the users.
> 
> About "deep" dirty-exceeded: task_dirty_limit() assigns 1/8 lower dirty
> threshold to heavy dirtiers than light ones, and the dirty pages will
> be throttled around the heavy dirtiers' dirty threshold and reasonably
> below the light dirtiers' dirty threshold. In this state, only the heavy
> dirtiers will be throttled and the dirty pages are carefully controlled
> to not exceed the light dirtiers' dirty threshold. However if the
> threshold itself suddenly drops below the number of dirty pages, the
> light dirtiers will get heavily throttled.
> 
> So introduce global_dirty_limit for tracking the global dirty threshold
> with policies
> 
> - follow downwards slowly
> - follow up in one shot
> 
> global_dirty_limit can effectively mask out the impact of sudden drop of
> dirtyable memory. It will be used in the next patch for two new type of
> dirty limits. Note that the new dirty limits are not going to avoid
> throttling the light dirtiers, but could limit their sleep time to 200ms.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

...

> +static void global_update_bandwidth(unsigned long thresh,
> +				    unsigned long dirty,
> +				    unsigned long now)
> +{
> +	static DEFINE_SPINLOCK(dirty_lock);
> +	static unsigned long update_time;
> +
> +	/*
> +	 * Do a lockless check first to optimize away locking for most time.
> +	 */
> +	if (now - update_time < MAX_PAUSE)

	if (time_before(now, update_time + MAX_PAUSE))

> +		return;
> +
> +	spin_lock(&dirty_lock);
> +	if (now - update_time >= MAX_PAUSE) {

	if (time_after_eq(now, update_time + MAX_PAUSE))

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 9/9] writeback: trace global_dirty_state
  2011-07-01 15:18   ` Christoph Hellwig
@ 2011-07-01 15:45     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-01 15:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Andrew Morton, LKML

On Fri, Jul 01, 2011 at 11:18:01PM +0800, Christoph Hellwig wrote:
> On Wed, Jun 29, 2011 at 10:52:54PM +0800, Wu Fengguang wrote:
> > Add trace event balance_dirty_state for showing the global dirty page
> > counts and thresholds at each global_dirty_limits() invocation.  This
> > will cover the callers throttle_vm_writeout(), over_bground_thresh()
> > and each balance_dirty_pages() loop.
> 
> Given that this is a rather uncontoversial diagnostics patch I'd
> move it to the front of the queue.

Yeah, it was... Since then it's been expanded with the global dirty limit
field, which is newly introduced in [PATCH 6/9].

I'll move it before the larger-chunk-size patch :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
  2011-06-30 19:56   ` Jan Kara
  2011-07-01 15:20   ` Andrea Righi
@ 2011-07-01 18:32   ` Vivek Goyal
  2011-07-23  8:02     ` Wu Fengguang
  2011-07-01 19:19   ` Vivek Goyal
  2011-07-01 19:29   ` Vivek Goyal
  4 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2011-07-01 18:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li Shaohua, Peter Zijlstra, LKML

On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> The estimation value will start from 100MB/s and adapt to the real
> bandwidth in seconds.
> 
> It tries to update the bandwidth only when disk is fully utilized.
> Any inactive period of more than one second will be skipped.
> 
> The estimated bandwidth will be reflecting how fast the device can
> writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> The value will remain constant at disk idle time. At busy write time, if
> not considering fluctuations, it will also remain high unless be knocked
> down by possible concurrent reads that compete for the disk time and
> bandwidth with async writes.
> 
> The estimation is not done purely in the flusher because there is no
> guarantee for write_cache_pages() to return timely to update bandwidth.
> 
> The bdi->avg_write_bandwidth smoothing is very effective for filtering
> out sudden spikes, however may be a little biased in long term.
> 
> The overheads are low because the bdi bandwidth update only occurs at
> 200ms intervals.
> 
> The 200ms update interval is suitable, becuase it's not possible to get
> the real bandwidth for the instance at all, due to large fluctuations.
> 
> The NFS commits can be as large as seconds worth of data. One XFS
> completion may be as large as half second worth of data if we are going
> to increase the write chunk to half second worth of data. In ext4,
> fluctuations with time period of around 5 seconds is observed. And there
> is another pattern of irregular periods of up to 20 seconds on SSD tests.
> 
> That's why we are not only doing the estimation at 200ms intervals, but
> also averaging them over a period of 3 seconds and then go further to do
> another level of smoothing in avg_write_bandwidth.

What IO scheduler have you used for testing? CFQ nowadays almost chokes
async requests in the presence of lots of sync IO. Have you done some testing
with that scenario to see how quickly you adjust to that change?

/me is trying to wrap his head around all the smoothing and bandwidth
calculation functions. Wish there was more explanation of it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
                     ` (2 preceding siblings ...)
  2011-07-01 18:32   ` Vivek Goyal
@ 2011-07-01 19:19   ` Vivek Goyal
  2011-07-01 19:29   ` Vivek Goyal
  4 siblings, 0 replies; 29+ messages in thread
From: Vivek Goyal @ 2011-07-01 19:19 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li Shaohua, Peter Zijlstra, LKML

On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> The estimation value will start from 100MB/s and adapt to the real
> bandwidth in seconds.
> 
> It tries to update the bandwidth only when disk is fully utilized.
> Any inactive period of more than one second will be skipped.
> 
> The estimated bandwidth will be reflecting how fast the device can
> writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> The value will remain constant at disk idle time. At busy write time, if
> not considering fluctuations, it will also remain high unless be knocked
> down by possible concurrent reads that compete for the disk time and
> bandwidth with async writes.
> 
> The estimation is not done purely in the flusher because there is no
> guarantee for write_cache_pages() to return timely to update bandwidth.
> 
> The bdi->avg_write_bandwidth smoothing is very effective for filtering
> out sudden spikes, however may be a little biased in long term.
> 
> The overheads are low because the bdi bandwidth update only occurs at
> 200ms intervals.
> 
> The 200ms update interval is suitable, becuase it's not possible to get
> the real bandwidth for the instance at all, due to large fluctuations.
> 
> The NFS commits can be as large as seconds worth of data. One XFS
> completion may be as large as half second worth of data if we are going
> to increase the write chunk to half second worth of data. In ext4,
> fluctuations with time period of around 5 seconds is observed. And there
> is another pattern of irregular periods of up to 20 seconds on SSD tests.
> 
> That's why we are not only doing the estimation at 200ms intervals, but
> also averaging them over a period of 3 seconds and then go further to do
> another level of smoothing in avg_write_bandwidth.
> 
> CC: Li Shaohua <shaohua.li@intel.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c           |   13 +++++
>  include/linux/backing-dev.h |    5 ++
>  include/linux/writeback.h   |    3 +
>  mm/backing-dev.c            |   12 +++++
>  mm/page-writeback.c         |   81 ++++++++++++++++++++++++++++++++++
>  5 files changed, 114 insertions(+)
> 
> To get an idea of the adaption speed and fluctuation range, here are
> some real examples (check the red dots and the yellow line):
> 

IIUC, the following examples are with a dd workload only, with variation in
file systems. How about variation of workload (mix of seq and random writes,
competing synchronous workload) and variation of IO schedulers? All of the
above should change write bandwidth more unpredictably and it would be
interesting to see how well the algorithm works in those cases.

Thanks
Vivek
 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-bandwidth.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext3-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.22:02/balance_dirty_pages-bandwidth.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/btrfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.22:07/balance_dirty_pages-bandwidth.png
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
                     ` (3 preceding siblings ...)
  2011-07-01 19:19   ` Vivek Goyal
@ 2011-07-01 19:29   ` Vivek Goyal
  2011-07-23  8:07     ` Wu Fengguang
  4 siblings, 1 reply; 29+ messages in thread
From: Vivek Goyal @ 2011-07-01 19:29 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li Shaohua, Peter Zijlstra, LKML

On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> The estimation value will start from 100MB/s and adapt to the real
> bandwidth in seconds.
> 
> It tries to update the bandwidth only when disk is fully utilized.
> Any inactive period of more than one second will be skipped.

Is this piece of code being tested in your graphs? Are there any
inactive periods which get filtered out in your workload? I am 
assuming in a continuous dd, we will not have periods of inactivity
1 second long.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-07-01 14:58     ` Wu Fengguang
@ 2011-07-04  3:05       ` Wu Fengguang
  2011-07-13 23:30       ` Jan Kara
  1 sibling, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-04  3:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Andrew Morton,
	Li, Shaohua, Peter Zijlstra, LKML

Jan,

This is the updated patch after your review.

Thanks,
Fengguang
---
Subject: writeback: bdi write bandwidth estimation
Date: Sun Aug 29 11:22:30 CST 2010

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.

It tries to update the bandwidth only when the disk is fully utilized.
Any inactive period of more than one second will be skipped.

The estimated bandwidth will reflect how fast the device can write out
when _fully utilized_, and won't drop to 0 when it goes idle. The value
will remain constant at disk idle time. At busy write time, if not
considering fluctuations, it will also remain high unless knocked down
by possible concurrent reads that compete for the disk time and
bandwidth with async writes.

The estimation is not done purely in the flusher because there is no
guarantee for write_cache_pages() to return in a timely manner to
update the bandwidth.

The bdi->avg_write_bandwidth smoothing is very effective for filtering
out sudden spikes, however it may be a little biased in the long term.

The overheads are low because the bdi bandwidth update only occurs at
200ms intervals.

The 200ms update interval is suitable, because it's not possible to get
the real instantaneous bandwidth at all, due to large fluctuations.

The NFS commits can be as large as seconds' worth of data. One XFS
completion may be as large as half a second's worth of data if we are
going to increase the write chunk to half a second's worth of data. In
ext4, fluctuations with a time period of around 5 seconds are observed.
And there is another pattern of irregular periods of up to 20 seconds
on SSD tests.

That's why we are not only doing the estimation at 200ms intervals, but
also averaging them over a period of 3 seconds and then going further
to do another level of smoothing in avg_write_bandwidth.

CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   13 +++++
 include/linux/backing-dev.h |    5 +
 include/linux/writeback.h   |    3 +
 mm/backing-dev.c            |   12 ++++
 mm/page-writeback.c         |   87 ++++++++++++++++++++++++++++++++++
 5 files changed, 120 insertions(+)

To get an idea of the adaptation speed and fluctuation range, here are
some real examples (check the red dots and the yellow line):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/xfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:51/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext3-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.22:02/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/btrfs-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.22:07/balance_dirty_pages-bandwidth.png

old version:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/ext3-1dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-51/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-1dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-10/balance_dirty_pages-bandwidth.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-bandwidth.png

--- linux-next.orig/fs/fs-writeback.c	2011-07-01 21:03:16.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-07-01 21:10:48.000000000 +0800
@@ -693,6 +693,16 @@ static inline bool over_bground_thresh(v
 }
 
 /*
+ * Called under wb->list_lock. If there are multiple wb per bdi,
+ * only the flusher working on the first wb should do it.
+ */
+static void wb_update_bandwidth(struct bdi_writeback *wb,
+				unsigned long start_time)
+{
+	__bdi_update_bandwidth(wb->bdi, start_time);
+}
+
+/*
  * Explicit flushing or periodic writeback of "old" data.
  *
  * Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -710,6 +720,7 @@ static inline bool over_bground_thresh(v
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
+	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
 	struct inode *inode;
@@ -758,6 +769,8 @@ static long wb_writeback(struct bdi_writ
 			progress = __writeback_inodes_wb(wb, work);
 		trace_writeback_written(wb->bdi, work);
 
+		wb_update_bandwidth(wb, wb_start);
+
 		/*
 		 * Did we write something? Try for more
 		 *
--- linux-next.orig/include/linux/backing-dev.h	2011-07-01 21:03:16.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-07-01 21:19:03.000000000 +0800
@@ -73,6 +73,11 @@ struct backing_dev_info {
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
+	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long written_stamp;	/* pages written at bw_time_stamp */
+	unsigned long write_bandwidth;	/* the estimated write bandwidth */
+	unsigned long avg_write_bandwidth; /* further smoothed write bw */
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
 
--- linux-next.orig/include/linux/writeback.h	2011-07-01 21:03:16.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-07-01 21:10:48.000000000 +0800
@@ -118,6 +118,9 @@ void global_dirty_limits(unsigned long *
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
 
+void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long start_time);
+
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied);
--- linux-next.orig/mm/backing-dev.c	2011-07-01 21:03:16.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-07-01 21:10:48.000000000 +0800
@@ -638,6 +638,11 @@ static void bdi_wb_init(struct bdi_write
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
 
+/*
+ * Initial write bandwidth: 100 MB/s
+ */
+#define INIT_BW		(100 << (20 - PAGE_SHIFT))
+
 int bdi_init(struct backing_dev_info *bdi)
 {
 	int i, err;
@@ -660,6 +665,13 @@ int bdi_init(struct backing_dev_info *bd
 	}
 
 	bdi->dirty_exceeded = 0;
+
+	bdi->bw_time_stamp = jiffies;
+	bdi->written_stamp = 0;
+
+	bdi->write_bandwidth = INIT_BW;
+	bdi->avg_write_bandwidth = INIT_BW;
+
 	err = prop_local_init_percpu(&bdi->completions);
 
 	if (err) {
--- linux-next.orig/mm/page-writeback.c	2011-07-01 21:03:16.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-07-01 22:43:30.000000000 +0800
@@ -37,6 +37,11 @@
 #include <trace/events/writeback.h>
 
 /*
+ * Estimate write bandwidth at 200ms intervals.
+ */
+#define BANDWIDTH_INTERVAL	max(HZ/5, 1)
+
+/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
@@ -471,6 +476,85 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				       unsigned long elapsed,
+				       unsigned long written)
+{
+	const unsigned long period = roundup_pow_of_two(3 * HZ);
+	unsigned long avg = bdi->avg_write_bandwidth;
+	unsigned long old = bdi->write_bandwidth;
+	u64 bw;
+
+	/*
+	 * bw = written * HZ / elapsed
+	 *
+	 *                   bw * elapsed + write_bandwidth * (period - elapsed)
+	 * write_bandwidth = ---------------------------------------------------
+	 *                                          period
+	 */
+	bw = written - bdi->written_stamp;
+	bw *= HZ;
+	if (unlikely(elapsed > period)) {
+		do_div(bw, elapsed);
+		avg = bw;
+		goto out;
+	}
+	bw += (u64)bdi->write_bandwidth * (period - elapsed);
+	bw >>= ilog2(period);
+
+	/*
+	 * one more level of smoothing, for filtering out sudden spikes
+	 */
+	if (avg > old && old >= (unsigned long)bw)
+		avg -= (avg - old) >> 3;
+
+	if (avg < old && old <= (unsigned long)bw)
+		avg += (old - avg) >> 3;
+
+out:
+	bdi->write_bandwidth = bw;
+	bdi->avg_write_bandwidth = avg;
+}
+
+void __bdi_update_bandwidth(struct backing_dev_info *bdi,
+			    unsigned long start_time)
+{
+	unsigned long now = jiffies;
+	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long written;
+
+	/*
+	 * rate-limit, only update once every 200ms.
+	 */
+	if (elapsed < BANDWIDTH_INTERVAL)
+		return;
+
+	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+
+	/*
+	 * Skip quiet periods when disk bandwidth is under-utilized.
+	 * (at least 1s idle time between two flusher runs)
+	 */
+	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
+		goto snapshot;
+
+	bdi_update_write_bandwidth(bdi, elapsed, written);
+
+snapshot:
+	bdi->written_stamp = written;
+	bdi->bw_time_stamp = now;
+}
+
+static void bdi_update_bandwidth(struct backing_dev_info *bdi,
+				 unsigned long start_time)
+{
+	if (jiffies - bdi->bw_time_stamp <= BANDWIDTH_INTERVAL)
+		return;
+	spin_lock(&bdi->wb.list_lock);
+	__bdi_update_bandwidth(bdi, start_time);
+	spin_unlock(&bdi->wb.list_lock);
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -490,6 +574,7 @@ static void balance_dirty_pages(struct a
 	unsigned long pause = 1;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long start_time = jiffies;
 
 	for (;;) {
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
@@ -544,6 +629,8 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
+		bdi_update_bandwidth(bdi, start_time);
+
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 6/9] writeback: introduce smoothed global dirty limit
  2011-07-01 15:20   ` Andrea Righi
@ 2011-07-08 11:51     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-08 11:51 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, LKML

> > +	if (now - update_time < MAX_PAUSE)
> 
> 	if (time_before(now, update_time + MAX_PAUSE))
> 
> > +		return;
> > +
> > +	spin_lock(&dirty_lock);
> > +	if (now - update_time >= MAX_PAUSE) {
> 
> 	if (time_after_eq(now, update_time + MAX_PAUSE))

Changed, thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-07-01 15:20   ` Andrea Righi
@ 2011-07-08 11:53     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-08 11:53 UTC (permalink / raw)
  To: Andrea Righi
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li, Shaohua, Peter Zijlstra, LKML

Andrea,

> Another small time nitpick.
> 
> > +
> > +static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> > +				 unsigned long start_time)
> > +{
> > +	if (jiffies - bdi->bw_time_stamp <= MAX_PAUSE + MAX_PAUSE / 10)
> 
> 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + MAX_PAUSE +
> 						MAX_PAUSE / 10)

Done, and simplified to

        if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
                return;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-07-01 14:58     ` Wu Fengguang
  2011-07-04  3:05       ` Wu Fengguang
@ 2011-07-13 23:30       ` Jan Kara
  2011-07-23  7:26         ` Wu Fengguang
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-07-13 23:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li, Shaohua, Peter Zijlstra, LKML

  Hi Fengguang,

On Fri 01-07-11 22:58:31, Wu Fengguang wrote:
> On Fri, Jul 01, 2011 at 03:56:09AM +0800, Jan Kara wrote:
> > On Wed 29-06-11 22:52:48, Wu Fengguang wrote:
> > > The estimation value will start from 100MB/s and adapt to the real
> > > bandwidth in seconds.
> > > 
> > > It tries to update the bandwidth only when disk is fully utilized.
> > > Any inactive period of more than one second will be skipped.
> > > 
> > > The estimated bandwidth will be reflecting how fast the device can
> > > writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> > > The value will remain constant at disk idle time. At busy write time, if
> > > not considering fluctuations, it will also remain high unless be knocked
> > > down by possible concurrent reads that compete for the disk time and
> > > bandwidth with async writes.
> > > 
> > > The estimation is not done purely in the flusher because there is no
> > > guarantee for write_cache_pages() to return timely to update bandwidth.
> > > 
> > > The bdi->avg_write_bandwidth smoothing is very effective for filtering
> > > out sudden spikes, however may be a little biased in long term.
> > > 
> > > The overheads are low because the bdi bandwidth update only occurs at
> > > 200ms intervals.
> > > 
> > > The 200ms update interval is suitable, because it's not possible to get
> > > the real bandwidth for the instance at all, due to large fluctuations.
> > > 
> > > The NFS commits can be as large as seconds worth of data. One XFS
> > > completion may be as large as half second worth of data if we are going
> > > to increase the write chunk to half second worth of data. In ext4,
> > > fluctuations with time period of around 5 seconds is observed. And there
> > > is another pattern of irregular periods of up to 20 seconds on SSD tests.
> > > 
> > > That's why we are not only doing the estimation at 200ms intervals, but
> > > also averaging them over a period of 3 seconds and then go further to do
> > > another level of smoothing in avg_write_bandwidth.
> >   I was thinking about your formulas and also observing how it behaves when
> > writeback happens while the disk is loaded with other load as well (e.g.
> > grep -r of a large tree or cp from another partition).
> > 
> > I agree that some kind of averaging is needed. But how we average depends
> > on what do we need the number for. My thoughts were that there is not such
> > a thing as *the* write bandwidth since that depends on the background load
> > on the disk and also type of writes we do (sequential, random) as you noted
> > as well. What writeback needs to estimate in fact is "how fast can we write
> > this bulk of data?". Since we should ultimately size dirty limits and other
> > writeback tunables so that the bulk of data can be written in order of
> > seconds (but maybe much slower because of background load) I agree with
> > your first choice that we should measure written pages in a window of
> > several seconds - so your choice of 3 seconds is OK - but this number
> > should have a reasoning attached to it in a comment (something like my
> > elaborate above ;)
> 
> Agree totally and thanks for the reasoning in great details ;)
> 
> > Now to your second level of smoothing - is it really useful? We already
> 
> It's useful for filtering out sudden disturbances. Oh I forgot to show
> the SSD case which see sudden drops of throughput:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png
> 
> It's also very effective for XFS (see the below graph).
  I see. I think I finally understood what your second level of smoothing
does. When e.g. IO is stalled for some time and then gets back up to speed
for a while, your second level of smoothing erases this spike when the stall
is shorter than twice the update time of the bandwidth (if it is longer,
bandwidth will drop two times in a row and you start decreasing your
smoothed bandwidth as well).

> > average over several second window so that should really eliminate big
> > spikes comming from bumpy IO completion from storage (NFS might be a
> > separate question but we talked about that already). Your second level
> > of smoothing just spreads the window even further but if you find that
> > necessary, why not make the window larger in the first place? Also my
> 
> Because they are two different type of smoothing. I employed the same
> smoothing as avg_write_bandwidth for bdi_dirty, where I wrote this
> comment to illustrate its unique usefulness:
  But is your second level of smoothing really that much different from
making the window over which we average larger? E.g. if you have 3s window
and the IO stalls 1s, you will see 33% variation in computed bandwidth.
But if you had 10s window, you would see only 10% variation.

To get some real numbers I've run simple dd on XFS filesystem and plotted
basic computed bandwidth and smoothed bandwidth with 3s and 10s window.
The results are at:
http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-write.png
http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-write-10s.png

You can see that with the 10s window the basic bandwidth is (unsurprisingly)
quite a bit closer to your smoothed bandwidth than with the 3s window. Of
course if the variations in throughput are longer in time, the throughput
will oscillate more even with a larger window. But that is the case with
your smoothed
bandwidth as well and it is in fact desirable because as soon as amount of
data we can write per second is lower for several seconds, we have to
really consider this and change writeback tunables accordingly. To
demostrate the changes in smoothed bandwidth, I've run a test where we are
writing lots of 10MB files to the filesystem and in paralel we read randomly
1-1000MB from the filesystem and then sleep for 1-15s. The results (with 3s
and 10s window) are at:
http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-read-dd-write.png
http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-read-dd-write-10s.png

You can see that both basic and smoothed bandwidth are not that much
different even with the 3s window, and with the 10s window the differences
are negligible, I'd say.

So both from my understanding and my experiments, I'd say that the basic
computation of bandwidth should be enough and if you want to be on a
smoother side, you can just increase the window size and you will get
rather similar results as with your second level of smoothing.

> +       /*
> +        * This ...
> +        * ... is most effective on XFS,
> +        * whose pattern is
> +        *                                                             .
> +        *      [.] dirty       [-] avg                       .       .
> +        *                                                   .       .
> +        *              .         .         .         .     .       .
> +        *      ---------------------------------------    .       .
> +        *            .         .         .         .     .       .
> +        *           .         .         .         .     .       .
> +        *          .         .         .         .     .       .
> +        *         .         .         .         .     .       .
> +        *        .         .         .         .
> +        *       .         .         .         .      (fluctuated)
> +        *      .         .         .         .
> +        *     .         .         .         .
> +        *
> +        * @avg will remain flat at the cost of being biased towards high. In
> +        * practice the error tend to be much smaller: thanks to more coarse
> +        * grained fluctuations, @avg becomes the real average number for the
> +        * last two rising lines of @dirty.
> +        */
> 
> > observations of avg-bandwidth and bandwidth when there's some background
> > read load (e.g. from grep) shows that in this case both bandwidth and
> > avg-bandwidth fluctuate +-10%. 
> 
> You can more easily see their fluctuate ranges in my graphs :)
> For example,
> 
> (compare the YELLOW line to the RED dots)
> 
> NFS
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-bandwidth.png
> 
> XFS
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/balance_dirty_pages-bandwidth.png
> 
> ext4
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G/ext4-1dd-4k-8p-2948M-20:10-3.0.0-rc2-next-20110610+-2011-06-12.21:57/balance_dirty_pages-bandwidth.png
> 
> > Finally, as I reasoned above, we are
> > interested in how much we can write in coming say 2 seconds so I don't see
> > a big point smoothing the value too much (I'd even say it's undesirable).
> > What do you think?
> 
> Yeah we typically get 20% ->write_bandwidth fluctuation range in the
> above graphs (except for NFS), which seems reasonably good for guiding
> the write chunk size.
> 
> However I would still like to favor a more stable value and
> ->avg_write_bandwidth looks appealing with much smaller 3%-10% ranges
> in steady state. Because the fluctuations of estimated write bandwidth
> will directly become the fluctuations in write chunk sizes over time
> as well as the application throttled write speed in future IO-less
> balance_dirty_pages().
  I've written about this above.

> Oh I thought the code is clear enough because it's the standard
> running average technique... or let's write down the before-simplified
> formula?
> 
>         /*
>          * bw = written * HZ / elapsed
>          *
>          *                   bw * elapsed + write_bandwidth * (period - elapsed)
>          * write_bandwidth = ---------------------------------------------------
>          *                                          period
>          */
  Yes, it is a standard running average but you shouldn't have to decode
that from the code. The formula explains that nicely. Thanks.

> > > +	bw += (u64)bdi->write_bandwidth * (period - elapsed);
> > > +	cur = bw >> ilog2(period);
> > > +
> > > +	/*
> > > +	 * one more level of smoothing, for filtering out sudden spikes
> > > +	 */
> > > +	if (avg > old && old > cur)
> > > +		avg -= (avg - old) >> 3;
> > > +
> > > +	if (avg < old && old < cur)
> > > +		avg += (old - avg) >> 3;
> > > +
> > > +	bdi->write_bandwidth = cur;
> > > +	bdi->avg_write_bandwidth = avg;
> > > +}
> > > +
> > > +void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> > > +			    unsigned long start_time)
> > > +{
> > > +	unsigned long now = jiffies;
> > > +	unsigned long elapsed = now - bdi->bw_time_stamp;
> > > +	unsigned long written;
> > > +
> > > +	/*
> > > +	 * rate-limit, only update once every 200ms.
> > > +	 */
> > > +	if (elapsed < MAX_PAUSE)
> > > +		return;
> > > +
> > > +	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> > > +
> > > +	/*
> > > +	 * Skip quiet periods when disk bandwidth is under-utilized.
> > > +	 * (at least 1s idle time between two flusher runs)
> > > +	 */
> > > +	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> > > +		goto snapshot;
> >   Cannot we just explicitely stamp the written_stamp and bw_time_stamp at
> > the beginning of wb_writeback and be done with it? We wouldn't have to
> > pass the time stamps, keep them in wb_writeback() and balance_dirty_pages()
> > etc.
> 
> I'm afraid that stamping may unnecessarily disturb/invalidate valid
> accounting periods. For example, if the flusher keeps working on tiny
> works, it will effectively disable bandwidth updates.
  OK, sounds reasonable.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-07-13 23:30       ` Jan Kara
@ 2011-07-23  7:26         ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-23  7:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Vivek Goyal, linux-fsdevel, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li, Shaohua, Peter Zijlstra, LKML

Hi Jan,

On Thu, Jul 14, 2011 at 07:30:16AM +0800, Jan Kara wrote:
>   Hi Fengguang,
>
> On Fri 01-07-11 22:58:31, Wu Fengguang wrote:
> > On Fri, Jul 01, 2011 at 03:56:09AM +0800, Jan Kara wrote:
> > > On Wed 29-06-11 22:52:48, Wu Fengguang wrote:
> > > > The estimation value will start from 100MB/s and adapt to the real
> > > > bandwidth in seconds.
> > > >
> > > > It tries to update the bandwidth only when disk is fully utilized.
> > > > Any inactive period of more than one second will be skipped.
> > > >
> > > > The estimated bandwidth will be reflecting how fast the device can
> > > > writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> > > > The value will remain constant at disk idle time. At busy write time, if
> > > > not considering fluctuations, it will also remain high unless be knocked
> > > > down by possible concurrent reads that compete for the disk time and
> > > > bandwidth with async writes.
> > > >
> > > > The estimation is not done purely in the flusher because there is no
> > > > guarantee for write_cache_pages() to return timely to update bandwidth.
> > > >
> > > > The bdi->avg_write_bandwidth smoothing is very effective for filtering
> > > > out sudden spikes, however may be a little biased in long term.
> > > >
> > > > The overheads are low because the bdi bandwidth update only occurs at
> > > > 200ms intervals.
> > > >
> > > > The 200ms update interval is suitable, because it's not possible to get
> > > > the real bandwidth for the instance at all, due to large fluctuations.
> > > >
> > > > The NFS commits can be as large as seconds worth of data. One XFS
> > > > completion may be as large as half second worth of data if we are going
> > > > to increase the write chunk to half second worth of data. In ext4,
> > > > fluctuations with time period of around 5 seconds is observed. And there
> > > > is another pattern of irregular periods of up to 20 seconds on SSD tests.
> > > >
> > > > That's why we are not only doing the estimation at 200ms intervals, but
> > > > also averaging them over a period of 3 seconds and then go further to do
> > > > another level of smoothing in avg_write_bandwidth.
> > >   I was thinking about your formulas and also observing how it behaves when
> > > writeback happens while the disk is loaded with other load as well (e.g.
> > > grep -r of a large tree or cp from another partition).
> > >
> > > I agree that some kind of averaging is needed. But how we average depends
> > > on what do we need the number for. My thoughts were that there is not such
> > > a thing as *the* write bandwidth since that depends on the background load
> > > on the disk and also type of writes we do (sequential, random) as you noted
> > > as well. What writeback needs to estimate in fact is "how fast can we write
> > > this bulk of data?". Since we should ultimately size dirty limits and other
> > > writeback tunables so that the bulk of data can be written in order of
> > > seconds (but maybe much slower because of background load) I agree with
> > > your first choice that we should measure written pages in a window of
> > > several seconds - so your choice of 3 seconds is OK - but this number
> > > should have a reasoning attached to it in a comment (something like my
> > > elaborate above ;)
> >
> > Agree totally and thanks for the reasoning in great details ;)
> >
> > > Now to your second level of smoothing - is it really useful? We already
> >
> > It's useful for filtering out sudden disturbances. Oh I forgot to show
> > the SSD case which see sudden drops of throughput:
> >
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1SSD-64G/ext4-1dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-03-01-16-19/balance_dirty_pages-bandwidth.png
> >
> > It's also very effective for XFS (see the below graph).
>   I see. I think I finally understood what your second level of smoothing
> does. When e.g. IO is stalled for some time and then is getting up to speed
> for a while your second level of smoothing erases this spike when the stall
> is shorter than twice update time of the bandwidth (if it is more,
> bandwidth will drop two times in a row and you start decreasing your
> smoothed bandwidth as well).

Yeah it does help that case. However that's not the main use case in my mind.

Technically speaking, the input to the write bandwidth estimation
is a series of points (T_i, W_i), where T_i is the i-th time delta
and W_i is the i-th written delta.

T_i may be any value from BANDWIDTH_INTERVAL=200ms to 1s or even more.
For example, XFS will regularly have T_i up to 500ms after the patch
"writeback: scale IO chunk size up to half device bandwidth".

When T_i > 200ms, the directly estimated write_bandwidth will consist
of a series of sudden-up-slow-down spikes. The below graph shows such
a spike. The merit of avg_write_bandwidth is that it can _completely_
smooth out such spikes.

        *
        *  *
        *     *                               [*] write_bandwidth
        *        *                            [.] avg_write_bandwidth
        *           *
        *               *
        *                   *
        *                       *
        *                            *
        *                                 *
        *                                       *
        *                                              *
  ......*........................................................
  *******

You cannot observe such patterns in ext3/4 because they complete IO
much more frequently than BANDWIDTH_INTERVAL. However, if you take a
close look at XFS, it's happening all the time:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G-60%25/xfs-1dd-1M-8p-3911M-60%25-2.6.38-rc5-dt6+-2011-02-22-11-10/balance_dirty_pages-bandwidth-500.png

As you can see, avg_write_bandwidth has the drawback of possibly
introducing some systematic error in such cases. And here lies the
trade-off:

- "accurate" and "timely" bandwidth estimation, not sensitive to fluctuations
  => use write_bandwidth

- constant fluctuations very undesirable, not sensitive to estimation errors
  => use avg_write_bandwidth

IMHO avg_write_bandwidth will be particularly useful for IO-less
balance_dirty_pages(), as the users are in general much more sensitive
to the application dirty bandwidth (which will directly inherit the
fluctuations from avg_write_bandwidth) than to the number of dirty pages
(whose distance to the dirty threshold will be affected by the
estimation error).
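
To make this concrete, below is a minimal userspace sketch (my own toy
model with made-up numbers, not kernel code: it uses floating point and
a hard-coded 3s window instead of the fixed-point/ilog2 arithmetic in
the patch) that feeds one such sudden-up-slow-down spike through both
levels of smoothing:

/*
 * smoothing-demo.c - toy model of the two-level write bandwidth
 * smoothing.  Simplified from the patch: floating point instead of
 * fixed point, no "elapsed > period" special case, made-up units.
 */
#include <stdio.h>

#define HZ	1000
#define PERIOD	(3 * HZ)		/* ~3s averaging window */

static double write_bw = 100.0;		/* pages/s, arbitrary start value */
static double avg_bw = 100.0;

static void update(double elapsed, double written)
{
	double bw = written * HZ / elapsed;	/* raw per-interval sample */
	double old = write_bw;

	/* 1st level: running average over PERIOD */
	write_bw = (bw * elapsed + old * (PERIOD - elapsed)) / PERIOD;

	/*
	 * 2nd level: move avg_bw toward the previous value only when the
	 * running average keeps moving away from avg_bw on the same side;
	 * samples that are returning toward avg_bw are ignored.
	 */
	if (avg_bw > old && old >= write_bw)
		avg_bw -= (avg_bw - old) / 8;
	if (avg_bw < old && old <= write_bw)
		avg_bw += (old - avg_bw) / 8;
}

int main(void)
{
	int i;

	for (i = 0; i < 30; i++)	/* steady 200ms completions */
		update(200, 20);	/* 100 pages/s */

	update(600, 180);		/* one big completion: raw sample 300 pages/s */

	for (i = 0; i < 30; i++) {	/* back to the steady stream */
		update(200, 20);
		printf("write_bw %6.1f  avg_bw %6.1f\n", write_bw, avg_bw);
	}
	return 0;
}

With these numbers, write_bandwidth jumps to ~140 on the big completion
and then slowly decays back to 100, while avg_write_bandwidth never
moves at all, because after the jump the running average is always
returning toward it.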

> > > average over several second window so that should really eliminate big
> > > spikes comming from bumpy IO completion from storage (NFS might be a
> > > separate question but we talked about that already). Your second level
> > > of smoothing just spreads the window even further but if you find that
> > > necessary, why not make the window larger in the first place? Also my
> >
> > Because they are two different type of smoothing. I employed the same
> > smoothing as avg_write_bandwidth for bdi_dirty, where I wrote this
> > comment to illustrate its unique usefulness:
>   But is your second level of smoothing really that much different from
> making the window over which we average larger? E.g. if you have 3s window
> and the IO stalls 1s, you will see 33% variation in computed bandwidth.
> But if you had 10s window, you would see only 10% variation.

The 2nd level smoothing can in theory cut the fluctuations roughly in
half even for ext3/4, under the constraint of the same response time.
For example, given the below bdi->write_bandwidth curve, whenever it is
returning to avg_write_bandwidth and the balance point,
avg_write_bandwidth will _stop_ tracking it and stay close to the
balance point. This conditional tracking helps cancel fluctuations
(some concrete numbers follow the diagram below).

                           depart --.             .-- return
                                     \    .-.    /
                           .-.        \  /   \  /        .-.  <--- fluctuating write_bandwidth
  .-.     _     .-.       /   \         /     \         /   \
-/---\---/-\---/---\-----/-----\-------/-------\-------/-----\---- balance point
/     `-'   `-'     \   /       \     /         \     /       \
                     `-'         \   /           \   /         `-'
                                  `-'             `-'                   
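
To put some made-up numbers on the diagram: suppose avg_write_bandwidth
sits at 100 and the running average falls from old=90 to bw=80. That is
a departure below the balance point, so "avg > old && old >= bw" holds
and avg is pulled down by (100-90)/8, i.e. by roughly 1. If the next
sample recovers from old=80 to bw=90, that is a return toward the
balance point, neither condition holds, and avg stays put. Only
sustained moves away from the balance point drag avg_write_bandwidth
along.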

> To get some real numbers I've run simple dd on XFS filesystem and plotted
> basic computed bandwidth and smoothed bandwidth with 3s and 10s window.
> The results are at:
> http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-write.png
> http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-write-10s.png

Nice graphs! 

> You can see that with 10s window basic bandwidth is (unsuprisingly) quite
> closer to your smoothed bandwidth than with 3s window. Of course if the
> variations in throughput are longer in time, the throughput will oscilate
> more even with larger window. But that is the case with your smoothed
> bandwidth as well and it is in fact desirable because as soon as amount of
> data we can write per second is lower for several seconds, we have to
> really consider this and change writeback tunables accordingly. To
> demostrate the changes in smoothed bandwidth, I've run a test where we are
> writing lots of 10MB files to the filesystem and in paralel we read randomly
> 1-1000MB from the filesystem and then sleep for 1-15s. The results (with 3s
> and 10s window) are at:
> http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-read-dd-write.png
> http://beta.suse.com/private/jack/bandwidth-tests/fengguang-dd-read-dd-write-10s.png

Thanks for the disturbance tests -- the response curve with intermittent reads
actually answered one of the questions raised by Vivek :)

However you seem to miss the point of avg_write_bandwidth, explained below,
which I failed to mention in the first place. Sorry for that!

> You can see that both basic and smoothed bandwidth are not that much
> different even with 3s window and with 10s window the differences are
> negligible I'd say.

Yes, that's intended. avg_write_bandwidth should track large workload
changes over long time scales as well as possible.

> So both from my understanding and my experiments, I'd say that the basic
> computation of bandwidth should be enough and if you want to be on a
> smoother side, you can just increase the window size and you will get
> rather similar results as with your second level of smoothing.

avg_write_bandwidth aims to cancel fluctuations within some 1-3 second window.

As you can see from your graphs, it does yield a much smoother curve
at that granularity. Simply increasing write_bandwidth's estimation
window from 3s to 10s is much less effective at providing the smoothness
that avg_write_bandwidth achieves over such seconds-long windows, and
has the cost of larger lags in response to workload changes.

In summary, the bandwidth estimation tries to reflect the real
bandwidth in a timely fashion, while trying to filter out the
_regular_ fluctuations as much as possible. Note that it still tries
to track the _long term_ "fluctuations" that reflect workload changes.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-07-01 18:32   ` Vivek Goyal
@ 2011-07-23  8:02     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-23  8:02 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li, Shaohua, Peter Zijlstra, LKML

[-- Attachment #1: Type: text/plain, Size: 2988 bytes --]

On Sat, Jul 02, 2011 at 02:32:52AM +0800, Vivek Goyal wrote:
> On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> > The estimation value will start from 100MB/s and adapt to the real
> > bandwidth in seconds.
> > 
> > It tries to update the bandwidth only when disk is fully utilized.
> > Any inactive period of more than one second will be skipped.
> > 
> > The estimated bandwidth will be reflecting how fast the device can
> > writeout when _fully utilized_, and won't drop to 0 when it goes idle.
> > The value will remain constant at disk idle time. At busy write time, if
> > not considering fluctuations, it will also remain high unless be knocked
> > down by possible concurrent reads that compete for the disk time and
> > bandwidth with async writes.
> > 
> > The estimation is not done purely in the flusher because there is no
> > guarantee for write_cache_pages() to return timely to update bandwidth.
> > 
> > The bdi->avg_write_bandwidth smoothing is very effective for filtering
> > out sudden spikes, however may be a little biased in long term.
> > 
> > The overheads are low because the bdi bandwidth update only occurs at
> > 200ms intervals.
> > 
> > The 200ms update interval is suitable, because it's not possible to get
> > the real bandwidth for the instance at all, due to large fluctuations.
> > 
> > The NFS commits can be as large as seconds worth of data. One XFS
> > completion may be as large as half second worth of data if we are going
> > to increase the write chunk to half second worth of data. In ext4,
> > fluctuations with time period of around 5 seconds is observed. And there
> > is another pattern of irregular periods of up to 20 seconds on SSD tests.
> > 
> > That's why we are not only doing the estimation at 200ms intervals, but
> > also averaging them over a period of 3 seconds and then go further to do
> > another level of smoothing in avg_write_bandwidth.
> 
> What IO scheduler have you used for testing?

I'm using the default CFQ.

> CFQ now a days almost chokes async requests in presence of lots of
> sync IO.

That's right.

> Have you done some testing with that scenario and see how quickly
> you adjust to that change.

Jan has kindly offered nice graphs on intermixed ASYNC writes and SYNC
read periods.

The attached graphs show another independent experiment of doing 1GB
reads in the middle of a 1-dd test on ext3 with 3GB memory.

The 1GB read takes ~20s to finish. In the meanwhile, the write
throughput drops to around 10MB/s as shown by "iostat 1". In the graph
balance_dirty_pages-bandwidth.png, the red "write bandwidth" and
yellow "avg bandwidth" curves show a similar drop of write throughput,
and the full response curves are around 10s long.

> /me is trying to wrap his head around all the smoothing and bandwidth
> calculation functions. Wished there was more explanation to it.

Please see the other email to Jan. Sorry I should have written that
explanation down in the changelog.

Thanks,
Fengguang

[-- Attachment #2: iostat-bw.png --]
[-- Type: image/png, Size: 72920 bytes --]

[-- Attachment #3: global_dirtied_written.png --]
[-- Type: image/png, Size: 39535 bytes --]

[-- Attachment #4: global_dirty_state.png --]
[-- Type: image/png, Size: 83019 bytes --]

[-- Attachment #5: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 102909 bytes --]

[-- Attachment #6: balance_dirty_pages-bandwidth.png --]
[-- Type: image/png, Size: 68047 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/9] writeback: bdi write bandwidth estimation
  2011-07-01 19:29   ` Vivek Goyal
@ 2011-07-23  8:07     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-07-23  8:07 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Jan Kara, Dave Chinner, Christoph Hellwig,
	Andrew Morton, Li, Shaohua, Peter Zijlstra, LKML

On Sat, Jul 02, 2011 at 03:29:15AM +0800, Vivek Goyal wrote:
> On Wed, Jun 29, 2011 at 10:52:48PM +0800, Wu Fengguang wrote:
> > The estimation value will start from 100MB/s and adapt to the real
> > bandwidth in seconds.
> > 
> > It tries to update the bandwidth only when disk is fully utilized.
> > Any inactive period of more than one second will be skipped.
> 
> Is this piece of code being tested in your graphs? Are there any
> inactive periods which get filtered out in your workload? I am 
> assuming in a continuous dd, we will not have periods of inactivity
> 1 second long.

There are no inactive periods in the dd tests. However I did try
"watch cat /debug/bdi/8:0/stats" while starting some write workloads
in the background. When the writes stop, the bandwidth does stop updating.
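
To illustrate the skip with made-up numbers (not from a real trace): if
the last update stamped bdi->bw_time_stamp at t=0s, the disk then sits
idle, and a new flusher run starts at start_time=5.0s with its first
__bdi_update_bandwidth() call at t=5.2s, then elapsed=5.2s > 1s and
bw_time_stamp lies before start_time, so that call only re-snapshots
written_stamp/bw_time_stamp instead of letting the 5s of idle time
dilute the estimate. After that bw_time_stamp falls within the current
run, so the following 200ms updates proceed normally.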

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2011-07-23  8:07 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-29 14:52 [PATCH 0/9] write bandwidth estimation and writeback fixes v2 Wu Fengguang
2011-06-29 14:52 ` [PATCH 1/9] writeback: make writeback_control.nr_to_write straight Wu Fengguang
2011-06-30 16:24   ` Jan Kara
2011-07-01 12:03     ` Wu Fengguang
2011-06-29 14:52 ` [PATCH 2/9] writeback: account per-bdi accumulated written pages Wu Fengguang
2011-06-29 14:52 ` [PATCH 3/9] writeback: bdi write bandwidth estimation Wu Fengguang
2011-06-30 19:56   ` Jan Kara
2011-07-01 14:58     ` Wu Fengguang
2011-07-04  3:05       ` Wu Fengguang
2011-07-13 23:30       ` Jan Kara
2011-07-23  7:26         ` Wu Fengguang
2011-07-01 15:20   ` Andrea Righi
2011-07-08 11:53     ` Wu Fengguang
2011-07-01 18:32   ` Vivek Goyal
2011-07-23  8:02     ` Wu Fengguang
2011-07-01 19:19   ` Vivek Goyal
2011-07-01 19:29   ` Vivek Goyal
2011-07-23  8:07     ` Wu Fengguang
2011-06-29 14:52 ` [PATCH 4/9] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2011-06-29 14:52 ` [PATCH 5/9] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2011-06-30 17:26   ` Jan Kara
2011-06-29 14:52 ` [PATCH 6/9] writeback: introduce smoothed global dirty limit Wu Fengguang
2011-07-01 15:20   ` Andrea Righi
2011-07-08 11:51     ` Wu Fengguang
2011-06-29 14:52 ` [PATCH 7/9] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
2011-06-29 14:52 ` [PATCH 8/9] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
2011-06-29 14:52 ` [PATCH 9/9] writeback: trace global_dirty_state Wu Fengguang
2011-07-01 15:18   ` Christoph Hellwig
2011-07-01 15:45     ` Wu Fengguang
