linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/13] writeback patches for 2.6.36
@ 2010-08-05 16:10 Wu Fengguang
  2010-08-05 16:10 ` [PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
                   ` (13 more replies)
  0 siblings, 14 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel,
	linux-mm

Andrew,

These are writeback patches intended for 2.6.36.

It combines two previous patchsets:

	writeback cleanups and trivial fixes <http://lkml.org/lkml/2010/7/10/153>
	writeback: try to write older pages first <http://lkml.org/lkml/2010/7/22/47>

changelog:
- removed patch "writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()"
- added patch "writeback: explicit low bound for vm.dirty_ratio"
- use "if (list_empty(&wb->b_io))" directly in "writeback: sync expired inodes first in background writeback"
- fix misplaced chunk for removing more_io in include/trace/events/ext4.h
- update comments in "writeback: fix queue_io() ordering"
- update comments in "writeback: add comment to the dirty limits functions"
- with Jan's sync() livelock avoidance patches, the patch "writeback: try more
  writeback as long as something was written" will no longer livelock sync()

	[PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages()
	[PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds
	[PATCH 03/13] writeback: add comment to the dirty limits functions
	[PATCH 04/13] writeback: dont redirty tail an inode with dirty pages
	[PATCH 05/13] writeback: fix queue_io() ordering
	[PATCH 06/13] writeback: merge for_kupdate and !for_kupdate cases
	[PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
	[PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes()
	[PATCH 09/13] writeback: the kupdate expire timestamp should be a moving target
	[PATCH 10/13] writeback: kill writeback_control.more_io
	[PATCH 11/13] writeback: sync expired inodes first in background writeback
	[PATCH 12/13] writeback: try more writeback as long as something was written
	[PATCH 13/13] writeback: introduce writeback_control.inodes_written

Thanks,
Fengguang



* [PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-05 16:10 ` [PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Jens Axboe, Peter Zijlstra,
	Richard Kennedy, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-less-balance-calc.patch --]
[-- Type: text/plain, Size: 7668 bytes --]

Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.

'perf stat' of simple fio write tests shows the reduction in cache
accesses.  The test is fio 'write,mmap,600Mb,pre_read' on an AMD
AthlonX2 with 3GB memory (dirty_threshold approx 600MB), running each
test 10 times, dropping the fastest & slowest values, then taking the
average & standard deviation:

		average (s.d.) in millions (10^6)
2.6.31-rc8	648.6 (14.6)
+patch		620.1 (16.5)

This reduction is achieved by dropping clip_bdi_dirty_limit(), which
rereads the counters to apply the dirty_threshold, and moving that check
up into balance_dirty_pages(), which has already read the counters.

Rearranging the for loop to contain only one copy of the limit tests
also allows the pdflush test after the loop to use the local copies of
the counters rather than rereading them.

In the common case with no throttling it now calls global_page_state()
5 fewer times and bdi_stat() 2 fewer times.

Fengguang:

This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
accurate, we don't need to clip it routinely; a simple dirty limit check
is enough.

The check is necessary because, in principle, we should throttle
everything calling balance_dirty_pages() when we're over the total
limit, as Peter pointed out.

We now set and clear dirty_exceeded based not only on the bdi dirty
limits, but also on the global dirty limit. The global limit check is
added in place of clip_bdi_dirty_limit() for safety and is not intended
as a behavior change. The bdi limits should be tight enough to keep all
dirty pages under the global limit most of the time; occasional small
excesses should be OK though. The change makes the logic more obvious:
the global limit is the ultimate goal and shall always be imposed.

We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and has to) double check
the state. It reduces overall overhead because the test based on old
state still has a good chance of being right.
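
For illustration, here is a simplified userspace sketch of the combined
check (the helper name and the page counts are made up; the real code
reads the counters via global_page_state()/bdi_stat() as in the diff
below):

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified model of the new throttling condition: the bdi threshold
 * is the "soft" limit, the global threshold is the "hard" last resort.
 */
static bool dirty_exceeded(unsigned long nr_reclaimable,
                           unsigned long nr_writeback,
                           unsigned long bdi_nr_reclaimable,
                           unsigned long bdi_nr_writeback,
                           unsigned long bdi_thresh,
                           unsigned long dirty_thresh)
{
        return bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh ||
               nr_reclaimable + nr_writeback >= dirty_thresh;
}

int main(void)
{
        /* hypothetical page counts: under the bdi limit, over the global one */
        printf("%d\n", dirty_exceeded(115000, 10000, 4000, 1000,
                                      6000, 120000));
        return 0;
}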

[akpm@linux-foundation.org] fix uninitialized dirty_exceeded

CC: Jan Kara <jack@suse.cz>
CC: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   95 ++++++++++++++----------------------------
 1 file changed, 33 insertions(+), 62 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-07-20 12:53:17.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-20 12:54:58.000000000 +0800
@@ -253,32 +253,6 @@ static void bdi_writeout_fraction(struct
 	}
 }
 
-/*
- * Clip the earned share of dirty pages to that which is actually available.
- * This avoids exceeding the total dirty_limit when the floating averages
- * fluctuate too quickly.
- */
-static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
-		unsigned long dirty, unsigned long *pbdi_dirty)
-{
-	unsigned long avail_dirty;
-
-	avail_dirty = global_page_state(NR_FILE_DIRTY) +
-		 global_page_state(NR_WRITEBACK) +
-		 global_page_state(NR_UNSTABLE_NFS) +
-		 global_page_state(NR_WRITEBACK_TEMP);
-
-	if (avail_dirty < dirty)
-		avail_dirty = dirty - avail_dirty;
-	else
-		avail_dirty = 0;
-
-	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
-		bdi_stat(bdi, BDI_WRITEBACK);
-
-	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
-}
-
 static inline void task_dirties_fraction(struct task_struct *tsk,
 		long *numerator, long *denominator)
 {
@@ -469,7 +443,6 @@ get_dirty_limits(unsigned long *pbackgro
 			bdi_dirty = dirty * bdi->max_ratio / 100;
 
 		*pbdi_dirty = bdi_dirty;
-		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
 		task_dirty_limit(current, pbdi_dirty);
 	}
 }
@@ -491,7 +464,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
-
+	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
@@ -509,10 +482,35 @@ static void balance_dirty_pages(struct a
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
 
-		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		/*
+		 * In order to avoid the stacked BDI deadlock we need
+		 * to ensure we accurately count the 'dirty' pages when
+		 * the threshold is low.
+		 *
+		 * Otherwise it would be possible to get thresh+n pages
+		 * reported dirty, even though there are thresh-m pages
+		 * actually dirty; with m+n sitting in the percpu
+		 * deltas.
+		 */
+		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+		} else {
+			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		}
+
+		/*
+		 * The bdi thresh is somehow "soft" limit derived from the
+		 * global "hard" limit. The former helps to prevent heavy IO
+		 * bdi or process from holding back light ones; The latter is
+		 * the last resort safeguard.
+		 */
+		dirty_exceeded =
+			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
+			|| (nr_reclaimable + nr_writeback >= dirty_thresh);
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		if (!dirty_exceeded)
 			break;
 
 		/*
@@ -540,34 +538,10 @@ static void balance_dirty_pages(struct a
 		if (bdi_nr_reclaimable > bdi_thresh) {
 			writeback_inodes_wb(&bdi->wb, &wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
-			get_dirty_limits(&background_thresh, &dirty_thresh,
-				       &bdi_thresh, bdi);
 			trace_wbc_balance_dirty_written(&wbc, bdi);
+			if (pages_written >= write_chunk)
+				break;		/* We've done our duty */
 		}
-
-		/*
-		 * In order to avoid the stacked BDI deadlock we need
-		 * to ensure we accurately count the 'dirty' pages when
-		 * the threshold is low.
-		 *
-		 * Otherwise it would be possible to get thresh+n pages
-		 * reported dirty, even though there are thresh-m pages
-		 * actually dirty; with m+n sitting in the percpu
-		 * deltas.
-		 */
-		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else if (bdi_nr_reclaimable) {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
-		}
-
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-			break;
-		if (pages_written >= write_chunk)
-			break;		/* We've done our duty */
-
 		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
@@ -581,8 +555,7 @@ static void balance_dirty_pages(struct a
 			pause = HZ / 10;
 	}
 
-	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
-			bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
@@ -597,9 +570,7 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
 




* [PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
  2010-08-05 16:10 ` [PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-06 10:14   ` Peter Zijlstra
  2010-08-05 16:10 ` [PATCH 03/13] writeback: add comment to the dirty limits functions Wu Fengguang
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Peter Zijlstra, Dave Chinner,
	Christoph Hellwig, Mel Gorman, Chris Mason, Jens Axboe, Jan Kara,
	linux-fsdevel, linux-mm

[-- Attachment #1: writeback-less-bdi-calc.patch --]
[-- Type: text/plain, Size: 6308 bytes --]

Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
so that the latter can be avoided when under the global dirty background
threshold (which is the normal state for most systems).
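
A toy userspace model of the split and of the intended calling order
(the 10%/20% ratios and the writeout fraction below are made-up
constants; the real functions use the vm.dirty_* sysctls and per-bdi
writeout statistics):

#include <stdio.h>

/* toy global thresholds: 10% background, 20% dirty, of dirtyable memory */
static void global_dirty_limits(unsigned long avail,
                                unsigned long *pbackground,
                                unsigned long *pdirty)
{
        *pbackground = avail / 10;
        *pdirty = avail / 5;
}

/* toy bdi share: proportional to this bdi's fraction of recent writeout */
static unsigned long bdi_dirty_limit(unsigned long dirty,
                                     unsigned long bdi_writeout,
                                     unsigned long total_writeout)
{
        return dirty * bdi_writeout / total_writeout;
}

int main(void)
{
        unsigned long background, dirty;

        /* cheap part, always computed */
        global_dirty_limits(1000000, &background, &dirty);

        /* expensive per-bdi part, only computed when actually throttling */
        printf("bdi share: %lu of %lu dirtyable pages\n",
               bdi_dirty_limit(dirty, 30, 100), dirty);
        return 0;
}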

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    5 +-
 mm/backing-dev.c          |    3 -
 mm/page-writeback.c       |   75 ++++++++++++++++++------------------
 4 files changed, 44 insertions(+), 41 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-07-29 14:34:34.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-03 23:14:19.000000000 +0800
@@ -267,10 +267,11 @@ static inline void task_dirties_fraction
  *
  *   dirty -= (dirty/8) * p_{t}
  */
-static void task_dirty_limit(struct task_struct *tsk, unsigned long *pdirty)
+static unsigned long task_dirty_limit(struct task_struct *tsk,
+				       unsigned long bdi_dirty)
 {
 	long numerator, denominator;
-	unsigned long dirty = *pdirty;
+	unsigned long dirty = bdi_dirty;
 	u64 inv = dirty >> 3;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
@@ -278,10 +279,8 @@ static void task_dirty_limit(struct task
 	do_div(inv, denominator);
 
 	dirty -= inv;
-	if (dirty < *pdirty/2)
-		dirty = *pdirty/2;
 
-	*pdirty = dirty;
+	return max(dirty, bdi_dirty/2);
 }
 
 /*
@@ -391,9 +390,7 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-void
-get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
-		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
 	unsigned long dirty;
@@ -425,26 +422,28 @@ get_dirty_limits(unsigned long *pbackgro
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+}
 
-	if (bdi) {
-		u64 bdi_dirty;
-		long numerator, denominator;
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			       unsigned long dirty)
+{
+	u64 bdi_dirty;
+	long numerator, denominator;
 
-		/*
-		 * Calculate this BDI's share of the dirty ratio.
-		 */
-		bdi_writeout_fraction(bdi, &numerator, &denominator);
+	/*
+	 * Calculate this BDI's share of the dirty ratio.
+	 */
+	bdi_writeout_fraction(bdi, &numerator, &denominator);
 
-		bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
-		bdi_dirty *= numerator;
-		do_div(bdi_dirty, denominator);
-		bdi_dirty += (dirty * bdi->min_ratio) / 100;
-		if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
-			bdi_dirty = dirty * bdi->max_ratio / 100;
+	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
+	bdi_dirty *= numerator;
+	do_div(bdi_dirty, denominator);
 
-		*pbdi_dirty = bdi_dirty;
-		task_dirty_limit(current, pbdi_dirty);
-	}
+	bdi_dirty += (dirty * bdi->min_ratio) / 100;
+	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
+		bdi_dirty = dirty * bdi->max_ratio / 100;
+
+	return bdi_dirty;
 }
 
 /*
@@ -475,13 +474,24 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh,
-				&bdi_thresh, bdi);
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK);
 
+		global_dirty_limits(&background_thresh, &dirty_thresh);
+
+		/*
+		 * Throttle it only when the background writeback cannot
+		 * catch-up. This avoids (excessively) small writeouts
+		 * when the bdi limits are ramping up.
+		 */
+		if (nr_reclaimable + nr_writeback <
+				(background_thresh + dirty_thresh) / 2)
+			break;
+
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+		bdi_thresh = task_dirty_limit(current, bdi_thresh);
+
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
 		 * to ensure we accurately count the 'dirty' pages when
@@ -513,15 +523,6 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			break;
 
-		/*
-		 * Throttle it only when the background writeback cannot
-		 * catch-up. This avoids (excessively) small writeouts
-		 * when the bdi limits are ramping up.
-		 */
-		if (nr_reclaimable + nr_writeback <
-				(background_thresh + dirty_thresh) / 2)
-			break;
-
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
@@ -634,7 +635,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+		global_dirty_limits(&background_thresh, &dirty_thresh);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
--- linux-next.orig/fs/fs-writeback.c	2010-07-29 14:34:34.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-03 23:11:12.000000000 +0800
@@ -594,7 +594,7 @@ static inline bool over_bground_thresh(v
 {
 	unsigned long background_thresh, dirty_thresh;
 
-	get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+	global_dirty_limits(&background_thresh, &dirty_thresh);
 
 	return (global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
--- linux-next.orig/mm/backing-dev.c	2010-07-29 14:34:34.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-08-03 23:11:10.000000000 +0800
@@ -83,7 +83,8 @@ static int bdi_debug_stats_show(struct s
 		nr_more_io++;
 	spin_unlock(&inode_lock);
 
-	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
--- linux-next.orig/include/linux/writeback.h	2010-07-29 14:34:34.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-08-03 23:11:10.000000000 +0800
@@ -124,8 +124,9 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
-void get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
-		      unsigned long *pbdi_dirty, struct backing_dev_info *bdi);
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			       unsigned long dirty);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,




* [PATCH 03/13] writeback: add comment to the dirty limits functions
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
  2010-08-05 16:10 ` [PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
  2010-08-05 16:10 ` [PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-06 10:17   ` Peter Zijlstra
  2010-08-05 16:10 ` [PATCH 04/13] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Christoph Hellwig, Dave Chinner, Jens Axboe,
	Peter Zijlstra, Mel Gorman, Chris Mason, Jan Kara, linux-fsdevel,
	linux-mm

[-- Attachment #1: dirty-limits-comment.patch --]
[-- Type: text/plain, Size: 2650 bytes --]

Document global_dirty_limits(), bdi_dirty_limit() and task_dirty_limit().
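
As a worked example of the task_dirty_limit() scaling documented below,
with made-up numbers (an 80000-page bdi threshold, a heavy dirtier
owning 90% of the recent dirtying and a light one owning 1%):

#include <stdio.h>

int main(void)
{
        unsigned long bdi_dirty = 80000;        /* hypothetical bdi threshold, in pages */
        double p_heavy = 0.90, p_light = 0.01;  /* hypothetical per-task dirty fractions */

        /* dirty -= (dirty/8) * p_{t}, floored at bdi_dirty/2 */
        printf("heavy dirtier limit: %.0f\n", bdi_dirty - bdi_dirty / 8 * p_heavy);
        printf("light dirtier limit: %.0f\n", bdi_dirty - bdi_dirty / 8 * p_light);
        return 0;
}

The heavy dirtier hits its (lower) threshold of 71000 pages first and
gets throttled, which curbs the growth of dirty pages before the light
dirtier (threshold 79900 pages) is ever throttled.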

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-03 23:14:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 00:37:17.000000000 +0800
@@ -261,11 +261,18 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * task_dirty_limit - scale down dirty throttling threshold for one task
  *
  * task specific dirty limit:
  *
  *   dirty -= (dirty/8) * p_{t}
+ *
+ * To protect light/slow dirtying tasks from heavier/fast ones, we start
+ * throttling individual tasks before reaching the bdi dirty limit.
+ * Relatively low thresholds will be allocated to heavy dirtiers. So when
+ * dirty pages grow large, heavy dirtiers will be throttled first, which will
+ * effectively curb the growth of dirty pages. Light dirtiers with high enough
+ * dirty threshold may never get throttled.
  */
 static unsigned long task_dirty_limit(struct task_struct *tsk,
 				       unsigned long bdi_dirty)
@@ -390,6 +397,15 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+/**
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
+ * - vm.dirty_ratio             or  vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
@@ -424,8 +440,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty)
+/**
+ * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ *
+ * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * - starving fast devices
+ * - piling up dirty pages (that will take long time to sync) on slow devices
+ *
+ * The bdi's share of dirty limit will be adapting to its throughput and
+ * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;




* [PATCH 04/13] writeback: dont redirty tail an inode with dirty pages
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-08-05 16:10 ` [PATCH 03/13] writeback: add comment to the dirty limits functions Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-05 16:10 ` [PATCH 05/13] writeback: fix queue_io() ordering Wu Fengguang
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, David Howells, Dave Chinner,
	Christoph Hellwig, Jan Kara, Mel Gorman, Chris Mason, Jens Axboe,
	Peter Zijlstra, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-xfs-fast-redirty.patch --]
[-- Type: text/plain, Size: 2797 bytes --]

This extends commit b3af9468ae (writeback: don't delay inodes redirtied
by a fast dirtier) to the !kupdate case.

It also simplifies the logic. Note that the I_DIRTY_PAGES test/handling
is merged into the PAGECACHE_TAG_DIRTY case.  I_DIRTY_PAGES (at the line
removed by this patch) means _new_ pages got dirtied during writeback,
while PAGECACHE_TAG_DIRTY means there are dirty pages. In this sense,
the PAGECACHE_TAG_DIRTY test covers the I_DIRTY_PAGES case.

In *_set_page_dirty*(), PAGECACHE_TAG_DIRTY is set racelessly, while
I_DIRTY_PAGES might still be set on the inode for a page just truncated.
That has no real impact on this patch -- the behavior is actually
slightly better now.

afs_fsync() always sets I_DIRTY_PAGES after calling afs_writepages(),
maybe to keep the inode in the dirty list. That's a different code path,
so it won't permanently impact the requeue-vs-redirty timing here.
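
The requeue decision after this patch, written out as a standalone
sketch (the enum and parameter names are invented for illustration; the
real code tests inode->i_state, the mapping's dirty tag and i_count as
in the diff below):

#include <stdbool.h>

enum requeue_action { REQUEUE_MORE_IO, REDIRTY_TAIL, LIST_INUSE, LIST_UNUSED };

/*
 * Decision order after writing an inode, per this patch: remaining
 * dirty pages are handled first, then the metadata-only dirty case,
 * then the clean cases.
 */
enum requeue_action after_writeback(bool pages_still_dirty, bool for_kupdate,
                                    long nr_to_write, bool metadata_dirty,
                                    bool inode_in_use)
{
        if (pages_still_dirty) {                /* now checked before I_DIRTY */
                if (for_kupdate)
                        return nr_to_write <= 0 ? REQUEUE_MORE_IO : REDIRTY_TAIL;
                return REDIRTY_TAIL;            /* unified with kupdate by a later patch */
        }
        if (metadata_dirty)                     /* e.g. XFS redirtying during writeback */
                return REDIRTY_TAIL;
        return inode_in_use ? LIST_INUSE : LIST_UNUSED;
}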

CC: David Howells <dhowells@redhat.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-12 23:26:06.000000000 +0800
@@ -367,18 +367,7 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
-		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
-			/*
-			 * More pages get dirtied by a fast dirtier.
-			 */
-			goto select_queue;
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * At least XFS will redirty the inode during the
-			 * writeback (delalloc) and on io completion (isize).
-			 */
-			redirty_tail(inode);
-		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -400,7 +389,6 @@ writeback_single_inode(struct inode *ino
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-select_queue:
 				if (wbc->nr_to_write <= 0) {
 					/*
 					 * slice used up: queue for next turn
@@ -423,6 +411,14 @@ select_queue:
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * Filesystems can dirty the inode during writeback
+			 * operations, such as delayed allocation during
+			 * submission or metadata updates after data IO
+			 * completion.
+			 */
+			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse




* [PATCH 05/13] writeback: fix queue_io() ordering
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-08-05 16:10 ` [PATCH 04/13] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-05 16:10 ` [PATCH 06/13] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel, linux-mm

[-- Attachment #1: queue_io-fix.patch --]
[-- Type: text/plain, Size: 1323 bytes --]

The splice ordering in queue_io() was not a bug so far, since b_io is
always empty for kupdate writeback. The next patch will do requeue_io()
for non-kupdate writeback as well, so fix the ordering now.
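
A toy illustration of the resulting order (single letters stand for
inodes as in the diagram added by the patch below; dequeue for IO
happens from the tail of b_io):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char b_io[32] = "edc";          /* leftover from the previous run */
        const char *b_more_io = "BA";   /* requeued, partially written inodes */
        const char *expired = "f";      /* newly expired, moved from b_dirty */
        char next[32];

        /*
         * Old code spliced b_more_io to the tail of b_io, giving "edcBA",
         * so the requeued BA would be serviced before the leftover edc.
         * New code splices it to the head, then prepends the newly
         * expired inodes:
         */
        snprintf(next, sizeof(next), "%s%s%s", expired, b_more_io, b_io);
        printf("b_io after queue_io(): %s (dequeued from the tail: '%c' first)\n",
               next, next[strlen(next) - 1]);
        return 0;
}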

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:03:37.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:21:43.000000000 +0800
@@ -253,10 +253,18 @@ static void move_expired_inodes(struct l
 
 /*
  * Queue all expired dirty inodes for io, eldest first.
+ * Before
+ *         newly dirtied     b_dirty    b_io    b_more_io
+ *         =============>    gf         edc     BA
+ * After
+ *         newly dirtied     b_dirty    b_io    b_more_io
+ *         =============>    g          fBAedc
+ *                                           |
+ *                                           +--> dequeue for IO
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	list_splice_init(&wb->b_more_io, wb->b_io.prev);
+	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
 




* [PATCH 06/13] writeback: merge for_kupdate and !for_kupdate cases
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (4 preceding siblings ...)
  2010-08-05 16:10 ` [PATCH 05/13] writeback: fix queue_io() ordering Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-05 16:10 ` [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio Wu Fengguang
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-merge-kupdate-cases.patch --]
[-- Type: text/plain, Size: 3908 bytes --]

Unify the logic for kupdate and non-kupdate cases.
There won't be starvation because the inodes requeued into b_more_io
will later be spliced _after_ the remaining inodes in b_io, hence won't
stand in the way of other inodes in the next run.

This avoids unnecessary redirty_tail() calls, and hence updates of
i_dirtied_when. The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or may exclude the inode
from the data integrity sync operation (which checks the timestamp to
avoid extra work and livelocks).
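
The dirtied_when effect in a nutshell (a simplified userspace model;
the field and helper names follow the kernel's, but this is not the
real implementation, which only refreshes the timestamp when needed to
keep b_dirty time-ordered):

#include <stdio.h>

struct toy_inode {
        unsigned long dirtied_when;     /* when the inode was first dirtied */
};

/* requeue_io(): move to b_more_io, dirtied_when stays untouched */
static void requeue_io(struct toy_inode *inode, unsigned long now)
{
        (void)inode;
        (void)now;
}

/* redirty_tail(): may refresh dirtied_when, so a periodic or integrity
 * pass that only takes inodes older than its cut-off can skip it */
static void redirty_tail(struct toy_inode *inode, unsigned long now)
{
        inode->dirtied_when = now;
}

int main(void)
{
        struct toy_inode a = { .dirtied_when = 1000 };
        struct toy_inode b = { .dirtied_when = 1000 };

        requeue_io(&a, 5000);
        redirty_tail(&b, 5000);
        printf("requeue_io():   dirtied_when=%lu\n", a.dirtied_when); /* still 1000 */
        printf("redirty_tail(): dirtied_when=%lu\n", b.dirtied_when); /* now 5000 */
        return 0;
}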

===
How the redirty_tail() came about:

It is a long story.  This redirty_tail() was introduced together with
wbc.more_io.  The initial patch for more_io did not have the
redirty_tail(), and when it was merged, several 100% iowait bug reports
arose:

reiserfs:
        http://lkml.org/lkml/2007/10/23/93

jfs:    
        commit 29a424f28390752a4ca2349633aaacc6be494db5
        JFS: clear PAGECACHE_TAG_DIRTY for no-write pages
        
ext2:   
        http://www.spinics.net/linux/lists/linux-ext4/msg04762.html

They are all old bugs hidden in various filesystems that became
"visible" with the more_io patch. At the time, the ext2 bug was thought
to be "trivial", so it was not fixed. Instead the following updated
more_io patch with redirty_tail() was merged:

	http://www.spinics.net/linux/lists/linux-ext4/msg04507.html

This will in general prevent the 100% iowait on ext2 and possibly other
unknown filesystem bugs.

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   43 ++++++++++---------------------------------
 1 file changed, 10 insertions(+), 33 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:21:43.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:22:02.000000000 +0800
@@ -378,45 +378,22 @@ writeback_single_inode(struct inode *ino
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from b_io onto b_more_io/b_dirty.
+			 * sometimes bales out without doing anything.
 			 */
-			/*
-			 * akpm: if the caller was the kupdate function we put
-			 * this inode at the head of b_dirty so it gets first
-			 * consideration.  Otherwise, move it to the tail, for
-			 * the reasons described there.  I'm not really sure
-			 * how much sense this makes.  Presumably I had a good
-			 * reasons for doing it this way, and I'd rather not
-			 * muck with it at present.
-			 */
-			if (wbc->for_kupdate) {
+			inode->i_state |= I_DIRTY_PAGES;
+			if (wbc->nr_to_write <= 0) {
 				/*
-				 * For the kupdate function we move the inode
-				 * to b_more_io so it will get more writeout as
-				 * soon as the queue becomes uncongested.
+				 * slice used up: queue for next turn
 				 */
-				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
-					/*
-					 * slice used up: queue for next turn
-					 */
-					requeue_io(inode);
-				} else {
-					/*
-					 * somehow blocked: retry later
-					 */
-					redirty_tail(inode);
-				}
+				requeue_io(inode);
 			} else {
 				/*
-				 * Otherwise fully redirty the inode so that
-				 * other inodes on this superblock will get some
-				 * writeout.  Otherwise heavy writing to one
-				 * file would indefinitely suspend writeout of
-				 * all the other files.
+				 * Writeback blocked by something other than
+				 * congestion. Delay the inode for some time to
+				 * avoid spinning on the CPU (100% iowait)
+				 * retrying writeback of the dirty page/inode
+				 * that cannot be performed immediately.
 				 */
-				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
 		} else if (inode->i_state & I_DIRTY) {




* [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (5 preceding siblings ...)
  2010-08-05 16:10 ` [PATCH 06/13] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-05 23:34   ` Andrew Morton
  2010-08-05 16:10 ` [PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Peter Zijlstra, Dave Chinner,
	Christoph Hellwig, Mel Gorman, Chris Mason, Jens Axboe, Jan Kara,
	linux-fsdevel, linux-mm

[-- Attachment #1: min-dirty-ratio.patch --]
[-- Type: text/plain, Size: 1999 bytes --]

Force a user-visible lower bound of 5% for the vm.dirty_ratio interface.

Currently global_dirty_limits() applies a lower bound of 5% to
vm_dirty_ratio.  This is not user visible -- if the user sets
vm.dirty_ratio=1, the write appears to succeed, but the value will be
rounded up to 5% when used.

Another problem is inconsistency: calc_period_shift() uses the plain
vm_dirty_ratio value, which may be a problem when vm.dirty_ratio is set
below 5 by the user.
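
A small example of the inconsistency being fixed (the memory size is
made up; before this patch the two users of vm_dirty_ratio saw
different effective values):

#include <stdio.h>

int main(void)
{
        int vm_dirty_ratio = 1;                         /* what the user wrote via sysctl */
        unsigned long available_memory = 1000000;       /* pages, hypothetical */

        /* global_dirty_limits() silently rounded the ratio up to 5% ... */
        int effective = vm_dirty_ratio < 5 ? 5 : vm_dirty_ratio;
        printf("dirty threshold: %lu pages (%d%%)\n",
               available_memory * effective / 100, effective);

        /* ... while calc_period_shift() kept using the raw value */
        printf("period calculation used: %d%%\n", vm_dirty_ratio);
        return 0;
}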

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 kernel/sysctl.c     |    3 ++-
 mm/page-writeback.c |   10 ++--------
 2 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/kernel/sysctl.c	2010-08-05 22:48:34.000000000 +0800
+++ linux-next/kernel/sysctl.c	2010-08-05 22:48:47.000000000 +0800
@@ -126,6 +126,7 @@ static int ten_thousand = 10000;
 
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
+static int dirty_ratio_min = 5;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -1031,7 +1032,7 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
 		.proc_handler	= dirty_ratio_handler,
-		.extra1		= &zero,
+		.extra1		= &dirty_ratio_min,
 		.extra2		= &one_hundred,
 	},
 	{
--- linux-next.orig/mm/page-writeback.c	2010-08-05 22:48:42.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 22:48:47.000000000 +0800
@@ -415,14 +415,8 @@ void global_dirty_limits(unsigned long *
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
-	else {
-		int dirty_ratio;
-
-		dirty_ratio = vm_dirty_ratio;
-		if (dirty_ratio < 5)
-			dirty_ratio = 5;
-		dirty = (dirty_ratio * available_memory) / 100;
-	}
+	else
+		dirty = (vm_dirty_ratio * available_memory) / 100;
 
 	if (dirty_background_bytes)
 		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);




* [PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes()
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (6 preceding siblings ...)
  2010-08-05 16:10 ` [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio Wu Fengguang
@ 2010-08-05 16:10 ` Wu Fengguang
  2010-08-05 16:11 ` [PATCH 09/13] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, Chris Mason, Jens Axboe, Peter Zijlstra,
	linux-fsdevel, linux-mm

[-- Attachment #1: writeback-pass-wbc-to-queue_io.patch --]
[-- Type: text/plain, Size: 2533 bytes --]

This is to prepare for moving the dirty expire policy to move_expired_inodes().
No behavior change.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:28:18.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:28:19.000000000 +0800
@@ -213,8 +213,8 @@ static bool inode_dirtied_after(struct i
  * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
  */
 static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
+				struct list_head *dispatch_queue,
+				struct writeback_control *wbc)
 {
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
@@ -224,8 +224,8 @@ static void move_expired_inodes(struct l
 
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (older_than_this &&
-		    inode_dirtied_after(inode, *older_than_this))
+		if (wbc->older_than_this &&
+		    inode_dirtied_after(inode, *wbc->older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -262,10 +262,10 @@ static void move_expired_inodes(struct l
  *                                           |
  *                                           +--> dequeue for IO
  */
-static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, struct writeback_control *wbc)
 {
 	list_splice_init(&wb->b_more_io, &wb->b_io);
-	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, wbc);
 }
 
 static int write_inode(struct inode *inode, struct writeback_control *wbc)
@@ -528,7 +528,7 @@ void writeback_inodes_wb(struct bdi_writ
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = list_entry(wb->b_io.prev,
@@ -557,7 +557,7 @@ static void __writeback_inodes_sb(struct
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
 	if (!wbc->for_kupdate || list_empty(&wb->b_io))
-		queue_io(wb, wbc->older_than_this);
+		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);
 }




* [PATCH 09/13] writeback: the kupdate expire timestamp should be a moving target
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (7 preceding siblings ...)
  2010-08-05 16:10 ` [PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
@ 2010-08-05 16:11 ` Wu Fengguang
  2010-08-05 16:11 ` [PATCH 10/13] writeback: kill writeback_control.more_io Wu Fengguang
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Itaru Kitayama,
	Dave Chinner, Christoph Hellwig, Chris Mason, Jens Axboe,
	Peter Zijlstra, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-remove-older_than_this.patch --]
[-- Type: text/plain, Size: 7118 bytes --]

Dynamically compute the dirty expire timestamp at queue_io() time.
Also remove writeback_control.older_than_this, which is no longer used.

writeback_control.older_than_this used to be determined once, at
entrance to the kupdate writeback work. This _static_ timestamp may go
stale if the kupdate work runs on and on. The flusher may then get stuck
with some old busy inodes, never considering newly expired inodes
thereafter.

This has two possible problems:

- It is unfair for a large dirty inode to delay (for a long time) the
  writeback of small dirty inodes.

- As time goes by, the large and busy dirty inode may contain only
  _freshly_ dirtied pages. Ignoring newly expired dirty inodes risks
  letting the expired dirty pages reach the end of the LRU lists,
  triggering the evil pageout(). Nevertheless this patch merely
  addresses part of the problem.
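
A sketch of the difference (userspace model; the jiffies values and the
30-second dirty_expire_interval are stand-ins):

#include <stdio.h>

static unsigned long jiffies;
static unsigned long static_cutoff;     /* old: frozen when the work starts */

/* new: recomputed every time inodes are queued for IO */
static unsigned long moving_cutoff(unsigned long expire_interval)
{
        return jiffies - expire_interval;
}

int main(void)
{
        unsigned long expire = 30 * 100;        /* 30s at HZ=100, the default expiry */

        jiffies = 100000;
        static_cutoff = jiffies - expire;

        /* an hour into a long-running work, only the moving target still
         * admits inodes that expired in the meantime */
        jiffies += 360000;
        printf("static cut-off: %lu, moving cut-off: %lu\n",
               static_cutoff, moving_cutoff(expire));
        return 0;
}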

[kitayama@cl.bb4u.ne.jp] fix btrfs and ext4 references

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/extent_io.c             |    2 --
 fs/fs-writeback.c                |   24 +++++++++---------------
 include/linux/writeback.h        |    2 --
 include/trace/events/writeback.h |    6 +-----
 mm/backing-dev.c                 |    1 -
 mm/page-writeback.c              |    1 -
 6 files changed, 10 insertions(+), 26 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:28:29.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:28:29.000000000 +0800
@@ -216,16 +216,23 @@ static void move_expired_inodes(struct l
 				struct list_head *dispatch_queue,
 				struct writeback_control *wbc)
 {
+	unsigned long expire_interval = 0;
+	unsigned long older_than_this;
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
+	if (wbc->for_kupdate) {
+		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+		older_than_this = jiffies - expire_interval;
+	}
+
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
-		if (wbc->older_than_this &&
-		    inode_dirtied_after(inode, *wbc->older_than_this))
+		if (expire_interval &&
+		    inode_dirtied_after(inode, older_than_this))
 			break;
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
@@ -592,29 +599,19 @@ static inline bool over_bground_thresh(v
  * Try to run once per dirty_writeback_interval.  But if a writeback event
  * takes longer than a dirty_writeback_interval interval, then leave a
  * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
  */
 static long wb_writeback(struct bdi_writeback *wb,
 			 struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
-		.older_than_this	= NULL,
 		.for_kupdate		= work->for_kupdate,
 		.for_background		= work->for_background,
 		.range_cyclic		= work->range_cyclic,
 	};
-	unsigned long oldest_jif;
 	long wrote = 0;
 	struct inode *inode;
 
-	if (wbc.for_kupdate) {
-		wbc.older_than_this = &oldest_jif;
-		oldest_jif = jiffies -
-				msecs_to_jiffies(dirty_expire_interval * 10);
-	}
 	if (!wbc.range_cyclic) {
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
@@ -1007,9 +1004,6 @@ EXPORT_SYMBOL(__mark_inode_dirty);
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
  *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
  * This function assumes that the blockdev superblock's inodes are backed by
  * a variety of queues, so all inodes are searched.  For other superblocks,
--- linux-next.orig/include/linux/writeback.h	2010-08-05 23:28:29.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-08-05 23:28:29.000000000 +0800
@@ -28,8 +28,6 @@ enum writeback_sync_modes {
  */
 struct writeback_control {
 	enum writeback_sync_modes sync_mode;
-	unsigned long *older_than_this;	/* If !NULL, only write back inodes
-					   older than this */
 	unsigned long wb_start;         /* Time writeback_inodes_wb was
 					   called. This is needed to avoid
 					   extra jobs and livelock */
--- linux-next.orig/include/trace/events/writeback.h	2010-08-05 23:28:17.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-08-05 23:28:29.000000000 +0800
@@ -101,7 +101,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
 		__field(int, more_io)
-		__field(unsigned long, older_than_this)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -116,14 +115,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
 		__entry->more_io	= wbc->more_io;
-		__entry->older_than_this = wbc->older_than_this ?
-						*wbc->older_than_this : 0;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d older=0x%lx "
+		"bgrd=%d reclm=%d cyclic=%d more=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -134,7 +131,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
 		__entry->more_io,
-		__entry->older_than_this,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/mm/page-writeback.c	2010-08-05 23:28:29.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 23:28:29.000000000 +0800
@@ -488,7 +488,6 @@ static void balance_dirty_pages(struct a
 	for (;;) {
 		struct writeback_control wbc = {
 			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
 			.nr_to_write	= write_chunk,
 			.range_cyclic	= 1,
 		};
--- linux-next.orig/mm/backing-dev.c	2010-08-05 23:28:29.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-08-05 23:28:29.000000000 +0800
@@ -271,7 +271,6 @@ static void bdi_flush_io(struct backing_
 {
 	struct writeback_control wbc = {
 		.sync_mode		= WB_SYNC_NONE,
-		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 		.nr_to_write		= 1024,
 	};
--- linux-next.orig/fs/btrfs/extent_io.c	2010-08-05 23:28:17.000000000 +0800
+++ linux-next/fs/btrfs/extent_io.c	2010-08-05 23:28:29.000000000 +0800
@@ -2595,7 +2595,6 @@ int extent_write_full_page(struct extent
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= wbc->sync_mode,
-		.older_than_this = NULL,
 		.nr_to_write	= 64,
 		.range_start	= page_offset(page) + PAGE_CACHE_SIZE,
 		.range_end	= (loff_t)-1,
@@ -2628,7 +2627,6 @@ int extent_write_locked_range(struct ext
 	};
 	struct writeback_control wbc_writepages = {
 		.sync_mode	= mode,
-		.older_than_this = NULL,
 		.nr_to_write	= nr_pages * 2,
 		.range_start	= start,
 		.range_end	= end + 1,




* [PATCH 10/13] writeback: kill writeback_control.more_io
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (8 preceding siblings ...)
  2010-08-05 16:11 ` [PATCH 09/13] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
@ 2010-08-05 16:11 ` Wu Fengguang
  2010-08-05 16:11 ` [PATCH 11/13] writeback: sync expired inodes first in background writeback Wu Fengguang
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Minchan Kim,
	Dave Chinner, Christoph Hellwig, Chris Mason, Jens Axboe,
	Peter Zijlstra, linux-fsdevel, linux-mm

[-- Attachment #1: writeback-kill-more_io.patch --]
[-- Type: text/plain, Size: 4248 bytes --]

When wbc.more_io was first introduced, it indicated whether there was
at least one superblock whose s_more_io list contained more IO work.
Now with per-bdi writeback, it can be replaced with a simple test for a
non-empty b_more_io list.

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |    9 ++-------
 include/linux/writeback.h        |    1 -
 include/trace/events/ext4.h      |    5 +----
 include/trace/events/writeback.h |    5 +----
 4 files changed, 4 insertions(+), 16 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:28:10.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:28:11.000000000 +0800
@@ -516,12 +516,8 @@ static int writeback_sb_inodes(struct su
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
-		if (wbc->nr_to_write <= 0) {
-			wbc->more_io = 1;
+		if (wbc->nr_to_write <= 0)
 			return 1;
-		}
-		if (!list_empty(&wb->b_more_io))
-			wbc->more_io = 1;
 	}
 	/* b_io is empty */
 	return 1;
@@ -631,7 +627,6 @@ static long wb_writeback(struct bdi_writ
 		if (work->for_background && !over_bground_thresh())
 			break;
 
-		wbc.more_io = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
 
@@ -653,7 +648,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Didn't write everything and we don't have more IO, bail
 		 */
-		if (!wbc.more_io)
+		if (list_empty(&wb->b_more_io))
 			break;
 		/*
 		 * Did we write something? Try for more
--- linux-next.orig/include/linux/writeback.h	2010-08-05 23:28:10.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-08-05 23:28:11.000000000 +0800
@@ -49,7 +49,6 @@ struct writeback_control {
 	unsigned for_background:1;	/* A background writeback */
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
--- linux-next.orig/include/trace/events/writeback.h	2010-08-05 23:28:10.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2010-08-05 23:28:11.000000000 +0800
@@ -100,7 +100,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__field(int, for_background)
 		__field(int, for_reclaim)
 		__field(int, range_cyclic)
-		__field(int, more_io)
 		__field(long, range_start)
 		__field(long, range_end)
 	),
@@ -114,13 +113,12 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background	= wbc->for_background;
 		__entry->for_reclaim	= wbc->for_reclaim;
 		__entry->range_cyclic	= wbc->range_cyclic;
-		__entry->more_io	= wbc->more_io;
 		__entry->range_start	= (long)wbc->range_start;
 		__entry->range_end	= (long)wbc->range_end;
 	),
 
 	TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
-		"bgrd=%d reclm=%d cyclic=%d more=%d "
+		"bgrd=%d reclm=%d cyclic=%d "
 		"start=0x%lx end=0x%lx",
 		__entry->name,
 		__entry->nr_to_write,
@@ -130,7 +128,6 @@ DECLARE_EVENT_CLASS(wbc_class,
 		__entry->for_background,
 		__entry->for_reclaim,
 		__entry->range_cyclic,
-		__entry->more_io,
 		__entry->range_start,
 		__entry->range_end)
 )
--- linux-next.orig/include/trace/events/ext4.h	2010-08-05 23:27:33.000000000 +0800
+++ linux-next/include/trace/events/ext4.h	2010-08-05 23:28:11.000000000 +0800
@@ -305,7 +305,6 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__field(	int,	ret			)
 		__field(	int,	pages_written		)
 		__field(	long,	pages_skipped		)
-		__field(	char,	more_io			)	
 		__field(       pgoff_t,	writeback_index		)
 	),
 
@@ -315,15 +314,13 @@ TRACE_EVENT(ext4_da_writepages_result,
 		__entry->ret		= ret;
 		__entry->pages_written	= pages_written;
 		__entry->pages_skipped	= wbc->pages_skipped;
-		__entry->more_io	= wbc->more_io;
 		__entry->writeback_index = inode->i_mapping->writeback_index;
 	),
 
-	TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld more_io %d writeback_index %lu",
+       TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld writeback_index %lu",
 		  jbd2_dev_to_name(__entry->dev),
 		  (unsigned long) __entry->ino, __entry->ret,
 		  __entry->pages_written, __entry->pages_skipped,
-		  __entry->more_io,
 		  (unsigned long) __entry->writeback_index)
 );
 




* [PATCH 11/13] writeback: sync expired inodes first in background writeback
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (9 preceding siblings ...)
  2010-08-05 16:11 ` [PATCH 10/13] writeback: kill writeback_control.more_io Wu Fengguang
@ 2010-08-05 16:11 ` Wu Fengguang
  2010-08-05 16:11 ` [PATCH 12/13] writeback: try more writeback as long as something was written Wu Fengguang
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Jan Kara, Mel Gorman, Dave Chinner,
	Christoph Hellwig, Chris Mason, Jens Axboe, Peter Zijlstra,
	linux-fsdevel, linux-mm

[-- Attachment #1: writeback-expired-for-background.patch --]
[-- Type: text/plain, Size: 2868 bytes --]

A background flush work may run forever, so it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.

The policy is
- enqueue all newly expired inodes at each queue_io() time
- enqueue all dirty inodes if there are no more expired inodes to sync

This will help reduce the number of dirty pages encountered by page
reclaim, eg. the pageout() calls. Normally older inodes contain older
dirty pages, which are closer to the end of the LRU lists. So syncing
older inodes first helps reduce the number of dirty pages reached by
the page reclaim code.
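
A condensed userspace model of the selection policy (ages are in
seconds, oldest inode first, mirroring the tail of b_dirty; the real
code walks the inode list and only relaxes the expiry for background
writeback, not for kupdate):

#include <stdio.h>

static int pick_for_background(const int *age, int n, int expire)
{
        int picked = 0;

        for (int i = 0; i < n; i++) {
                if (expire && age[i] < expire) {
                        /* nothing expired queued yet: fall back to all dirty inodes */
                        if (picked == 0)
                                expire = 0;
                        else
                                break;
                }
                picked++;
        }
        return picked;
}

int main(void)
{
        int some_expired[] = { 120, 45, 31, 2 };        /* three inodes past a 30s expiry */
        int none_expired[] = { 25, 10, 3 };

        printf("%d inodes queued\n", pick_for_background(some_expired, 4, 30)); /* 3 */
        printf("%d inodes queued\n", pick_for_background(none_expired, 3, 30)); /* 3: all */
        return 0;
}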

CC: Jan Kara <jack@suse.cz>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:28:35.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:30:27.000000000 +0800
@@ -217,14 +217,14 @@ static void move_expired_inodes(struct l
 				struct writeback_control *wbc)
 {
 	unsigned long expire_interval = 0;
-	unsigned long older_than_this;
+	unsigned long older_than_this = 0; /* reset to kill gcc warning */
 	LIST_HEAD(tmp);
 	struct list_head *pos, *node;
 	struct super_block *sb = NULL;
 	struct inode *inode;
 	int do_sb_sort = 0;
 
-	if (wbc->for_kupdate) {
+	if (wbc->for_kupdate || wbc->for_background) {
 		expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
 		older_than_this = jiffies - expire_interval;
 	}
@@ -232,8 +232,20 @@ static void move_expired_inodes(struct l
 	while (!list_empty(delaying_queue)) {
 		inode = list_entry(delaying_queue->prev, struct inode, i_list);
 		if (expire_interval &&
-		    inode_dirtied_after(inode, older_than_this))
+		    inode_dirtied_after(inode, older_than_this)) {
+			/*
+			 * background writeback will start with expired inodes,
+			 * and then fresh inodes. This order helps reduce the
+			 * number of dirty pages reaching the end of LRU lists
+			 * and cause trouble to the page reclaim.
+			 */
+			if (wbc->for_background &&
+			    list_empty(dispatch_queue) && list_empty(&tmp)) {
+				expire_interval = 0;
+				continue;
+			}
 			break;
+		}
 		if (sb && sb != inode->i_sb)
 			do_sb_sort = 1;
 		sb = inode->i_sb;
@@ -530,7 +542,8 @@ void writeback_inodes_wb(struct bdi_writ
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 
 	while (!list_empty(&wb->b_io)) {
@@ -559,7 +572,7 @@ static void __writeback_inodes_sb(struct
 
 	wbc->wb_start = jiffies; /* livelock avoidance */
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+	if (list_empty(&wb->b_io))
 		queue_io(wb, wbc);
 	writeback_sb_inodes(sb, wb, wbc, true);
 	spin_unlock(&inode_lock);




* [PATCH 12/13] writeback: try more writeback as long as something was written
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (10 preceding siblings ...)
  2010-08-05 16:11 ` [PATCH 11/13] writeback: sync expired inodes first in background writeback Wu Fengguang
@ 2010-08-05 16:11 ` Wu Fengguang
  2010-08-05 17:00   ` Jan Kara
  2010-08-05 16:11 ` [PATCH 13/13] writeback: introduce writeback_control.inodes_written Wu Fengguang
  2010-08-05 23:08 ` [PATCH 00/13] writeback patches for 2.6.36 Andrew Morton
  13 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-background-retry.patch --]
[-- Type: text/plain, Size: 1915 bytes --]

writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive: they
only populate b_io when necessary, at entry time. Once the queued set
of inodes has all been synced, they simply return, possibly with
wbc.nr_to_write > 0.

For kupdate and background writeback, more eligible inodes may be
sitting in b_dirty by the time the current set of b_io inodes is
completed. So it is necessary to try another round of writeback as long
as some progress was made in this round. Once there are no more eligible
inodes, queue_io() will enqueue nothing, hence nothing will be synced
and we can safely bail.
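
Condensed, the reordered checks in wb_writeback() look like this (a
sketch of the diff below, not standalone code):

	for (;;) {
		/* ... write up to MAX_WRITEBACK_PAGES pages ... */

		/* some progress in this batch: queue and write the next one */
		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
			continue;

		/* no progress and nothing left queued for IO: bail */
		if (list_empty(&wb->b_more_io))
			break;

		/* no progress but inodes are waiting: block on an inode */
	}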

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:30:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:30:45.000000000 +0800
@@ -654,20 +654,23 @@ static long wb_writeback(struct bdi_writ
 		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 
 		/*
-		 * If we consumed everything, see if we have more
+		 * Did we write something? Try for more
+		 *
+		 * This is needed _before_ the b_more_io test because the
+		 * background writeback moves inodes to b_io and works on
+		 * them in batches (in order to sync old pages first).  The
+		 * completion of the current batch does not necessarily mean
+		 * the overall work is done.
 		 */
-		if (wbc.nr_to_write <= 0)
+		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 			continue;
+
 		/*
-		 * Didn't write everything and we don't have more IO, bail
+		 * Nothing written and no more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
-		/*
-		 * Did we write something? Try for more
-		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
-			continue;
+
 		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 13/13] writeback: introduce writeback_control.inodes_written
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (11 preceding siblings ...)
  2010-08-05 16:11 ` [PATCH 12/13] writeback: try more writeback as long as something was written Wu Fengguang
@ 2010-08-05 16:11 ` Wu Fengguang
  2010-08-05 23:08 ` [PATCH 00/13] writeback patches for 2.6.36 Andrew Morton
  13 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, LKML, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel,
	linux-mm

[-- Attachment #1: writeback-inodes_written.patch --]
[-- Type: text/plain, Size: 1843 bytes --]

Introduce writeback_control.inodes_written to count successful
->write_inode() calls.  A non-zero value means some progress was made
on writeback, in which case another round of writeback will be tried.

This prevents aborting a background writeback work prematurely when
the current set of inodes queued for IO happens to be metadata-only
dirty.
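
Condensed, the change amounts to (a sketch of the two hunks below, not
standalone code):

	/* writeback_single_inode(): count inodes whose metadata was written */
	err = write_inode(inode, wbc);
	if (!err)
		wbc->inodes_written++;

	/* wb_writeback() loop: metadata-only progress also earns another round */
	if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
		continue;
	if (wbc.inodes_written)
		continue;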

Acked-by: Mel Gorman <mel@csn.ul.ie> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    5 +++++
 include/linux/writeback.h |    1 +
 2 files changed, 6 insertions(+)

--- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:30:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-08-05 23:38:55.000000000 +0800
@@ -389,6 +389,8 @@ writeback_single_inode(struct inode *ino
 		int err = write_inode(inode, wbc);
 		if (ret == 0)
 			ret = err;
+		if (!err)
+			wbc->inodes_written++;
 	}
 
 	spin_lock(&inode_lock);
@@ -642,6 +644,7 @@ static long wb_writeback(struct bdi_writ
 
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
+		wbc.inodes_written = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
 		if (work->sb)
@@ -664,6 +667,8 @@ static long wb_writeback(struct bdi_writ
 		 */
 		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 			continue;
+		if (wbc.inodes_written)
+			continue;
 
 		/*
 		 * Nothing written and no more inodes for IO, bail
--- linux-next.orig/include/linux/writeback.h	2010-08-05 23:28:35.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-08-05 23:40:46.000000000 +0800
@@ -34,6 +34,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long inodes_written;		/* # of inodes (metadata) written */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 12/13] writeback: try more writeback as long as something was written
  2010-08-05 16:11 ` [PATCH 12/13] writeback: try more writeback as long as something was written Wu Fengguang
@ 2010-08-05 17:00   ` Jan Kara
  2010-08-05 22:39     ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2010-08-05 17:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel,
	linux-mm

On Fri 06-08-10 00:11:03, Wu Fengguang wrote:
> writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> they only populate b_io when necessary at entrance time. When the queued
> set of inodes are all synced, they just return, possibly with
> wbc.nr_to_write > 0.
> 
> For kupdate and background writeback, there may be more eligible inodes
> sitting in b_dirty when the current set of b_io inodes are completed. So
> it is necessary to try another round of writeback as long as we made some
> progress in this round. When there are no more eligible inodes, no more
> inodes will be enqueued in queue_io(), hence nothing could/will be
> synced and we may safely bail.
  This looks like a sane thing to do. Just one comment below...
 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:30:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-08-05 23:30:45.000000000 +0800
> @@ -654,20 +654,23 @@ static long wb_writeback(struct bdi_writ
>  		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  
>  		/*
> -		 * If we consumed everything, see if we have more
> +		 * Did we write something? Try for more
> +		 *
> +		 * This is needed _before_ the b_more_io test because the
> +		 * background writeback moves inodes to b_io and works on
  Well, this applies generally to any writeback, not just background
writeback, right? Whenever we process all inodes from the b_io list and
move them somewhere other than b_more_io, this applies. Some new dirty
data could have arrived while we were doing the write... I'm just afraid
that in some pathological cases this could result in a bad writeback
pattern - e.g. if some process manages to dirty just a few pages while we
are doing writeout, this looping could end up writing just a few pages in
each round, which is bad for fragmentation etc.
  Actually, this comment probably also applies to your patch where you
change the queueing logic in writeback_single_inode(), doesn't it?

								Honza

> +		 * them in batches (in order to sync old pages first).  The
> +		 * completion of the current batch does not necessarily mean
> +		 * the overall work is done.
>  		 */
> -		if (wbc.nr_to_write <= 0)
> +		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
>  			continue;
> +
>  		/*
> -		 * Didn't write everything and we don't have more IO, bail
> +		 * Nothing written and no more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
> -		/*
> -		 * Did we write something? Try for more
> -		 */
> -		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> -			continue;
> +
>  		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 12/13] writeback: try more writeback as long as something was written
  2010-08-05 17:00   ` Jan Kara
@ 2010-08-05 22:39     ` Wu Fengguang
  2010-08-05 22:50       ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2010-08-05 22:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Peter Zijlstra, linux-fsdevel, linux-mm

On Fri, Aug 06, 2010 at 01:00:16AM +0800, Jan Kara wrote:
> On Fri 06-08-10 00:11:03, Wu Fengguang wrote:
> > writeback_inodes_wb()/__writeback_inodes_sb() are not aggressive in that
> > they only populate b_io when necessary at entrance time. When the queued
> > set of inodes are all synced, they just return, possibly with
> > wbc.nr_to_write > 0.
> > 
> > For kupdate and background writeback, there may be more eligible inodes
> > sitting in b_dirty when the current set of b_io inodes are completed. So
> > it is necessary to try another round of writeback as long as we made some
> > progress in this round. When there are no more eligible inodes, no more
> > inodes will be enqueued in queue_io(), hence nothing could/will be
> > synced and we may safely bail.
>   This looks like a sane thing to do. Just one comment below...
>  
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |   19 +++++++++++--------
> >  1 file changed, 11 insertions(+), 8 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2010-08-05 23:30:27.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2010-08-05 23:30:45.000000000 +0800
> > @@ -654,20 +654,23 @@ static long wb_writeback(struct bdi_writ
> >  		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> >  
> >  		/*
> > -		 * If we consumed everything, see if we have more
> > +		 * Did we write something? Try for more
> > +		 *
> > +		 * This is needed _before_ the b_more_io test because the
> > +		 * background writeback moves inodes to b_io and works on
>   Well, this applies generally to any writeback, not just a background one
> right? Whenever we process all inodes from b_io list and move them
> somewhere else than b_more_io, then this applies. Some new dirty data could
> have arrived while we were doing the write...

Right, but it is only a hard requirement for background writeback.
For the other kinds of writeback this patch is not a necessity.

> I'm just afraid that in some
> pathological cases this could result in bad writeback pattern - like if
> there is some process which manages to dirty just a few pages while we are
> doing writeout, this looping could result in writing just a few pages in
> each round which is bad for fragmentation etc.

Such inodes will be redirty_tail()ed here:

                if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
                        /*
                         * We didn't write back all the pages.  nfs_writepages()
                         * sometimes bales out without doing anything.
                         */
                        inode->i_state |= I_DIRTY_PAGES;
                        if (wbc->nr_to_write <= 0) {
                                /*
                                 * slice used up: queue for next turn
                                 */
                                requeue_io(inode);
                        } else {
                                /*
                                 * Writeback blocked by something other than
                                 * congestion. Delay the inode for some time to
                                 * avoid spinning on the CPU (100% iowait)
                                 * retrying writeback of the dirty page/inode
                                 * that cannot be performed immediately.
                                 */
                                redirty_tail(inode);
                        }

>   Actually, this comment probably also applies to your patch where you
> change the queueing logic in writeback_single_inode(), doesn't it?

Can you elaborate?
 
Thanks,
Fengguang

> 
> > +		 * them in batches (in order to sync old pages first).  The
> > +		 * completion of the current batch does not necessarily mean
> > +		 * the overall work is done.
> >  		 */
> > -		if (wbc.nr_to_write <= 0)
> > +		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> >  			continue;
> > +
> >  		/*
> > -		 * Didn't write everything and we don't have more IO, bail
> > +		 * Nothing written and no more inodes for IO, bail
> >  		 */
> >  		if (list_empty(&wb->b_more_io))
> >  			break;
> > -		/*
> > -		 * Did we write something? Try for more
> > -		 */
> > -		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> > -			continue;
> > +
> >  		/*
> >  		 * Nothing written. Wait for some inode to
> >  		 * become available for writeback. Otherwise
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 12/13] writeback: try more writeback as long as something was written
  2010-08-05 22:39     ` Wu Fengguang
@ 2010-08-05 22:50       ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2010-08-05 22:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, LKML, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, Peter Zijlstra,
	linux-fsdevel, linux-mm

On Fri 06-08-10 06:39:29, Wu Fengguang wrote:
> On Fri, Aug 06, 2010 at 01:00:16AM +0800, Jan Kara wrote:
> > I'm just afraid that in some
> > pathological cases this could result in bad writeback pattern - like if
> > there is some process which manages to dirty just a few pages while we are
> > doing writeout, this looping could result in writing just a few pages in
> > each round which is bad for fragmentation etc.
> 
> Such inodes will be redirty_tail()ed here:
> 
>                 if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>                         /*
>                          * We didn't write back all the pages.  nfs_writepages()
>                          * sometimes bales out without doing anything.
>                          */
>                         inode->i_state |= I_DIRTY_PAGES;
>                         if (wbc->nr_to_write <= 0) {
>                                 /*
>                                  * slice used up: queue for next turn
>                                  */
>                                 requeue_io(inode);
>                         } else {
>                                 /*
>                                  * Writeback blocked by something other than
>                                  * congestion. Delay the inode for some time to
>                                  * avoid spinning on the CPU (100% iowait)
>                                  * retrying writeback of the dirty page/inode
>                                  * that cannot be performed immediately.
>                                  */
>                                 redirty_tail(inode);
>                         }
  Yes. And then, when there are no inodes in b_more_io, they get queued
again for writeback. So for non-background WB_SYNC_NONE writeback we could
just write a few pages over and over again... Oh, OK, we won't, because of
my start_time fix, I suppose. Maybe a comment about this next to the
nr_to_write < MAX_WRITEBACK_PAGES check would be good.

> >   Actually, this comment probably also applies to your patch where you
> > change the queueing logic in writeback_single_inode(), doesn't it?
> 
> Can you elaborate?
  Sorry, my comment only applies to this particular patch. In your change
to writeback_single_inode() you requeue_io() only if nr_to_write <= 0.

								Honza

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/13] writeback patches for 2.6.36
  2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
                   ` (12 preceding siblings ...)
  2010-08-05 16:11 ` [PATCH 13/13] writeback: introduce writeback_control.inodes_written Wu Fengguang
@ 2010-08-05 23:08 ` Andrew Morton
  13 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2010-08-05 23:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, Dave Chinner, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Jan Kara, Peter Zijlstra, linux-fsdevel, linux-mm

On Fri, 06 Aug 2010 00:10:51 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> These are writeback patches intended for 2.6.36.

I won't be here Friday.  I need to get my junk into mainline on Monday.
Everybody and his dog is working on writeback, which is good, but
there's a lot of flux here.

So, no, I think it's too late to be thinking about 2.6.36-rc1.  Let's
slow down a bit, review and test each other's proposals and work out
what we want to do in the longer term.  After we've done that, we can
calmly and carefully take a look to see if there are any nice goodies
which we want to slip into 2.6.36-rc3 or thereabouts.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-05 16:10 ` [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio Wu Fengguang
@ 2010-08-05 23:34   ` Andrew Morton
  2010-08-06 12:44     ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2010-08-05 23:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: LKML, Peter Zijlstra, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, Jan Kara, linux-fsdevel,
	linux-mm

On Fri, 06 Aug 2010 00:10:58 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> 
> Currently global_dirty_limits() applies a low bound of 5% for
> vm_dirty_ratio.  This is not very user visible -- if the user sets
> vm.dirty_ratio=1, the operation seems to succeed but will be rounded up
> to 5% when used.
> 
> Another problem is inconsistency: calc_period_shift() uses the plain
> vm_dirty_ratio value, which may be a problem when vm.dirty_ratio is set
> to < 5 by the user.

The changelog describes the old behaviour but doesn't describe the
proposed new behaviour.

> --- linux-next.orig/kernel/sysctl.c	2010-08-05 22:48:34.000000000 +0800
> +++ linux-next/kernel/sysctl.c	2010-08-05 22:48:47.000000000 +0800
> @@ -126,6 +126,7 @@ static int ten_thousand = 10000;
>  
>  /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
>  static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
> +static int dirty_ratio_min = 5;
>  
>  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
>  static int maxolduid = 65535;
> @@ -1031,7 +1032,7 @@ static struct ctl_table vm_table[] = {
>  		.maxlen		= sizeof(vm_dirty_ratio),
>  		.mode		= 0644,
>  		.proc_handler	= dirty_ratio_handler,
> -		.extra1		= &zero,
> +		.extra1		= &dirty_ratio_min,
>  		.extra2		= &one_hundred,
>  	},

I forget how the procfs core handles this.  Presumably the write will
now fail with -EINVAL or something?  So people's scripts will now
error out and their space shuttles will crash?

All of which illustrates why it's important to fully describe changes
in the changelog!  So people can consider and discuss the end-user
implications of a change.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-08-05 16:10 ` [PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
@ 2010-08-06 10:14   ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-06 10:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Jan Kara, linux-fsdevel, linux-mm

On Fri, 2010-08-06 at 00:10 +0800, Wu Fengguang wrote:
> plain text document attachment (writeback-less-bdi-calc.patch)
> Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> so that the latter can be avoided when under global dirty background
> threshold (which is the normal state for most systems).
> 
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 03/13] writeback: add comment to the dirty limits functions
  2010-08-05 16:10 ` [PATCH 03/13] writeback: add comment to the dirty limits functions Wu Fengguang
@ 2010-08-06 10:17   ` Peter Zijlstra
  2010-08-07 16:47     ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2010-08-06 10:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, LKML, Christoph Hellwig, Dave Chinner, Jens Axboe,
	Mel Gorman, Chris Mason, Jan Kara, linux-fsdevel, linux-mm

On Fri, 2010-08-06 at 00:10 +0800, Wu Fengguang wrote:

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> +/**
> + * bdi_dirty_limit - @bdi's share of dirty throttling threshold
> + *
> + * Allocate high/low dirty limits to fast/slow devices, in order to prevent
> + * - starving fast devices
> + * - piling up dirty pages (that will take long time to sync) on slow devices
> + *
> + * The bdi's share of dirty limit will be adapting to its throughput and
> + * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
> + */ 

Another thing solved by the introduction of per-bdi dirty limits (and
now per-bdi flushing) is the whole stacked-bdi writeout deadlock.

Although I'm not sure we want/need to mention that here.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-05 23:34   ` Andrew Morton
@ 2010-08-06 12:44     ` Wu Fengguang
  2010-08-10  3:12       ` KOSAKI Motohiro
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2010-08-06 12:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Peter Zijlstra, Dave Chinner, Christoph Hellwig,
	Mel Gorman, Chris Mason, Jens Axboe, Jan Kara, linux-fsdevel,
	linux-mm

On Fri, Aug 06, 2010 at 07:34:01AM +0800, Andrew Morton wrote:
> On Fri, 06 Aug 2010 00:10:58 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> > 
> > Currently global_dirty_limits() applies a low bound of 5% for
> > vm_dirty_ratio.  This is not very user visible -- if the user sets
> > vm.dirty_ratio=1, the operation seems to succeed but will be rounded up
> > to 5% when used.
> > 
> > Another problem is inconsistency: calc_period_shift() uses the plain
> > vm_dirty_ratio value, which may be a problem when vm.dirty_ratio is set
> > to < 5 by the user.
> 
> The changelog describes the old behaviour but doesn't describe the
> proposed new behaviour.

Yeah, fixed below.

> > --- linux-next.orig/kernel/sysctl.c	2010-08-05 22:48:34.000000000 +0800
> > +++ linux-next/kernel/sysctl.c	2010-08-05 22:48:47.000000000 +0800
> > @@ -126,6 +126,7 @@ static int ten_thousand = 10000;
> >  
> >  /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
> >  static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
> > +static int dirty_ratio_min = 5;
> >  
> >  /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
> >  static int maxolduid = 65535;
> > @@ -1031,7 +1032,7 @@ static struct ctl_table vm_table[] = {
> >  		.maxlen		= sizeof(vm_dirty_ratio),
> >  		.mode		= 0644,
> >  		.proc_handler	= dirty_ratio_handler,
> > -		.extra1		= &zero,
> > +		.extra1		= &dirty_ratio_min,
> >  		.extra2		= &one_hundred,
> >  	},
> 
> I forget how the procfs core handles this.  Presumably the write will
> now fail with -EINVAL or something?

Right.
         # echo 111 > /proc/sys/vm/dirty_ratio
         echo: write error: invalid argument

> So people's scripts will now error out and their space shuttles will
> crash?

Looks like a serious problem. I'm now much more reserved about pushing
this patch :)

> All of which illustrates why it's important to fully describe changes
> in the changelog!  So people can consider and discuss the end-user
> implications of a change.

Good point. Here is the patch with updated changelog.

Thanks,
Fengguang
---
Subject: writeback: explicit low bound for vm.dirty_ratio
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Thu Jul 15 10:28:57 CST 2010

Force a user visible low bound of 5% for the vm.dirty_ratio interface.

This is an interface change. When doing

	echo N > /proc/sys/vm/dirty_ratio

where N < 5, the old behavior is to pretend to accept the value, while
the new behavior is to reject it explicitly with -EINVAL.  This may
break user space if it checks the return value.

Currently global_dirty_limits() applies a low bound of 5% to
vm_dirty_ratio.  This is not very visible to the user -- if the user sets
vm.dirty_ratio=1, the write appears to succeed, but the value is rounded
up to 5% when used.

Another problem is inconsistency: calc_period_shift() uses the plain
vm_dirty_ratio value, which may be a problem when vm.dirty_ratio is set
to < 5 by the user.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 kernel/sysctl.c     |    3 ++-
 mm/page-writeback.c |   10 ++--------
 2 files changed, 4 insertions(+), 9 deletions(-)

--- linux-next.orig/kernel/sysctl.c	2010-08-05 22:48:34.000000000 +0800
+++ linux-next/kernel/sysctl.c	2010-08-05 22:48:47.000000000 +0800
@@ -126,6 +126,7 @@ static int ten_thousand = 10000;
 
 /* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
 static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
+static int dirty_ratio_min = 5;
 
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
@@ -1031,7 +1032,7 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
 		.proc_handler	= dirty_ratio_handler,
-		.extra1		= &zero,
+		.extra1		= &dirty_ratio_min,
 		.extra2		= &one_hundred,
 	},
 	{
--- linux-next.orig/mm/page-writeback.c	2010-08-05 22:48:42.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 22:48:47.000000000 +0800
@@ -415,14 +415,8 @@ void global_dirty_limits(unsigned long *
 
 	if (vm_dirty_bytes)
 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
-	else {
-		int dirty_ratio;
-
-		dirty_ratio = vm_dirty_ratio;
-		if (dirty_ratio < 5)
-			dirty_ratio = 5;
-		dirty = (dirty_ratio * available_memory) / 100;
-	}
+	else
+		dirty = (vm_dirty_ratio * available_memory) / 100;
 
 	if (dirty_background_bytes)
 		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 03/13] writeback: add comment to the dirty limits functions
  2010-08-06 10:17   ` Peter Zijlstra
@ 2010-08-07 16:47     ` Wu Fengguang
  0 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-07 16:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, LKML, Christoph Hellwig, Dave Chinner, Jens Axboe,
	Mel Gorman, Chris Mason, Jan Kara, linux-fsdevel, linux-mm

On Fri, Aug 06, 2010 at 06:17:26PM +0800, Peter Zijlstra wrote:
> On Fri, 2010-08-06 at 00:10 +0800, Wu Fengguang wrote:
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> > +/**
> > + * bdi_dirty_limit - @bdi's share of dirty throttling threshold
> > + *
> > + * Allocate high/low dirty limits to fast/slow devices, in order to prevent
> > + * - starving fast devices
> > + * - piling up dirty pages (that will take long time to sync) on slow devices
> > + *
> > + * The bdi's share of dirty limit will be adapting to its throughput and
> > + * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
> > + */ 
> 
> Another thing solved by the introduction of per-bdi dirty limits (and
> now per-bdi flushing) is the whole stacked-bdi writeout deadlock.
> 
> Although I'm not sure we want/need to mention that here.

The changelog looks like a suitable place :)

Thanks,
Fengguang
---
Subject: writeback: add comment to the dirty limits functions
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Thu Jul 15 09:54:25 CST 2010

Document global_dirty_limits(), bdi_dirty_limit() and task_dirty_limit().

Note that another thing solved by the introduction of per-bdi dirty
limits (and now per-bdi flushing) is the whole stacked-bdi writeout
deadlock.						-- Peter

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-03 23:14:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 00:37:17.000000000 +0800
@@ -261,11 +261,18 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * task_dirty_limit - scale down dirty throttling threshold for one task
  *
  * task specific dirty limit:
  *
  *   dirty -= (dirty/8) * p_{t}
+ *
+ * To protect light/slow dirtying tasks from heavier/fast ones, we start
+ * throttling individual tasks before reaching the bdi dirty limit.
+ * Relatively low thresholds will be allocated to heavy dirtiers. So when
+ * dirty pages grow large, heavy dirtiers will be throttled first, which will
+ * effectively curb the growth of dirty pages. Light dirtiers with high enough
+ * dirty threshold may never get throttled.
  */
 static unsigned long task_dirty_limit(struct task_struct *tsk,
 				       unsigned long bdi_dirty)
@@ -390,6 +397,15 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+/**
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
+ * - vm.dirty_ratio             or  vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * runtime tasks.
+ */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
@@ -424,8 +440,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty)
+/**
+ * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ *
+ * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * - starving fast devices
+ * - piling up dirty pages (that will take long time to sync) on slow devices
+ *
+ * The bdi's share of dirty limit will be adapting to its throughput and
+ * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-06 12:44     ` Wu Fengguang
@ 2010-08-10  3:12       ` KOSAKI Motohiro
  2010-08-10  3:57         ` Neil Brown
  2010-08-10 18:06         ` Wu Fengguang
  0 siblings, 2 replies; 28+ messages in thread
From: KOSAKI Motohiro @ 2010-08-10  3:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Andrew Morton, LKML, Peter Zijlstra,
	Dave Chinner, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Jan Kara, linux-fsdevel, linux-mm

> Subject: writeback: explicit low bound for vm.dirty_ratio
> From: Wu Fengguang <fengguang.wu@intel.com>
> Date: Thu Jul 15 10:28:57 CST 2010
> 
> Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> 
> This is an interface change. When doing
> 
> 	echo N > /proc/sys/vm/dirty_ratio
> 
> where N < 5, the old behavior is pretend to accept the value, while
> the new behavior is to reject it explicitly with -EINVAL.  This will
> possibly break user space if they checks the return value.

Umm.. I dislike this change. Is there any good reason to refuse an
admin's explicit choice? Why is 1-4% so bad? Internal clipping can be
changed later, but explicit error behavior is hard to change later.

Personally I would prefer to
 - accept all values, or
 - clip the value in dirty_ratio_handler

Neither involves an explicit ABI change.

Thanks.




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-10  3:12       ` KOSAKI Motohiro
@ 2010-08-10  3:57         ` Neil Brown
  2010-08-10 13:29           ` Jan Kara
  2010-08-10 18:12           ` Wu Fengguang
  2010-08-10 18:06         ` Wu Fengguang
  1 sibling, 2 replies; 28+ messages in thread
From: Neil Brown @ 2010-08-10  3:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Wu Fengguang, Andrew Morton, LKML, Peter Zijlstra, Dave Chinner,
	Christoph Hellwig, Mel Gorman, Chris Mason, Jens Axboe, Jan Kara,
	linux-fsdevel, linux-mm

On Tue, 10 Aug 2010 12:12:06 +0900 (JST)
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > Subject: writeback: explicit low bound for vm.dirty_ratio
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > Date: Thu Jul 15 10:28:57 CST 2010
> > 
> > Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> > 
> > This is an interface change. When doing
> > 
> > 	echo N > /proc/sys/vm/dirty_ratio
> > 
> > where N < 5, the old behavior is pretend to accept the value, while
> > the new behavior is to reject it explicitly with -EINVAL.  This will
> > possibly break user space if they checks the return value.
> 
> Umm.. I dislike this change. Is there any good reason to refuse explicit 
> admin's will? Why 1-4% is so bad? Internal clipping can be changed later
> but explicit error behavior is hard to change later.

As a data point, I had a situation a while back where I needed a value below
1 to get the desired behaviour.  The system had lots of RAM and fairly slow
write-back (over NFS), so a 'sync' could take minutes.

So I would much prefer allowing not only 1-4, but fractional values as well!!!

I can see no justification at all for setting a lower bound of 5.  Even zero
can be useful - for testing purposes, mostly.

NeilBrown

> personally I prefer to
>  - accept all value, or
>  - clipping value in dirty_ratio_handler 
> 
> Both don't have explicit ABI change.
> 
> Thanks.
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-10  3:57         ` Neil Brown
@ 2010-08-10 13:29           ` Jan Kara
  2010-08-10 18:12           ` Wu Fengguang
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Kara @ 2010-08-10 13:29 UTC (permalink / raw)
  To: Neil Brown
  Cc: KOSAKI Motohiro, Wu Fengguang, Andrew Morton, LKML,
	Peter Zijlstra, Dave Chinner, Christoph Hellwig, Mel Gorman,
	Chris Mason, Jens Axboe, Jan Kara, linux-fsdevel, linux-mm

On Tue 10-08-10 13:57:12, Neil Brown wrote:
> On Tue, 10 Aug 2010 12:12:06 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > Subject: writeback: explicit low bound for vm.dirty_ratio
> > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > Date: Thu Jul 15 10:28:57 CST 2010
> > > 
> > > Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> > > 
> > > This is an interface change. When doing
> > > 
> > > 	echo N > /proc/sys/vm/dirty_ratio
> > > 
> > > where N < 5, the old behavior is pretend to accept the value, while
> > > the new behavior is to reject it explicitly with -EINVAL.  This will
> > > possibly break user space if they checks the return value.
> > 
> > Umm.. I dislike this change. Is there any good reason to refuse explicit 
> > admin's will? Why 1-4% is so bad? Internal clipping can be changed later
> > but explicit error behavior is hard to change later.
> 
> As a data-point, I had a situation a while back where I needed a value below
> 1 to get desired behaviour.  The system had lots of RAM and fairly slow
> write-back (over NFS) so a 'sync' could take minutes.
> 
> So I would much prefer allowing not only 1-4, but also fraction values!!!
> 
> I can see no justification at all for setting a lower bound of 5.  Even zero
> can be useful - for testing purposes mostly.
  If you run a recent kernel, /proc/sys/vm/dirty_background_bytes and
dirty_bytes were introduced exactly for these purposes. Not that I
think the magic clipping at 5% is a good thing...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-10  3:12       ` KOSAKI Motohiro
  2010-08-10  3:57         ` Neil Brown
@ 2010-08-10 18:06         ` Wu Fengguang
  1 sibling, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-10 18:06 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, LKML, Peter Zijlstra, Dave Chinner,
	Christoph Hellwig, Mel Gorman, Chris Mason, Jens Axboe, Jan Kara,
	linux-fsdevel, linux-mm, Neil Brown

On Tue, Aug 10, 2010 at 11:12:06AM +0800, KOSAKI Motohiro wrote:
> > Subject: writeback: explicit low bound for vm.dirty_ratio
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > Date: Thu Jul 15 10:28:57 CST 2010
> > 
> > Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> > 
> > This is an interface change. When doing
> > 
> > 	echo N > /proc/sys/vm/dirty_ratio
> > 
> > where N < 5, the old behavior is pretend to accept the value, while
> > the new behavior is to reject it explicitly with -EINVAL.  This will
> > possibly break user space if they checks the return value.
> 
> Umm.. I dislike this change. Is there any good reason to refuse explicit 
> admin's will? Why 1-4% is so bad? Internal clipping can be changed later
> but explicit error behavior is hard to change later.
> 
> personally I prefer to
>  - accept all value, or
>  - clipping value in dirty_ratio_handler 
> 
> Both don't have explicit ABI change.

Good point. Sorry for my ignorance. Neil is right that there is no
reason to impose a low bound. So the first option looks good.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio
  2010-08-10  3:57         ` Neil Brown
  2010-08-10 13:29           ` Jan Kara
@ 2010-08-10 18:12           ` Wu Fengguang
  1 sibling, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2010-08-10 18:12 UTC (permalink / raw)
  To: Neil Brown
  Cc: KOSAKI Motohiro, Andrew Morton, LKML, Peter Zijlstra,
	Dave Chinner, Christoph Hellwig, Mel Gorman, Chris Mason,
	Jens Axboe, Jan Kara, linux-fsdevel, linux-mm

On Tue, Aug 10, 2010 at 11:57:12AM +0800, Neil Brown wrote:
> On Tue, 10 Aug 2010 12:12:06 +0900 (JST)
> KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > > Subject: writeback: explicit low bound for vm.dirty_ratio
> > > From: Wu Fengguang <fengguang.wu@intel.com>
> > > Date: Thu Jul 15 10:28:57 CST 2010
> > > 
> > > Force a user visible low bound of 5% for the vm.dirty_ratio interface.
> > > 
> > > This is an interface change. When doing
> > > 
> > > 	echo N > /proc/sys/vm/dirty_ratio
> > > 
> > > where N < 5, the old behavior is pretend to accept the value, while
> > > the new behavior is to reject it explicitly with -EINVAL.  This will
> > > possibly break user space if they checks the return value.
> > 
> > Umm.. I dislike this change. Is there any good reason to refuse explicit 
> > admin's will? Why 1-4% is so bad? Internal clipping can be changed later
> > but explicit error behavior is hard to change later.
> 
> As a data-point, I had a situation a while back where I needed a value below
> 1 to get desired behaviour.  The system had lots of RAM and fairly slow
> write-back (over NFS) so a 'sync' could take minutes.

Jan, here is a use case to limit dirty pages on slow devices :)

> So I would much prefer allowing not only 1-4, but also fraction values!!!
> 
> I can see no justification at all for setting a lower bound of 5.  Even zero
> can be useful - for testing purposes mostly.

Neil, that's a perfectly legitimate need which I overlooked.
It seems that the vm.dirty_bytes parameter will work for your case.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-08-10 18:12 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-05 16:10 [PATCH 00/13] writeback patches for 2.6.36 Wu Fengguang
2010-08-05 16:10 ` [PATCH 01/13] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
2010-08-05 16:10 ` [PATCH 02/13] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
2010-08-06 10:14   ` Peter Zijlstra
2010-08-05 16:10 ` [PATCH 03/13] writeback: add comment to the dirty limits functions Wu Fengguang
2010-08-06 10:17   ` Peter Zijlstra
2010-08-07 16:47     ` Wu Fengguang
2010-08-05 16:10 ` [PATCH 04/13] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
2010-08-05 16:10 ` [PATCH 05/13] writeback: fix queue_io() ordering Wu Fengguang
2010-08-05 16:10 ` [PATCH 06/13] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
2010-08-05 16:10 ` [PATCH 07/13] writeback: explicit low bound for vm.dirty_ratio Wu Fengguang
2010-08-05 23:34   ` Andrew Morton
2010-08-06 12:44     ` Wu Fengguang
2010-08-10  3:12       ` KOSAKI Motohiro
2010-08-10  3:57         ` Neil Brown
2010-08-10 13:29           ` Jan Kara
2010-08-10 18:12           ` Wu Fengguang
2010-08-10 18:06         ` Wu Fengguang
2010-08-05 16:10 ` [PATCH 08/13] writeback: pass writeback_control down to move_expired_inodes() Wu Fengguang
2010-08-05 16:11 ` [PATCH 09/13] writeback: the kupdate expire timestamp should be a moving target Wu Fengguang
2010-08-05 16:11 ` [PATCH 10/13] writeback: kill writeback_control.more_io Wu Fengguang
2010-08-05 16:11 ` [PATCH 11/13] writeback: sync expired inodes first in background writeback Wu Fengguang
2010-08-05 16:11 ` [PATCH 12/13] writeback: try more writeback as long as something was written Wu Fengguang
2010-08-05 17:00   ` Jan Kara
2010-08-05 22:39     ` Wu Fengguang
2010-08-05 22:50       ` Jan Kara
2010-08-05 16:11 ` [PATCH 13/13] writeback: introduce writeback_control.inodes_written Wu Fengguang
2010-08-05 23:08 ` [PATCH 00/13] writeback patches for 2.6.36 Andrew Morton
