* [RFC PATCH] mm: balance_dirty_pages. reduce calls to global_page_state to reduce cache references
From: Richard Kennedy @ 2009-08-21 11:59 UTC
  To: Andrew Morton; +Cc: chris.mason, linux-mm, lkml, Peter Zijlstra, Jens Axboe

Reducing the number of times balance_dirty_pages calls global_page_state
reduces cache references and so improves write performance on a variety
of workloads.

'perf stat' counts from simple fio write tests show the reduction in
cache references. The test is fio 'write,mmap,600Mb,pre_read' on an AMD
Athlon X2 with 3GB of memory (dirty_threshold approx 600MB), running
each test 10 times and taking the average and standard deviation:

		cache references, average (s.d.), in millions (10^6)
2.6.31-rc6	661 (9.88)
+patch		604 (4.19)
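
For reference, the numbers can be reproduced with something along these
lines; the job file is a sketch reconstructed from the
'write,mmap,600Mb,pre_read' spec above, not necessarily the exact file
used:

  # write-test.fio (hypothetical reconstruction)
  [write-test]
  rw=write        ; sequential buffered writes
  ioengine=mmap   ; write through mmap()ed files
  size=600m       ; roughly the dirty_threshold on this box
  pre_read=1      ; fault the file in before the timed run

  # count cache references across the whole run
  $ perf stat -e cache-references fio write-test.fio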

This reduction comes from dropping clip_bdi_dirty_limit, which rereads
the counters to apply the dirty_threshold, and from moving that check up
into balance_dirty_pages, which has already read the counters.

Rearranging the for loop so that it contains only one copy of the limit
tests also allows the pdflush test after the loop to use the local
copies of the counters rather than rereading them.

In the common case, with no throttling, it now calls global_page_state
5 fewer times and bdi_stat 2 fewer times.
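
In outline, the reordered loop body looks like this (a simplified
paraphrase of the diff below, with the stacked-BDI handling and the
writeout details elided):

	for (;;) {
		/* read all the counters once per iteration */
		get_dirty_limits(&background_thresh, &dirty_thresh,
				 &bdi_thresh, bdi);
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
			global_page_state(NR_UNSTABLE_NFS);
		nr_writeback = global_page_state(NR_WRITEBACK) +
			global_page_state(NR_WRITEBACK_TEMP);

		if (nr_reclaimable + nr_writeback < dirty_thresh) {
			/* under the global limit: the early exits apply */
			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
				break;
			if (nr_reclaimable + nr_writeback <
			    (background_thresh + dirty_thresh) / 2)
				break;
			if (pages_written >= write_chunk)
				break;
		}
		/* over the global limit: always throttle */
		bdi->dirty_exceeded = 1;
		/* ... writeback_inodes() and/or congestion_wait() ... */
	}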

I have tried to retain the existing behavior as much as possible, but
have added NR_WRITEBACK_TEMP to nr_writeback. This counter was used in
clip_bdi_dirty_limit but not in balance_dirty_pages; grep suggests it is
only used by FUSE, but I haven't done any testing on that. It does seem
logical to count all the WRITEBACK pages when making the throttling
decisions, so this change should be more correct ;)

I have been running this patch for over a week and have had no problems
with it, and I generally see improved disk write performance on a
variety of tests and workloads; even in the worst cases performance
matches the unpatched kernel. I also tried this on an Intel Atom 330
dual-core system and saw similar improvements.

    
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
---
 page-writeback.c |  116 ++++++++++++++++++++-----------------------------------
 1 file changed, 43 insertions(+), 73 deletions(-)

This patch is against 2.6.31-rc6.



diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..6f18e40 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -260,32 +260,6 @@ static void bdi_writeout_fraction(struct backing_dev_info *bdi,
 	}
 }
 
-/*
- * Clip the earned share of dirty pages to that which is actually available.
- * This avoids exceeding the total dirty_limit when the floating averages
- * fluctuate too quickly.
- */
-static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
-		unsigned long dirty, unsigned long *pbdi_dirty)
-{
-	unsigned long avail_dirty;
-
-	avail_dirty = global_page_state(NR_FILE_DIRTY) +
-		 global_page_state(NR_WRITEBACK) +
-		 global_page_state(NR_UNSTABLE_NFS) +
-		 global_page_state(NR_WRITEBACK_TEMP);
-
-	if (avail_dirty < dirty)
-		avail_dirty = dirty - avail_dirty;
-	else
-		avail_dirty = 0;
-
-	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
-		bdi_stat(bdi, BDI_WRITEBACK);
-
-	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
-}
-
 static inline void task_dirties_fraction(struct task_struct *tsk,
 		long *numerator, long *denominator)
 {
@@ -478,7 +452,6 @@ get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 			bdi_dirty = dirty * bdi->max_ratio / 100;
 
 		*pbdi_dirty = bdi_dirty;
-		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
 		task_dirty_limit(current, pbdi_dirty);
 	}
 }
@@ -512,45 +485,12 @@ static void balance_dirty_pages(struct address_space *mapping)
 		};
 
 		get_dirty_limits(&background_thresh, &dirty_thresh,
-				&bdi_thresh, bdi);
+				 &bdi_thresh, bdi);
 
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
-
-		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
-
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-			break;
-
-		/*
-		 * Throttle it only when the background writeback cannot
-		 * catch-up. This avoids (excessively) small writeouts
-		 * when the bdi limits are ramping up.
-		 */
-		if (nr_reclaimable + nr_writeback <
-				(background_thresh + dirty_thresh) / 2)
-			break;
-
-		if (!bdi->dirty_exceeded)
-			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes(&wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			get_dirty_limits(&background_thresh, &dirty_thresh,
-				       &bdi_thresh, bdi);
-		}
+			global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK) +
+			global_page_state(NR_WRITEBACK_TEMP);
 
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
@@ -570,16 +510,48 @@ static void balance_dirty_pages(struct address_space *mapping)
 			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
 		}
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-			break;
-		if (pages_written >= write_chunk)
-			break;		/* We've done our duty */
+		/* always throttle if over threshold */
+		if (nr_reclaimable + nr_writeback < dirty_thresh) {
+
+			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+				break;
+
+			/*
+			 * Throttle it only when the background writeback cannot
+			 * catch-up. This avoids (excessively) small writeouts
+			 * when the bdi limits are ramping up.
+			 */
+			if (nr_reclaimable + nr_writeback <
+			    (background_thresh + dirty_thresh) / 2)
+				break;
+
+			/* done enough? */
+			if (pages_written >= write_chunk)
+				break;
+		}
+		if (!bdi->dirty_exceeded)
+			bdi->dirty_exceeded = 1;
 
+		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
+		 * Unstable writes are a feature of certain networked
+		 * filesystems (i.e. NFS) in which data may have been
+		 * written to the server's write cache, but has not yet
+		 * been flushed to permanent storage.
+		 * Only move pages to writeback if this bdi is over its
+		 * threshold otherwise wait until the disk writes catch
+		 * up.
+		 */
+		if (bdi_nr_reclaimable > bdi_thresh) {
+			writeback_inodes(&wbc);
+			pages_written += write_chunk - wbc.nr_to_write;
+			if (wbc.nr_to_write == 0)
+				continue;
+		}
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 	}
 
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
-			bdi->dirty_exceeded)
+	    bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
@@ -593,10 +565,8 @@ static void balance_dirty_pages(struct address_space *mapping)
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
-					  + global_page_state(NR_UNSTABLE_NFS)
-					  > background_thresh)))
+	if ((laptop_mode && pages_written) || (!laptop_mode &&
+	     (nr_reclaimable > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 



* Re: [RFC PATCH] mm: balance_dirty_pages. reduce calls to global_page_state to reduce cache references
From: Peter Zijlstra @ 2009-08-21 14:04 UTC
  To: Richard Kennedy; +Cc: Andrew Morton, chris.mason, lkml, Jens Axboe, miklos

(removed linux-mm because it seems to be ill atm)

On Fri, 2009-08-21 at 12:59 +0100, Richard Kennedy wrote:
> Reducing the number of times balance_dirty_pages calls global_page_state
> reduces cache references and so improves write performance on a variety
> of workloads.
> 
> 'perf stat' counts from simple fio write tests show the reduction in
> cache references. The test is fio 'write,mmap,600Mb,pre_read' on an AMD
> Athlon X2 with 3GB of memory (dirty_threshold approx 600MB), running
> each test 10 times and taking the average and standard deviation:
> 
> 		cache references, average (s.d.), in millions (10^6)
> 2.6.31-rc6	661 (9.88)
> +patch		604 (4.19)

Nice.

> This reduction comes from dropping clip_bdi_dirty_limit, which rereads
> the counters to apply the dirty_threshold, and from moving that check up
> into balance_dirty_pages, which has already read the counters.

OK, so what you did is first check the total dirty limit, and only if
that is ok, check the per-BDI limit. Now why didn't I think of that ;-)

> Rearranging the for loop so that it contains only one copy of the limit
> tests also allows the pdflush test after the loop to use the local
> copies of the counters rather than rereading them.
> 
> In the common case, with no throttling, it now calls global_page_state
> 5 fewer times and bdi_stat 2 fewer times.
> 
> I have tried to retain the existing behavior as much as possible, but
> have added NR_WRITEBACK_TEMP to nr_writeback. This counter was used in
> clip_bdi_dirty_limit but not in balance_dirty_pages; grep suggests it
> is only used by FUSE, but I haven't done any testing on that. It does
> seem logical to count all the WRITEBACK pages when making the
> throttling decisions, so this change should be more correct ;)

Right, the NR_WRITEBACK_TEMP thing is a FUSE feature; it's used in
writable mmap() support for FUSE filesystems.

I must admit to forgetting the exact semantics of the thing; maybe
Miklos can remind us.

> Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>

Looks good here

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

> ----
>  page-writeback.c |  116 ++++++++++++++++++++-----------------------------------
>  1 file changed, 43 insertions(+), 73 deletions(-)

> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 81627eb..6f18e40 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c

> @@ -512,45 +485,12 @@ static void balance_dirty_pages(struct address_space *mapping)
>  		};
>  
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
> +				 &bdi_thresh, bdi);
>  
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +			global_page_state(NR_UNSTABLE_NFS);
> +		nr_writeback = global_page_state(NR_WRITEBACK) +
> +			global_page_state(NR_WRITEBACK_TEMP);
>  
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -570,16 +510,48 @@ static void balance_dirty_pages(struct address_space *mapping)
>  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
>  		}
>  
> +		/* always throttle if over threshold */
> +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> +
> +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> +				break;
> +
> +			/*
> +			 * Throttle it only when the background writeback cannot
> +			 * catch-up. This avoids (excessively) small writeouts
> +			 * when the bdi limits are ramping up.
> +			 */
> +			if (nr_reclaimable + nr_writeback <
> +			    (background_thresh + dirty_thresh) / 2)
> +				break;
> +
> +			/* done enough? */
> +			if (pages_written >= write_chunk)
> +				break;
> +		}
> +		if (!bdi->dirty_exceeded)
> +			bdi->dirty_exceeded = 1;
>  
> +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 * Only move pages to writeback if this bdi is over its
> +		 * threshold otherwise wait until the disk writes catch
> +		 * up.
> +		 */
> +		if (bdi_nr_reclaimable > bdi_thresh) {
> +			writeback_inodes(&wbc);
> +			pages_written += write_chunk - wbc.nr_to_write;
> +			if (wbc.nr_to_write == 0)
> +				continue;
> +		}
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  	}
>  
>  	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> +	    bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
>  	if (writeback_in_progress(bdi))
> @@ -593,10 +565,8 @@ static void balance_dirty_pages(struct address_space *mapping)
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	if ((laptop_mode && pages_written) || (!laptop_mode &&
> +	     (nr_reclaimable > background_thresh)))
>  		pdflush_operation(background_writeout, 0);
>  }
>  
> 
> 



* Re: [RFC PATCH] mm: balance_dirty_pages. reduce calls to global_page_state to reduce cache references
From: Miklos Szeredi @ 2009-08-25 11:46 UTC
  To: a.p.zijlstra; +Cc: richard, akpm, chris.mason, linux-kernel, jens.axboe, miklos

On Fri, 21 Aug 2009, Peter Zijlstra wrote:
> > I have tried to retain the existing behavior as much as possible, but
> > have added NR_WRITEBACK_TEMP to nr_writeback. This counter was used in
> > clip_bdi_dirty_limit but not in balance_dirty_pages; grep suggests it
> > is only used by FUSE, but I haven't done any testing on that. It does
> > seem logical to count all the WRITEBACK pages when making the
> > throttling decisions, so this change should be more correct ;)
> 
> Right, the NR_WRITEBACK_TEMP thing is a FUSE feature; it's used in
> writable mmap() support for FUSE filesystems.
> 
> I must admit to forgetting the exact semantics of the thing; maybe
> Miklos can remind us.

I'll try: fuse is special because it needs writeback to be "very
asynchronous".  What I mean by this is that writing a dirty page may
block indefinitely and that shouldn't hold up unrelated filesystem or
memory operations.

To satisfy this, fuse copies contents of dirty pages over to
"temporary pages" and queues the write request with this temporary
page, not the original page-cache page.

This has two effects:

 - the page-cache page does not remain in "writeback" state but is
   cleaned immediately

 - the NR_WRITEBACK counter is not incremented for the duration of the
   writeback

The first one is important because vmscan and page migration do
wait_on_page_writeback() in some circumstances, which would block on
fuse writebacks.

The second one is important because vmscan will throttle writeout if
the NR_WRITEBACK counter goes over the dirty threshold
(throttle_vm_writeout).  There were long discussions about this one,
but in the end no one could say for certain how it works or why it is
important.  But NR_WRITEBACK_TEMP must not be counted there, otherwise
page scanning can deadlock with fuse filesystems.
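
In rough pseudo-C the temp-page trick described above looks something
like this; a loose sketch, not the actual fuse writepage code, and
fuse_queue_write() is a made-up name for queueing the request to
userspace:

	set_page_writeback(page);	/* normal writepage entry */
	tmp = alloc_page(GFP_NOFS);	/* the "temporary page" */
	copy_highpage(tmp, page);	/* snapshot the dirty data */
	inc_zone_page_state(tmp, NR_WRITEBACK_TEMP); /* not NR_WRITEBACK */
	end_page_writeback(page);	/* page-cache page is clean again
					 * almost immediately */
	fuse_queue_write(req, tmp);	/* userspace may take arbitrarily
					 * long to complete this */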

About balance_dirty_pages() I'm not quite sure.  By a similar logic we
don't want NR_WRITEBACK_TEMP pages to contribute to throttling
unrelated filesystem writebacks.  That might (through recursion via a
userspace filesystem) lead to a deadlock.

So my recommendation is that we should retain the old behavior.

Thanks,
Miklos


* Re: [RFC PATCH] mm: balance_dirty_pages. reduce calls to global_page_state to reduce cache references
From: Richard Kennedy @ 2009-08-26 17:05 UTC
  To: Miklos Szeredi; +Cc: a.p.zijlstra, akpm, chris.mason, linux-kernel, jens.axboe

On Tue, 2009-08-25 at 13:46 +0200, Miklos Szeredi wrote:
> On Fri, 21 Aug 2009, Peter Zijlstra wrote:
> > > I have tried to retain the existing behavior as much as possible,
> > > but have added NR_WRITEBACK_TEMP to nr_writeback. This counter was
> > > used in clip_bdi_dirty_limit but not in balance_dirty_pages; grep
> > > suggests it is only used by FUSE, but I haven't done any testing on
> > > that. It does seem logical to count all the WRITEBACK pages when
> > > making the throttling decisions, so this change should be more
> > > correct ;)
> > 
> > Right, the NR_WRITEBACK_TEMP thing is a FUSE feature; it's used in
> > writable mmap() support for FUSE filesystems.
> > 
> > I must admit to forgetting the exact semantics of the thing; maybe
> > Miklos can remind us.
> 
> I'll try: fuse is special because it needs writeback to be "very
> asynchronous".  What I mean by this is that writing a dirty page may
> block indefinitely and that shouldn't hold up unrelated filesystem or
> memory operations.
> 
> To satisfy this, fuse copies contents of dirty pages over to
> "temporary pages" and queues the write request with this temporary
> page, not the original page-cache page.
> 
> This has two effects:
> 
>  - the page-cache page does not remain in "writeback" state but is
>    cleaned immediately
> 
>  - the NR_WRITEBACK counter is not incremented for the duration of the
>    writeback
> 
> The first one is important because vmscan and page migration do
> wait_on_page_writeback() in some circumstances, which would block on
> fuse writebacks.
> 
> The second one is important because vmscan will throttle writeout if
> the NR_WRITEBACK counter goes over the dirty threshold
> (throttle_vm_writeout).  There were long discussions about this one,
> but in the end no one could say for certain how it works or why it is
> important.  But NR_WRITEBACK_TEMP must not be counted there, otherwise
> page scanning can deadlock with fuse filesystems.
> 
> About balance_dirty_pages() I'm not quite sure.  By a similar logic we
> don't want NR_WRITEBACK_TEMP pages to contribute to throttling
> unrelated filesystem writebacks.  That might (through recursion via a
> userspace filesystem) lead to a deadlock.
> 
> So my recommendation is that we should retain the old behavior.
> 
> Thanks,
> Miklos

Thanks for the explanation.

The old behavior was to use NR_WRITEBACK_TEMP to clip the bdi
thresholds.
Should we continue to do this, or just ignore NR_WRITEBACK_TEMP
completely in balance_dirty_pages?

regards
Richard



* Re: [RFC PATCH] mm: balance_dirty_pages. reduce calls to global_page_state to reduce cache references
From: Miklos Szeredi @ 2009-08-27  9:21 UTC
  To: richard; +Cc: miklos, a.p.zijlstra, akpm, chris.mason, linux-kernel, jens.axboe

On Wed, 26 Aug 2009, Richard Kennedy wrote:
> The old behavior was to use NR_WRITEBACK_TEMP to clip the bdi
> thresholds.

Right.

> Should we continue to do this, or just ignore NR_WRITEBACK_TEMP
> completely in balance_dirty_pages?

I think we should continue.  Using NR_WRITEBACK_TEMP for setting up
per-bdi thresholds should be OK.  It limits the total number of
writeback pages (including fuse ones) but doesn't prevent individual
bdis from continuing writeback at their own pace.

Thanks,
Miklos

