linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] dirty throttling bits for 3.3
@ 2011-11-21 13:03 Wu Fengguang
  2011-11-21 13:03 ` [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-21 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

Hi,

I'd like to push these 5 dirty throttling improvements to linux-next,
targeting for inclusion in Linux 3.3.

It's a comfortably small series for review ;) There are no changes since
the last v12 post. (I went through them once again but found nothing to change.)

 [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
 [PATCH 2/5] writeback: charge leaked page dirties to active tasks
 [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
 [PATCH 4/5] writeback: fix dirtied pages accounting on redirty
 [PATCH 5/5] writeback: dirty ratelimit - think time compensation

 include/linux/sched.h            |    1 
 include/linux/writeback.h        |    4 +
 include/trace/events/writeback.h |   14 ++++--
 kernel/exit.c                    |    2 
 kernel/fork.c                    |    1 
 mm/page-writeback.c              |   64 ++++++++++++++++++++++++++---
 6 files changed, 77 insertions(+), 9 deletions(-)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
@ 2011-11-21 13:03 ` Wu Fengguang
  2011-11-21 22:50   ` Jan Kara
  2011-11-21 13:03 ` [PATCH 2/5] writeback: charge leaked page dirties to active tasks Wu Fengguang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-21 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Wu Fengguang, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: ref-bw-up-bound --]
[-- Type: text/plain, Size: 970 bytes --]

Add an upper limit to balanced_rate according to the inequality below.
This filters out some rare but huge singular points, which at least
makes the gnuplot figures more readable.

When there are N dd dirtiers,

	balanced_dirty_ratelimit = write_bw / N

So it holds that

	balanced_dirty_ratelimit <= write_bw
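
As an illustration of the clamp, here is a userspace Python sketch of the
arithmetic (not kernel code; the helper name is made up, the real
computation lives in bdi_update_dirty_ratelimit()):

```python
def balanced_ratelimit(task_ratelimit, write_bw, dirty_rate):
    # Model of the kernel's estimate; `| 1` avoids division by zero,
    # just like `dirty_rate | 1` in bdi_update_dirty_ratelimit().
    rate = task_ratelimit * write_bw // (dirty_rate | 1)
    # The new upper bound: balanced_dirty_ratelimit <= write_bw.
    return min(rate, write_bw)
```

A near-zero sampled dirty_rate would otherwise blow the quotient up into
a huge singular point; the clamp caps it at write_bw.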

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++++
 1 file changed, 5 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:18:03.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:18:23.000000000 +0800
@@ -804,6 +804,11 @@ static void bdi_update_dirty_ratelimit(s
 	 */
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
+	/*
+	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
+	 */
+	if (unlikely(balanced_dirty_ratelimit > write_bw))
+		balanced_dirty_ratelimit = write_bw;
 
 	/*
 	 * We could safely do this and return immediately:




* [PATCH 2/5] writeback: charge leaked page dirties to active tasks
  2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
  2011-11-21 13:03 ` [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
@ 2011-11-21 13:03 ` Wu Fengguang
  2011-11-21 21:49   ` Andrew Morton
  2011-11-21 13:03 ` [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-21 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-save-leaks-at-exit.patch --]
[-- Type: text/plain, Size: 2568 bytes --]

It's a years-long problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) may starve long-running dirtiers
(eg. dd) as well as push the dirty pages to the global hard limit.

The solution is to charge the pages dirtied by the exited gcc to the
other tasks that happen to be dirtying pages. This is not perfect, but it
should behave well enough in practice: throttled tasks aren't actually
running, so the tasks that are running are more likely to pick up the
leaked dirties and get throttled, promoting an even spread.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 kernel/exit.c             |    2 ++
 mm/page-writeback.c       |   12 ++++++++++++
 3 files changed, 16 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-11-17 20:57:02.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-11-17 20:57:12.000000000 +0800
@@ -7,6 +7,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+DECLARE_PER_CPU(int, dirty_leaks);
+
 /*
  * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
  *
--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:04.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:57:13.000000000 +0800
@@ -1194,6 +1194,7 @@ void set_page_dirty_balance(struct page 
 }
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
+DEFINE_PER_CPU(int, dirty_leaks) = 0;
 
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
@@ -1242,6 +1243,17 @@ void balance_dirty_pages_ratelimited_nr(
 			ratelimit = 0;
 		}
 	}
+	/*
+	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
+	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
+	 * the dirty throttling and livelocking other long-running dirtiers.
+	 */
+	p = &__get_cpu_var(dirty_leaks);
+	if (*p > 0 && current->nr_dirtied < ratelimit) {
+		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
+		*p -= nr_pages_dirtied;
+		current->nr_dirtied += nr_pages_dirtied;
+	}
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
--- linux-next.orig/kernel/exit.c	2011-11-17 20:57:02.000000000 +0800
+++ linux-next/kernel/exit.c	2011-11-17 20:57:04.000000000 +0800
@@ -1037,6 +1037,8 @@ NORET_TYPE void do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	preempt_disable();
+	if (tsk->nr_dirtied)
+		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;




* [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
  2011-11-21 13:03 ` [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
  2011-11-21 13:03 ` [PATCH 2/5] writeback: charge leaked page dirties to active tasks Wu Fengguang
@ 2011-11-21 13:03 ` Wu Fengguang
  2011-11-22  0:11   ` Jan Kara
  2011-11-21 13:03 ` [PATCH 4/5] writeback: fix dirtied pages accounting on redirty Wu Fengguang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-21 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Wu Fengguang, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-accurate-task-dirtied.patch --]
[-- Type: text/plain, Size: 1088 bytes --]

When dd writes in 512-byte chunks, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same 4KB page, but
obviously the page is only dirtied once.

Fix it by accounting nr_dirtied at page dirty time.
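
The mismatch is easy to see in a quick model (assuming 4KB pages; the
helper name is made up for illustration):

```python
PAGE_SIZE = 4096  # assumed page size

def calls_vs_pages(write_size, total_bytes):
    # Before the fix: nr_dirtied grew once per
    # balance_dirty_pages_ratelimited() call, i.e. once per write.
    calls = total_bytes // write_size
    # After the fix: nr_dirtied counts actual page dirties, once per page.
    pages = -(-total_bytes // PAGE_SIZE)  # ceiling division
    return calls, pages
```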

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:13.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:57:15.000000000 +0800
@@ -1224,8 +1224,6 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1734,6 +1732,7 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);




* [PATCH 4/5] writeback: fix dirtied pages accounting on redirty
  2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-11-21 13:03 ` [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
@ 2011-11-21 13:03 ` Wu Fengguang
  2011-11-21 21:51   ` Andrew Morton
  2011-11-21 13:03 ` [PATCH 5/5] writeback: dirty ratelimit - think time compensation Wu Fengguang
  2011-11-23 12:44 ` [PATCH 0/5] dirty throttling bits for 3.3 Peter Zijlstra
  5 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-21 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Wu Fengguang, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-account-redirty --]
[-- Type: text/plain, Size: 2080 bytes --]

De-account the accumulated dirty counters on page redirty.

Page redirties (very common in ext4) would otherwise introduce a mismatch
between counters (a) and (b):

a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN

The mismatch would introduce systematic errors into balanced_rate and
result in dirty page position errors (ie. the dirty pages would no longer
be balanced around the global/bdi setpoints).

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 mm/page-writeback.c       |   12 ++++++++++++
 2 files changed, 14 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:15.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:57:16.000000000 +0800
@@ -1792,6 +1792,17 @@ int __set_page_dirty_nobuffers(struct pa
 }
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
+void account_page_redirty(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		current->nr_dirtied--;
+		dec_zone_page_state(page, NR_DIRTIED);
+		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+	}
+}
+EXPORT_SYMBOL(account_page_redirty);
+
 /*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
@@ -1800,6 +1811,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers
 int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
 {
 	wbc->pages_skipped++;
+	account_page_redirty(page);
 	return __set_page_dirty_nobuffers(page);
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
--- linux-next.orig/include/linux/writeback.h	2011-11-17 20:57:12.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-11-17 20:57:16.000000000 +0800
@@ -197,6 +197,8 @@ void writeback_set_ratelimit(void);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 
+void account_page_redirty(struct page *page);
+
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
 				   read-only. */




* [PATCH 5/5] writeback: dirty ratelimit - think time compensation
  2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-11-21 13:03 ` [PATCH 4/5] writeback: fix dirtied pages accounting on redirty Wu Fengguang
@ 2011-11-21 13:03 ` Wu Fengguang
  2011-11-23 12:44 ` [PATCH 0/5] dirty throttling bits for 3.3 Peter Zijlstra
  5 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-21 13:03 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Wu Fengguang, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: think-time-compensation --]
[-- Type: text/plain, Size: 7782 bytes --]

Compensate for the task's think time when computing the final pause time,
so that ->dirty_ratelimit can be enforced accurately.

        think time := time spent outside of balance_dirty_pages()

In the rare case that the task slept longer than the 200ms period time
(resulting in a negative pause time), the sleep time will be compensated
for in the following periods, too, as long as it's less than 1 second.

Accumulated errors are carefully avoided as long as the max pause area
is not hit.

Pseudo code:

        period = pages_dirtied / task_ratelimit;
        think = jiffies - dirty_paused_when;
        pause = period - think;

1) normal case: period > think

        pause = period - think
        dirty_paused_when = jiffies + pause
        nr_dirtied = 0

                             period time
              |===============================>|
                  think time      pause time
              |===============>|==============>|
        ------|----------------|---------------|------------------------
        dirty_paused_when   jiffies


2) no pause case: period <= think

        don't pause; reduce future pause time by:
        dirty_paused_when += period
        nr_dirtied = 0

                           period time
              |===============================>|
                                  think time
              |===================================================>|
        ------|--------------------------------+-------------------|----
        dirty_paused_when                                       jiffies
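
Both cases collapse into one computation; a userspace Python sketch
(hypothetical names, with jiffies modeled as plain integers at HZ=1000):

```python
HZ = 1000  # model jiffies at 1ms resolution

def pause_with_think_time(pages_dirtied, task_ratelimit,
                          now, dirty_paused_when):
    # The period owed for pages_dirtied, minus the time already spent
    # outside of balance_dirty_pages() (the think time).
    period = HZ * pages_dirtied // task_ratelimit
    think = now - dirty_paused_when
    pause = period - think
    if pause > 0:
        # case 1: sleep the remainder of the period
        return pause, now + pause
    if pause < -HZ:
        # slept more than 1s beyond the period: just reset
        return 0, now
    # case 2: don't sleep; charge the period against virtual time
    return 0, dirty_paused_when + period
```

Returns (pause, new dirty_paused_when); in all branches nr_dirtied would
be reset to 0 afterwards.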

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h            |    1 
 include/trace/events/writeback.h |   14 ++++++++---
 kernel/fork.c                    |    1 
 mm/page-writeback.c              |   36 +++++++++++++++++++++++++----
 4 files changed, 45 insertions(+), 7 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-11-17 20:12:22.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-11-17 20:12:35.000000000 +0800
@@ -1527,6 +1527,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:12:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:12:38.000000000 +0800
@@ -1010,6 +1010,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	long period;
 	long pause = 0;
 	long uninitialized_var(max_pause);
 	bool dirty_exceeded = false;
@@ -1020,6 +1021,8 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		unsigned long now = jiffies;
+
 		/*
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
@@ -1039,8 +1042,11 @@ static void balance_dirty_pages(struct a
 		 */
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
-		if (nr_dirty <= freerun)
+		if (nr_dirty <= freerun) {
+			current->dirty_paused_when = now;
+			current->nr_dirtied = 0;
 			break;
+		}
 
 		if (unlikely(!writeback_in_progress(bdi)))
 			bdi_start_background_writeback(bdi);
@@ -1098,10 +1104,21 @@ static void balance_dirty_pages(struct a
 		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
 		if (unlikely(task_ratelimit == 0)) {
+			period = max_pause;
 			pause = max_pause;
 			goto pause;
 		}
-		pause = HZ * pages_dirtied / task_ratelimit;
+		period = HZ * pages_dirtied / task_ratelimit;
+		pause = period;
+		if (current->dirty_paused_when)
+			pause -= now - current->dirty_paused_when;
+		/*
+		 * For less than 1s think time (ext3/4 may block the dirtier
+		 * for up to 800ms from time to time on 1-HDD; so does xfs,
+		 * though much less frequently), try to compensate for it in
+		 * future periods by updating the virtual time; otherwise just
+		 * do a reset, as it may be a light dirtier.
+		 */
 		if (unlikely(pause <= 0)) {
 			trace_balance_dirty_pages(bdi,
 						  dirty_thresh,
@@ -1112,8 +1129,16 @@ static void balance_dirty_pages(struct a
 						  dirty_ratelimit,
 						  task_ratelimit,
 						  pages_dirtied,
+						  period,
 						  pause,
 						  start_time);
+			if (pause < -HZ) {
+				current->dirty_paused_when = now;
+				current->nr_dirtied = 0;
+			} else if (period) {
+				current->dirty_paused_when += period;
+				current->nr_dirtied = 0;
+			}
 			pause = 1; /* avoid resetting nr_dirtied_pause below */
 			break;
 		}
@@ -1129,11 +1154,15 @@ pause:
 					  dirty_ratelimit,
 					  task_ratelimit,
 					  pages_dirtied,
+					  period,
 					  pause,
 					  start_time);
 		__set_current_state(TASK_KILLABLE);
 		io_schedule_timeout(pause);
 
+		current->dirty_paused_when = now + pause;
+		current->nr_dirtied = 0;
+
 		/*
 		 * This is typically equal to (nr_dirty < dirty_thresh) and can
 		 * also keep "1000+ dd on a slow USB stick" under control.
@@ -1148,11 +1177,10 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied = 0;
 	if (pause == 0) { /* in freerun area */
 		current->nr_dirtied_pause =
 				dirty_poll_interval(nr_dirty, dirty_thresh);
-	} else if (pause <= max_pause / 4 &&
+	} else if (period <= max_pause / 4 &&
 		   pages_dirtied >= current->nr_dirtied_pause) {
 		current->nr_dirtied_pause = clamp_val(
 					dirty_ratelimit * (max_pause / 2) / HZ,
--- linux-next.orig/kernel/fork.c	2011-11-17 20:12:23.000000000 +0800
+++ linux-next/kernel/fork.c	2011-11-17 20:12:35.000000000 +0800
@@ -1296,6 +1296,7 @@ static struct task_struct *copy_process(
 
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+	p->dirty_paused_when = 0;
 
 	/*
 	 * Ok, make it visible to the rest of the system.
--- linux-next.orig/include/trace/events/writeback.h	2011-11-17 19:13:41.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-11-17 20:12:35.000000000 +0800
@@ -289,12 +289,13 @@ TRACE_EVENT(balance_dirty_pages,
 		 unsigned long dirty_ratelimit,
 		 unsigned long task_ratelimit,
 		 unsigned long dirtied,
+		 unsigned long period,
 		 long pause,
 		 unsigned long start_time),
 
 	TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
 		dirty_ratelimit, task_ratelimit,
-		dirtied, pause, start_time),
+		dirtied, period, pause, start_time),
 
 	TP_STRUCT__entry(
 		__array(	 char,	bdi, 32)
@@ -309,6 +310,8 @@ TRACE_EVENT(balance_dirty_pages,
 		__field(unsigned int,	dirtied_pause)
 		__field(unsigned long,	paused)
 		__field(	 long,	pause)
+		__field(unsigned long,	period)
+		__field(	 long,	think)
 	),
 
 	TP_fast_assign(
@@ -325,6 +328,9 @@ TRACE_EVENT(balance_dirty_pages,
 		__entry->task_ratelimit	= KBps(task_ratelimit);
 		__entry->dirtied	= dirtied;
 		__entry->dirtied_pause	= current->nr_dirtied_pause;
+		__entry->think		= current->dirty_paused_when == 0 ? 0 :
+			 (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
+		__entry->period		= period * 1000 / HZ;
 		__entry->pause		= pause * 1000 / HZ;
 		__entry->paused		= (jiffies - start_time) * 1000 / HZ;
 	),
@@ -335,7 +341,7 @@ TRACE_EVENT(balance_dirty_pages,
 		  "bdi_setpoint=%lu bdi_dirty=%lu "
 		  "dirty_ratelimit=%lu task_ratelimit=%lu "
 		  "dirtied=%u dirtied_pause=%u "
-		  "paused=%lu pause=%ld",
+		  "paused=%lu pause=%ld period=%lu think=%ld",
 		  __entry->bdi,
 		  __entry->limit,
 		  __entry->setpoint,
@@ -347,7 +353,9 @@ TRACE_EVENT(balance_dirty_pages,
 		  __entry->dirtied,
 		  __entry->dirtied_pause,
 		  __entry->paused,	/* ms */
-		  __entry->pause	/* ms */
+		  __entry->pause,	/* ms */
+		  __entry->period,	/* ms */
+		  __entry->think	/* ms */
 	  )
 );
 




* Re: [PATCH 2/5] writeback: charge leaked page dirties to active tasks
  2011-11-21 13:03 ` [PATCH 2/5] writeback: charge leaked page dirties to active tasks Wu Fengguang
@ 2011-11-21 21:49   ` Andrew Morton
  2011-11-21 23:46     ` Jan Kara
  2011-11-22 13:35     ` Wu Fengguang
  0 siblings, 2 replies; 29+ messages in thread
From: Andrew Morton @ 2011-11-21 21:49 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig, LKML

On Mon, 21 Nov 2011 21:03:44 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> It's a years long problem that a large number of short-lived dirtiers
> (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> (eg. dd) as well as pushing the dirty pages to the global hard limit.
> 
> The solution is to charge the pages dirtied by the exited gcc to the
> other random dirtying tasks. It sounds not perfect, however should
> behave good enough in practice, seeing as that throttled tasks aren't
> actually running so those that are running are more likely to pick it up
> and get throttled, therefore promoting an equal spread.
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:04.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-17 20:57:13.000000000 +0800
> @@ -1194,6 +1194,7 @@ void set_page_dirty_balance(struct page 
>  }
>  
>  static DEFINE_PER_CPU(int, bdp_ratelimits);
> +DEFINE_PER_CPU(int, dirty_leaks) = 0;

This is a poor identifier for a global symbol.  Generally such symbols
should at least identify what subsystem they belong to.

Also, this would be a good site at which to document the global
symbol's role.  The writeback code needs a lot of documentation, of
the design-level kind.

>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> @@ -1242,6 +1243,17 @@ void balance_dirty_pages_ratelimited_nr(
>  			ratelimit = 0;
>  		}
>  	}
> +	/*
> +	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> +	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
> +	 * the dirty throttling and livelock other long-run dirtiers.
> +	 */
> +	p = &__get_cpu_var(dirty_leaks);
> +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +		*p -= nr_pages_dirtied;
> +		current->nr_dirtied += nr_pages_dirtied;
> +	}
>  	preempt_enable();
>  
>  	if (unlikely(current->nr_dirtied >= ratelimit))
> --- linux-next.orig/kernel/exit.c	2011-11-17 20:57:02.000000000 +0800
> +++ linux-next/kernel/exit.c	2011-11-17 20:57:04.000000000 +0800
> @@ -1037,6 +1037,8 @@ NORET_TYPE void do_exit(long code)
>  	validate_creds_for_do_exit(tsk);
>  
>  	preempt_disable();
> +	if (tsk->nr_dirtied)
> +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);

Whatever problem this code is solving, it only solved it in certain
cases.  For example, if tasks are forking, dirtying and exiting at a
rapid rate on CPU 0 then all the other CPUs don't know anything about
this and we didn't fix anything.

IOW, it all seems like a half-assed bandaid.


* Re: [PATCH 4/5] writeback: fix dirtied pages accounting on redirty
  2011-11-21 13:03 ` [PATCH 4/5] writeback: fix dirtied pages accounting on redirty Wu Fengguang
@ 2011-11-21 21:51   ` Andrew Morton
  2011-11-22 13:59     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2011-11-21 21:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig, LKML

On Mon, 21 Nov 2011 21:03:46 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> De-account the accumulative dirty counters on page redirty.
> 
> Page redirties (very common in ext4) will introduce mismatch between
> counters (a) and (b)
> 
> a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
> b) NR_WRITTEN, BDI_WRITTEN
> 
> This will introduce systematic errors in balanced_rate and result in
> dirty page position errors (ie. the dirty pages are no longer balanced
> around the global/bdi setpoints).
> 
> ...
>
> --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:15.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-17 20:57:16.000000000 +0800
> @@ -1792,6 +1792,17 @@ int __set_page_dirty_nobuffers(struct pa
>  }
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
> +void account_page_redirty(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	if (mapping && mapping_cap_account_dirty(mapping)) {
> +		current->nr_dirtied--;
> +		dec_zone_page_state(page, NR_DIRTIED);
> +		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> +	}
> +}

Again, writeback doesn't seem to be the best place to be adding
undocumented code.

> +EXPORT_SYMBOL(account_page_redirty);

The export is unneeded?




* Re: [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-21 13:03 ` [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
@ 2011-11-21 22:50   ` Jan Kara
  2011-11-22  6:41     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-11-21 22:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 21-11-11 21:03:43, Wu Fengguang wrote:
> Add an upper limit to balanced_rate according to the below inequality.
> This filters out some rare but huge singular points, which at least
> enables more readable gnuplot figures.
> 
> When there are N dd dirtiers,
> 
> 	balanced_dirty_ratelimit = write_bw / N
> 
> So it holds that
> 
> 	balanced_dirty_ratelimit <= write_bw
  The change makes sense, but do we understand why there are such huge
singular points? Are they due to errors in estimation of bandwidth or due
to errors in dirtying rate computations (e.g. due to truncates), or
something else?

								Honza

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:18:03.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-17 20:18:23.000000000 +0800
> @@ -804,6 +804,11 @@ static void bdi_update_dirty_ratelimit(s
>  	 */
>  	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
>  					   dirty_rate | 1);
> +	/*
> +	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
> +	 */
> +	if (unlikely(balanced_dirty_ratelimit > write_bw))
> +		balanced_dirty_ratelimit = write_bw;
>  
>  	/*
>  	 * We could safely do this and return immediately:
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/5] writeback: charge leaked page dirties to active tasks
  2011-11-21 21:49   ` Andrew Morton
@ 2011-11-21 23:46     ` Jan Kara
  2011-11-22 13:35     ` Wu Fengguang
  1 sibling, 0 replies; 29+ messages in thread
From: Jan Kara @ 2011-11-21 23:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, linux-fsdevel, Jan Kara, Peter Zijlstra,
	Christoph Hellwig, LKML

On Mon 21-11-11 13:49:29, Andrew Morton wrote:
> On Mon, 21 Nov 2011 21:03:44 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > It's a years long problem that a large number of short-lived dirtiers
> > (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> > (eg. dd) as well as pushing the dirty pages to the global hard limit.
> > 
> > The solution is to charge the pages dirtied by the exited gcc to the
> > other random dirtying tasks. It sounds not perfect, however should
> > behave good enough in practice, seeing as that throttled tasks aren't
> > actually running so those that are running are more likely to pick it up
> > and get throttled, therefore promoting an equal spread.
> > 
> > --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:04.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-11-17 20:57:13.000000000 +0800
> > @@ -1194,6 +1194,7 @@ void set_page_dirty_balance(struct page 
> >  }
> >  
> >  static DEFINE_PER_CPU(int, bdp_ratelimits);
> > +DEFINE_PER_CPU(int, dirty_leaks) = 0;
> 
> This is a poor identifier for a global symbol.  Generally such symbols
> should at least identify what subsystem they belong to.
> 
> Also, this would be a good site at whcih to document the global
> symbol's role.  The writeback code needs a lot of documentation. Of
> the design-level kind.
> 
> >  /**
> >   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
> > @@ -1242,6 +1243,17 @@ void balance_dirty_pages_ratelimited_nr(
> >  			ratelimit = 0;
> >  		}
> >  	}
> > +	/*
> > +	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> > +	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
> > +	 * the dirty throttling and livelock other long-run dirtiers.
> > +	 */
> > +	p = &__get_cpu_var(dirty_leaks);
> > +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> > +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> > +		*p -= nr_pages_dirtied;
> > +		current->nr_dirtied += nr_pages_dirtied;
> > +	}
> >  	preempt_enable();
> >  
> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > --- linux-next.orig/kernel/exit.c	2011-11-17 20:57:02.000000000 +0800
> > +++ linux-next/kernel/exit.c	2011-11-17 20:57:04.000000000 +0800
> > @@ -1037,6 +1037,8 @@ NORET_TYPE void do_exit(long code)
> >  	validate_creds_for_do_exit(tsk);
> >  
> >  	preempt_disable();
> > +	if (tsk->nr_dirtied)
> > +		__this_cpu_add(dirty_leaks, tsk->nr_dirtied);
> 
> Whatever problem this code is solving, it only solved it in certain
> cases.  For example, if tasks are forking, dirtying and exiting at a
> rapid rate on CPU 0 then all the other CPUs don't know anything about
> this and we didn't fix anything.
  I think it will work - dirty_leaks at CPU0 will get leaked pages from the
first task when it dies, the second task will pick them up and as soon as
it dies it puts all the pages back to dirty_leaks at CPU0. This goes on and
on and eventually, dirty_leaks will accumulate enough pages so that a task
dirtying even a single page will enter balance_dirty_pages() as if it
dirtied 'ratelimit' pages.
 
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-21 13:03 ` [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
@ 2011-11-22  0:11   ` Jan Kara
  2011-11-22  9:21     ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-11-22  0:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 21-11-11 21:03:45, Wu Fengguang wrote:
> When dd in 512bytes, generic_perform_write() calls
> balance_dirty_pages_ratelimited() 8 times for the same page, but
> obviously the page is only dirtied once.
> 
> Fix it by accounting nr_dirtied at page dirty time.
  Well, but after this change, the interface balance_dirty_pages_ratelimited_nr()
is strange because the argument is only used for per-CPU ratelimiting and
not for per-task ratelimiting... So if you do this switch then I'd also
switch bdp_ratelimits to get consistent results and a clean interface and
completely kill balance_dirty_pages_ratelimited_nr().

								Honza

> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:13.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-17 20:57:15.000000000 +0800
> @@ -1224,8 +1224,6 @@ void balance_dirty_pages_ratelimited_nr(
>  	if (bdi->dirty_exceeded)
>  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>  
> -	current->nr_dirtied += nr_pages_dirtied;
> -
>  	preempt_disable();
>  	/*
>  	 * This prevents one CPU to accumulate too many dirtied pages without
> @@ -1734,6 +1732,7 @@ void account_page_dirtied(struct page *p
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
>  		task_io_account_write(PAGE_CACHE_SIZE);
> +		current->nr_dirtied++;
>  	}
>  }
>  EXPORT_SYMBOL(account_page_dirtied);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-21 22:50   ` Jan Kara
@ 2011-11-22  6:41     ` Wu Fengguang
  2011-11-22 21:04       ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22  6:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

On Tue, Nov 22, 2011 at 06:50:49AM +0800, Jan Kara wrote:
> On Mon 21-11-11 21:03:43, Wu Fengguang wrote:
> > Add an upper limit to balanced_rate according to the below inequality.
> > This filters out some rare but huge singular points, which at least
> > enables more readable gnuplot figures.
> > 
> > When there are N dd dirtiers,
> > 
> > 	balanced_dirty_ratelimit = write_bw / N
> > 
> > So it holds that
> > 
> > 	balanced_dirty_ratelimit <= write_bw
>   The change makes sense, but do we understand why there are such huge
> singular points? Are they due to errors in estimation of bandwidth or due
> to errors in dirtying rate computations (e.g. due to truncates), or
> something else?

Good point. I'll add this to the changelog:

The singular points originate from dirty_rate in the below formula:

        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
where
        dirty_rate = (number of page dirties in the past 200ms) / 200ms

In the extreme case, if all dd tasks suddenly get blocked on something 
else and hence no pages are dirtied at all, dirty_rate will be 0 and
balanced_dirty_ratelimit will be inf. This could happen in reality.

There won't be tiny singular points though, as long as the dirty pages
lie inside the dirty control area (above the freerun region).
Because there the dd tasks will be throttled by balance_dirty_pages()
and won't be able to suddenly dirty many more pages than average.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22  0:11   ` Jan Kara
@ 2011-11-22  9:21     ` Wu Fengguang
  2011-11-22 12:21       ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22  9:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

On Tue, Nov 22, 2011 at 08:11:27AM +0800, Jan Kara wrote:
> On Mon 21-11-11 21:03:45, Wu Fengguang wrote:
> > When dd in 512bytes, generic_perform_write() calls
> > balance_dirty_pages_ratelimited() 8 times for the same page, but
> > obviously the page is only dirtied once.
> > 
> > Fix it by accounting nr_dirtied at page dirty time.
>   Well, but after this change, the interface balance_dirty_pages_ratelimited_nr()
> is strange because the argument is only used for per-CPU ratelimiting and
> not for per-task ratelimiting...

Yeah I was vaguely aware of this... and still chose to ignore it
since the patchset already looked forbiddingly large at the time ;)

> So if you do this switch then I'd also
> switch bdp_ratelimits to get consistent results and a clean interface and
> completely kill balance_dirty_pages_ratelimited_nr().

Following your suggestions to change ratelimiting as well :)

I'll do the interface change with a standalone patch.

Thanks,
Fengguang
---
Subject: writeback: fix dirtied pages accounting on sub-page writes
Date: Thu Apr 14 07:52:37 CST 2011

When dd in 512bytes, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same page, but
obviously the page is only dirtied once.

Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-11-22 16:59:48.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-22 17:12:20.000000000 +0800
@@ -1231,8 +1231,6 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1243,12 +1241,9 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-	else {
-		*p += nr_pages_dirtied;
-		if (unlikely(*p >= ratelimit_pages)) {
-			*p = 0;
-			ratelimit = 0;
-		}
+	else if (unlikely(*p >= ratelimit_pages)) {
+		*p = 0;
+		ratelimit = 0;
 	}
 	/*
 	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
@@ -1743,6 +1738,8 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
+		__get_cpu_var(bdp_ratelimits)++;
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22  9:21     ` Wu Fengguang
@ 2011-11-22 12:21       ` Jan Kara
  2011-11-22 12:30         ` Wu Fengguang
                           ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Jan Kara @ 2011-11-22 12:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Tue 22-11-11 17:21:11, Wu Fengguang wrote:
> On Tue, Nov 22, 2011 at 08:11:27AM +0800, Jan Kara wrote:
> > On Mon 21-11-11 21:03:45, Wu Fengguang wrote:
> > > When dd in 512bytes, generic_perform_write() calls
> > > balance_dirty_pages_ratelimited() 8 times for the same page, but
> > > obviously the page is only dirtied once.
> > > 
> > > Fix it by accounting nr_dirtied at page dirty time.
> >   Well, but after this change, the interface balance_dirty_pages_ratelimited_nr()
> > is strange because the argument is only used for per-CPU ratelimiting and
> > not for per-task ratelimiting...
> 
> Yeah I was vaguely aware of this... and still chose to ignore it
> since the patchset already looked forbiddingly large at the time ;)
> 
> > So if you do this switch then I'd also
> > switch bdp_ratelimits to get consistent results and a clean interface and
> > completely kill balance_dirty_pages_ratelimited_nr().
> 
> Following your suggestions to change ratelimiting as well :)
> 
> I'll do the interface change with a standalone patch.
  OK.

> ---
> Subject: writeback: fix dirtied pages accounting on sub-page writes
> Date: Thu Apr 14 07:52:37 CST 2011
> 
> When dd in 512bytes, generic_perform_write() calls
> balance_dirty_pages_ratelimited() 8 times for the same page, but
> obviously the page is only dirtied once.
> 
> Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-22 16:59:48.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-22 17:12:20.000000000 +0800
> @@ -1231,8 +1231,6 @@ void balance_dirty_pages_ratelimited_nr(
>  	if (bdi->dirty_exceeded)
>  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>  
> -	current->nr_dirtied += nr_pages_dirtied;
> -
>  	preempt_disable();
>  	/*
>  	 * This prevents one CPU to accumulate too many dirtied pages without
> @@ -1243,12 +1241,9 @@ void balance_dirty_pages_ratelimited_nr(
>  	p =  &__get_cpu_var(bdp_ratelimits);
>  	if (unlikely(current->nr_dirtied >= ratelimit))
>  		*p = 0;
> -	else {
> -		*p += nr_pages_dirtied;
> -		if (unlikely(*p >= ratelimit_pages)) {
> -			*p = 0;
> -			ratelimit = 0;
> -		}
> +	else if (unlikely(*p >= ratelimit_pages)) {
> +		*p = 0;
> +		ratelimit = 0;
>  	}
>  	/*
>  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> @@ -1743,6 +1738,8 @@ void account_page_dirtied(struct page *p
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
>  		task_dirty_inc(current);
>  		task_io_account_write(PAGE_CACHE_SIZE);
> +		current->nr_dirtied++;
> +		__get_cpu_var(bdp_ratelimits)++;
  I think you need preempt_disable() and preempt_enable() pair around
__get_cpu_var(). Otherwise a process could get rescheduled in the middle of
read-modify-write cycle...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 12:21       ` Jan Kara
@ 2011-11-22 12:30         ` Wu Fengguang
  2011-11-22 12:48           ` Jan Kara
  2011-11-22 12:57         ` Peter Zijlstra
  2011-11-28 13:51         ` Wu Fengguang
  2 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 12:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

> > @@ -1743,6 +1738,8 @@ void account_page_dirtied(struct page *p
> >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> >  		task_dirty_inc(current);
> >  		task_io_account_write(PAGE_CACHE_SIZE);
> > +		current->nr_dirtied++;
> > +		__get_cpu_var(bdp_ratelimits)++;
>   I think you need preempt_disable() and preempt_enable() pair around
> __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> read-modify-write cycle...

Hmm, I'm not worried about it at all, because bdp_ratelimits don't
need to be accurate. In normal cases it won't even trigger one single
call to balance_dirty_pages().

btw, account_page_dirtied() is called under a spinlock, will that be
sufficient?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 12:30         ` Wu Fengguang
@ 2011-11-22 12:48           ` Jan Kara
  2011-11-22 13:02             ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-11-22 12:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Tue 22-11-11 20:30:01, Wu Fengguang wrote:
> > > @@ -1743,6 +1738,8 @@ void account_page_dirtied(struct page *p
> > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> > >  		task_dirty_inc(current);
> > >  		task_io_account_write(PAGE_CACHE_SIZE);
> > > +		current->nr_dirtied++;
> > > +		__get_cpu_var(bdp_ratelimits)++;
> >   I think you need preempt_disable() and preempt_enable() pair around
> > __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> > read-modify-write cycle...
> 
> Hmm, I'm not worried about it at all, because bdp_ratelimits don't
> need to be accurate. In normal cases it won't even trigger one single
> call to balance_dirty_pages().
  I agree regarding the accuracy. But the CPU can change when the process
is scheduled again. So you could modify the counter of a CPU you are not
running on. And that can cause bad things...

> btw, account_page_dirtied() is called inside spinlock, will it be
> sufficient?
  Currently it is not enough in real-time kernels, and once the sleeping
spinlocks work gets merged it won't be enough even in standard kernels...
And in kernels where taking a spinlock means preemption is disabled,
preempt_disable/enable will be almost for free...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 12:21       ` Jan Kara
  2011-11-22 12:30         ` Wu Fengguang
@ 2011-11-22 12:57         ` Peter Zijlstra
  2011-11-22 13:07           ` Wu Fengguang
  2011-11-28 13:51         ` Wu Fengguang
  2 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2011-11-22 12:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, linux-fsdevel, Christoph Hellwig, Andrew Morton, LKML

On Tue, 2011-11-22 at 13:21 +0100, Jan Kara wrote:
> > +             __get_cpu_var(bdp_ratelimits)++;
>   I think you need preempt_disable() and preempt_enable() pair around
> __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> read-modify-write cycle... 

there's of course the this_cpu_inc(bdp_ratelimits); thing.

On x86 that'll turn into a single insn, on others it will add the
required preempt_disable/enable bits.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 12:48           ` Jan Kara
@ 2011-11-22 13:02             ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 13:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

On Tue, Nov 22, 2011 at 08:48:11PM +0800, Jan Kara wrote:
> On Tue 22-11-11 20:30:01, Wu Fengguang wrote:
> > > > @@ -1743,6 +1738,8 @@ void account_page_dirtied(struct page *p
> > > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> > > >  		task_dirty_inc(current);
> > > >  		task_io_account_write(PAGE_CACHE_SIZE);
> > > > +		current->nr_dirtied++;
> > > > +		__get_cpu_var(bdp_ratelimits)++;
> > >   I think you need preempt_disable() and preempt_enable() pair around
> > > __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> > > read-modify-write cycle...
> > 
> > Hmm, I'm not worried about it at all, because bdp_ratelimits don't
> > need to be accurate. In normal cases it won't even trigger one single
> > call to balance_dirty_pages().
>   I agree regarding the accuracy. But the CPU can change when the process
> is scheduled again. So you could modify the counter of a CPU you are not
> running on. And that can cause bad things...

Will modifying another CPU's per-cpu data lead to more serious
problems than inaccuracy? If not, it would be fine. bdp_ratelimits is
only meant to be a coarse-grained safeguard after all :-)

> > btw, account_page_dirtied() is called inside spinlock, will it be
> > sufficient?
>   Currently it is not enough in real-time kernels and when sleeping
> spinlocks work gets merged it won't be enough even in standard kernels...
> And in kernels where spinlock means preemption is disabled
> preempt_enable/disable will be almost for free...

I see, holding a spinlock indeed won't be a general superset of
preempt_disable/enable.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 12:57         ` Peter Zijlstra
@ 2011-11-22 13:07           ` Wu Fengguang
  2011-11-22 13:41             ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 13:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Andrew Morton, LKML

On Tue, Nov 22, 2011 at 08:57:42PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-11-22 at 13:21 +0100, Jan Kara wrote:
> > > +             __get_cpu_var(bdp_ratelimits)++;
> >   I think you need preempt_disable() and preempt_enable() pair around
> > __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> > read-modify-write cycle... 
> 
> there's of course the this_cpu_inc(bdp_ratelimits); thing.
> 
> On x86 that'll turn into a single insn, on others it will add the
> required preempt_disable/enable bits.

It's good to know that. But what if we don't really care which CPU's
data it's increasing, and can accept losing some increases due to the
resulting race condition?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/5] writeback: charge leaked page dirties to active tasks
  2011-11-21 21:49   ` Andrew Morton
  2011-11-21 23:46     ` Jan Kara
@ 2011-11-22 13:35     ` Wu Fengguang
  1 sibling, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 13:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig, LKML

On Tue, Nov 22, 2011 at 05:49:29AM +0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 21:03:44 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > It's a years long problem that a large number of short-lived dirtiers
> > (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> > (eg. dd) as well as pushing the dirty pages to the global hard limit.
> > 
> > The solution is to charge the pages dirtied by the exited gcc to the
> > other random dirtying tasks. It sounds imperfect, however it should
> > behave well enough in practice, seeing as the throttled tasks aren't
> > actually running, so those that are running are more likely to pick it up
> > and get throttled, therefore promoting an equal spread.
> > 
> > --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:04.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-11-17 20:57:13.000000000 +0800
> > @@ -1194,6 +1194,7 @@ void set_page_dirty_balance(struct page 
> >  }
> >  
> >  static DEFINE_PER_CPU(int, bdp_ratelimits);
> > +DEFINE_PER_CPU(int, dirty_leaks) = 0;
> 
> This is a poor identifier for a global symbol.  Generally such symbols
> should at least identify what subsystem they belong to.

Yes it is, "dirty_throttle_leaks" should look better.

> Also, this would be a good site at whcih to document the global
> symbol's role.  The writeback code needs a lot of documentation. Of
> the design-level kind.

Agreed, I just added this comment:

/*
 * Normal tasks are throttled by
 *      loop {
 *              dirty tsk->nr_dirtied_pause pages;
 *              take a snap in balance_dirty_pages();
 *      }
 * However there is a worst case. If every task exits immediately after
 * dirtying (tsk->nr_dirtied_pause - 1) pages, balance_dirty_pages() will never be
 * called to throttle the page dirties. The solution is to save the not yet
 * throttled page dirties in dirty_throttle_leaks on task exit and charge them
 * randomly into the running tasks. This works well for the above worst case,
 * as the new task will pick up and accumulate the old task's leaked dirty
 * count and eventually get throttled.
 */

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 13:07           ` Wu Fengguang
@ 2011-11-22 13:41             ` Wu Fengguang
  2011-11-22 13:53               ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Andrew Morton, LKML

On Tue, Nov 22, 2011 at 09:07:50PM +0800, Wu Fengguang wrote:
> On Tue, Nov 22, 2011 at 08:57:42PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-11-22 at 13:21 +0100, Jan Kara wrote:
> > > > +             __get_cpu_var(bdp_ratelimits)++;
> > >   I think you need preempt_disable() and preempt_enable() pair around
> > > __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> > > read-modify-write cycle... 
> > 
> > there's of course the this_cpu_inc(bdp_ratelimits); thing.
> > 
> > On x86 that'll turn into a single insn, on others it will add the
> > required preempt_disable/enable bits.
> 
> It's good to know that. But what if we don't really care which CPU's
> data it's increasing, and can accept losing some increases due to the
> resulting race condition?

I just added a comment for it, hope it helps :)

                /*
                 * This is racy, however bdp_ratelimits merely serves as a
                 * gross safeguard. We don't really care which CPU it's
                 * charging to, and the resulting inaccuracy is acceptable.
                 */
                __get_cpu_var(bdp_ratelimits)++;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 13:41             ` Wu Fengguang
@ 2011-11-22 13:53               ` Peter Zijlstra
  2011-11-22 14:11                 ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2011-11-22 13:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Andrew Morton, LKML,
	Linus Torvalds, paulmck

On Tue, 2011-11-22 at 21:41 +0800, Wu Fengguang wrote:
> On Tue, Nov 22, 2011 at 09:07:50PM +0800, Wu Fengguang wrote:
> > On Tue, Nov 22, 2011 at 08:57:42PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-11-22 at 13:21 +0100, Jan Kara wrote:
> > > > > +             __get_cpu_var(bdp_ratelimits)++;
> > > >   I think you need preempt_disable() and preempt_enable() pair around
> > > > __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> > > > read-modify-write cycle... 
> > > 
> > > there's of course the this_cpu_inc(bdp_ratelimits); thing.
> > > 
> > > On x86 that'll turn into a single insn, on others it will add the
> > > required preempt_disable/enable bits.
> > 
> > It's good to know that. But what if we don't really care which CPU's
> > data it's increasing, and can accept losing some increases due to the
> > resulting race condition?
> 
> I just added a comment for it, hope it helps :)
> 
>                 /*
>                  * This is racy, however bdp_ratelimits merely serves as a
>                  * gross safeguard. We don't really care which CPU it's
>                  * charging to, and the resulting inaccuracy is acceptable.
>                  */
>                 __get_cpu_var(bdp_ratelimits)++;

Thing is, I'm not sure how many updates you can effectively wreck by
interleaving the RmW cycles of two CPUs like this.

Simply losing a few increments would be fine, but what are the
practical implications of actually relying on this behaviour, and how do
various architectures cope?



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 4/5] writeback: fix dirtied pages accounting on redirty
  2011-11-21 21:51   ` Andrew Morton
@ 2011-11-22 13:59     ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 13:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig, LKML,
	Chris Mason

On Tue, Nov 22, 2011 at 05:51:46AM +0800, Andrew Morton wrote:
> On Mon, 21 Nov 2011 21:03:46 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > De-account the accumulative dirty counters on page redirty.
> > 
> > Page redirties (very common in ext4) will introduce mismatch between
> > counters (a) and (b)
> > 
> > a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
> > b) NR_WRITTEN, BDI_WRITTEN
> > 
> > This will introduce systematic errors in balanced_rate and result in
> > dirty page position errors (ie. the dirty pages are no longer balanced
> > around the global/bdi setpoints).
> > 
> > ...
> >
> > --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:57:15.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-11-17 20:57:16.000000000 +0800
> > @@ -1792,6 +1792,17 @@ int __set_page_dirty_nobuffers(struct pa
> >  }
> >  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
> >  
> > +void account_page_redirty(struct page *page)
> > +{
> > +	struct address_space *mapping = page->mapping;
> > +	if (mapping && mapping_cap_account_dirty(mapping)) {
> > +		current->nr_dirtied--;
> > +		dec_zone_page_state(page, NR_DIRTIED);
> > +		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> > +	}
> > +}
> 
> Again, writeback doesn't seem to be the best place to be adding
> undocumented code.

Sorry, hope this comment can explain when to call the function and why.

/*
 * Call this whenever redirtying a page, to de-account the dirty counters
 * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written
 * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to
 * systematic errors in balanced_dirty_ratelimit and the dirty pages position
 * control.
 */

> > +EXPORT_SYMBOL(account_page_redirty);
> 
> The export is unneeded?

btrfs will need to call into it. Here is the patch.  I'll include it
in the v2 series.

Thanks,
Fengguang
---

Subject: btrfs: fix dirtied pages accounting on sub-page writes
Date: Mon Aug 08 15:19:47 CST 2011

When doing 1KB sequential writes to the same page,
balance_dirty_pages_ratelimited_nr() should be called once instead of 4
times; the latter throttles the dirtier tasks much too heavily.

Fix it with proper de-accounting on clear_page_dirty_for_io().

CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/file.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/btrfs/file.c	2011-11-17 20:13:47.000000000 +0800
+++ linux-next/fs/btrfs/file.c	2011-11-17 20:18:51.000000000 +0800
@@ -1136,7 +1136,8 @@ again:
 				     GFP_NOFS);
 	}
 	for (i = 0; i < num_pages; i++) {
-		clear_page_dirty_for_io(pages[i]);
+		if (clear_page_dirty_for_io(pages[i]))
+			account_page_redirty(pages[i]);
 		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
 	}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 13:53               ` Peter Zijlstra
@ 2011-11-22 14:11                 ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-22 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-fsdevel, Christoph Hellwig, Andrew Morton, LKML,
	Linus Torvalds, paulmck

On Tue, Nov 22, 2011 at 09:53:19PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-11-22 at 21:41 +0800, Wu Fengguang wrote:
> > On Tue, Nov 22, 2011 at 09:07:50PM +0800, Wu Fengguang wrote:
> > > On Tue, Nov 22, 2011 at 08:57:42PM +0800, Peter Zijlstra wrote:
> > > > On Tue, 2011-11-22 at 13:21 +0100, Jan Kara wrote:
> > > > > > +             __get_cpu_var(bdp_ratelimits)++;
> > > > >   I think you need preempt_disable() and preempt_enable() pair around
> > > > > __get_cpu_var(). Otherwise a process could get rescheduled in the middle of
> > > > > read-modify-write cycle... 
> > > > 
> > > > there's of course the this_cpu_inc(bdp_ratelimits); thing.
> > > > 
> > > > On x86 that'll turn into a single insn, on others it will add the
> > > > required preempt_disable/enable bits.
> > > 
> > > It's good to know that. But what if we don't really care which CPU's
> > > data it's increasing, and can accept losing some increases due to the
> > > resulting race condition?
> > 
> > I just added a comment for it, hope it helps :)
> > 
> >                 /*
> >                  * This is racy, however bdp_ratelimits merely serves as a
> >                  * gross safeguard. We don't really care which CPU it's
> >                  * charging to, and the resulting inaccuracy is acceptable.
> >                  */
> >                 __get_cpu_var(bdp_ratelimits)++;
> 
> Thing is, I'm not sure how many updates you can effectively wreck by
> interleaving the RmW cycles of two CPUs like this.

Yeah, there is the side effect of cache bouncing, which makes it not a
clear win... and a pure loss on x86...

> Simply losing a few increments would be fine, but what are the
> practical implications of actually relying on this behaviour and how do
> various architectures cope.

OK I'll give up the weird (mis-)use of the per-cpu data structure :)

Thanks,
Fengguang
---
Subject: writeback: fix dirtied pages accounting on sub-page writes
Date: Thu Apr 14 07:52:37 CST 2011

When dd in 512bytes, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same page, but
obviously the page is only dirtied once.

Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-11-22 22:01:56.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-22 22:02:32.000000000 +0800
@@ -1246,8 +1246,6 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1258,12 +1256,9 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-	else {
-		*p += nr_pages_dirtied;
-		if (unlikely(*p >= ratelimit_pages)) {
-			*p = 0;
-			ratelimit = 0;
-		}
+	else if (unlikely(*p >= ratelimit_pages)) {
+		*p = 0;
+		ratelimit = 0;
 	}
 	/*
 	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
@@ -1758,6 +1753,8 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_dirty_inc(current);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
+		this_cpu_inc(bdp_ratelimits);
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-22  6:41     ` Wu Fengguang
@ 2011-11-22 21:04       ` Jan Kara
  2011-11-23 13:17         ` Wu Fengguang
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2011-11-22 21:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Tue 22-11-11 14:41:49, Wu Fengguang wrote:
> On Tue, Nov 22, 2011 at 06:50:49AM +0800, Jan Kara wrote:
> > On Mon 21-11-11 21:03:43, Wu Fengguang wrote:
> > > Add an upper limit to balanced_rate according to the below inequality.
> > > This filters out some rare but huge singular points, which at least
> > > enables more readable gnuplot figures.
> > > 
> > > When there are N dd dirtiers,
> > > 
> > > 	balanced_dirty_ratelimit = write_bw / N
> > > 
> > > So it holds that
> > > 
> > > 	balanced_dirty_ratelimit <= write_bw
> >   The change makes sense, but do we understand why there are such huge
> > singular points? Are they due to errors in estimation of bandwidth or due
> > to errors in dirtying rate computations (e.g. due to truncates), or
> > something else?
> 
> Good point. I'll add this to the changelog:
> 
> The singular points originate from dirty_rate in the below formula:
> 
>         balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
> where
>         dirty_rate = (number of page dirties in the past 200ms) / 200ms
> 
> In the extreme case, if all dd tasks suddenly get blocked on something 
> else and hence no pages are dirtied at all, dirty_rate will be 0 and
> balanced_dirty_ratelimit will be inf. This could happen in reality.
> 
> There won't be tiny singular points though, as long as the dirty pages
> lie inside the dirty control area (above the freerun region).
> Because there the dd tasks will be throttled by balance_dirty_pages()
> and won't be able to suddenly dirty many more pages than average.
  OK, I see. Thanks for the explanation.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/5] dirty throttling bits for 3.3
  2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
                   ` (4 preceding siblings ...)
  2011-11-21 13:03 ` [PATCH 5/5] writeback: dirty ratelimit - think time compensation Wu Fengguang
@ 2011-11-23 12:44 ` Peter Zijlstra
  2011-11-28 13:56   ` Wu Fengguang
  5 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2011-11-23 12:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Christoph Hellwig, Andrew Morton, LKML

On Mon, 2011-11-21 at 21:03 +0800, Wu Fengguang wrote:
> Hi,
> 
> I'd like to push these 5 dirty throttling improvements to linux-next,
> targeting for inclusion in Linux 3.3.
> 
> It's a comfortably small series for review ;) And there are no changes since
> the last v12 post. (I went through them once again but find nothing to change.)
> 
>  [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
>  [PATCH 2/5] writeback: charge leaked page dirties to active tasks
>  [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
>  [PATCH 4/5] writeback: fix dirtied pages accounting on redirty
>  [PATCH 5/5] writeback: dirty ratelimit - think time compensation

Given the updates I'm ok with this series,

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Thanks!


* Re: [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-22 21:04       ` Jan Kara
@ 2011-11-23 13:17         ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-23 13:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

On Tue, Nov 22, 2011 at 10:04:37PM +0100, Jan Kara wrote:
> On Tue 22-11-11 14:41:49, Wu Fengguang wrote:
> > On Tue, Nov 22, 2011 at 06:50:49AM +0800, Jan Kara wrote:
> > > On Mon 21-11-11 21:03:43, Wu Fengguang wrote:
> > > > Add an upper limit to balanced_rate according to the below inequality.
> > > > This filters out some rare but huge singular points, which at least
> > > > enables more readable gnuplot figures.
> > > > 
> > > > When there are N dd dirtiers,
> > > > 
> > > > 	balanced_dirty_ratelimit = write_bw / N
> > > > 
> > > > So it holds that
> > > > 
> > > > 	balanced_dirty_ratelimit <= write_bw
> > >   The change makes sense, but do we understand why there are such huge
> > > singular points? Are they due to errors in estimation of bandwidth or due
> > > to errors in dirtying rate computations (e.g. due to truncates), or
> > > something else?
> > 
> > Good point. I'll add this to the changelog:
> > 
> > The singular points originate from dirty_rate in the below formula:
> > 
> >         balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
> > where
> >         dirty_rate = (number of page dirties in the past 200ms) / 200ms
> > 
> > In the extreme case, if all dd tasks suddenly get blocked on something 
> > else and hence no pages are dirtied at all, dirty_rate will be 0 and
> > balanced_dirty_ratelimit will be inf. This could happen in reality.
> > 
> > There won't be tiny singular points though, as long as the dirty pages
> > lie inside the dirty control area (above the freerun region).
> > Because there the dd tasks will be throttled by balance_dirty_pages()
> > and won't be able to suddenly dirty much more pages than average.
>   OK, I see. Thanks for explanation.

I'd like to comment that these huge singular points are not a real
threat, since they are _guaranteed_ to be filtered out by these lines
in bdi_update_dirty_ratelimit():

         * |task_ratelimit - dirty_ratelimit| is used to limit the step size
         * and filter out the singular points of balanced_dirty_ratelimit. Which
         * keeps jumping around randomly and can even leap far away at times
         * due to the small 200ms estimation period of dirty_rate (we want to
         * keep that period small to reduce time lags).
         */
        step = 0;
        if (dirty < setpoint) {
                x = min(bdi->balanced_dirty_ratelimit,
==>                      min(balanced_dirty_ratelimit, task_ratelimit));
                if (dirty_ratelimit < x)
                        step = x - dirty_ratelimit;
        } else {
                x = max(bdi->balanced_dirty_ratelimit,
                         max(balanced_dirty_ratelimit, task_ratelimit));
                if (dirty_ratelimit > x)
                        step = dirty_ratelimit - x;
        }

The caveat is that task_ratelimit, which is based on the number of
dirty pages, will never _suddenly_ fly away like
balanced_dirty_ratelimit. So any weirdly large
balanced_dirty_ratelimit will be cut down to the level of task_ratelimit.

Thanks,
Fengguang


* Re: [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-22 12:21       ` Jan Kara
  2011-11-22 12:30         ` Wu Fengguang
  2011-11-22 12:57         ` Peter Zijlstra
@ 2011-11-28 13:51         ` Wu Fengguang
  2 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:51 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

Jan,

On Tue, Nov 22, 2011 at 01:21:57PM +0100, Jan Kara wrote:
> On Tue 22-11-11 17:21:11, Wu Fengguang wrote:
> > On Tue, Nov 22, 2011 at 08:11:27AM +0800, Jan Kara wrote:
> > > On Mon 21-11-11 21:03:45, Wu Fengguang wrote:
> > > > When dd in 512bytes, generic_perform_write() calls
> > > > balance_dirty_pages_ratelimited() 8 times for the same page, but
> > > > obviously the page is only dirtied once.
> > > > 
> > > > Fix it by accounting nr_dirtied at page dirty time.
> > >   Well, but after this change, the interface balance_dirty_pages_ratelimited_nr()
> > > is strange because the argument is only used for per-CPU ratelimiting and
> > > not for per-task ratelimiting...
> > 
> > Yeah I was vaguely aware of this... and still choose to ignore this
> > since the patchset looked already forbiddingly large at the time ;)
> > 
> > > So if you do this switch then I'd also
> > > switch bdp_ratelimits to get consistent results and a clean interface and
> > > completely kill balance_dirty_pages_ratelimited_nr().
> > 
> > Following your suggestions to change ratelimiting as well :)
> > 
> > I'll do the interface change with a standalone patch.
>   OK.

It looks worthwhile to consider a better scheme to reduce calls into
balance_dirty_pages_ratelimited_nr() before doing the interface
change. Currently that function is called on every page in
generic_perform_write().

I have no clear idea for now, so I would like to delay the interface
change until after this patch.

Thanks,
Fengguang


* Re: [PATCH 0/5] dirty throttling bits for 3.3
  2011-11-23 12:44 ` [PATCH 0/5] dirty throttling bits for 3.3 Peter Zijlstra
@ 2011-11-28 13:56   ` Wu Fengguang
  0 siblings, 0 replies; 29+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Jan Kara, Christoph Hellwig, Andrew Morton, LKML

On Wed, Nov 23, 2011 at 08:44:09PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 21:03 +0800, Wu Fengguang wrote:
> > Hi,
> > 
> > I'd like to push these 5 dirty throttling improvements to linux-next,
> > targeting for inclusion in Linux 3.3.
> > 
> > It's a comfortably small series for review ;) And there are no changes since
> > the last v12 post. (I went through them once again but find nothing to change.)
> > 
> >  [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth
> >  [PATCH 2/5] writeback: charge leaked page dirties to active tasks
> >  [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes
> >  [PATCH 4/5] writeback: fix dirtied pages accounting on redirty
> >  [PATCH 5/5] writeback: dirty ratelimit - think time compensation
> 
> Given the updates I'm ok with this series,
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> Thanks!

Thank you! I'll post the updated series with your acks :)

Thanks,
Fengguang


end of thread, other threads:[~2011-11-28 13:56 UTC | newest]

Thread overview: 29+ messages
2011-11-21 13:03 [PATCH 0/5] dirty throttling bits for 3.3 Wu Fengguang
2011-11-21 13:03 ` [PATCH 1/5] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
2011-11-21 22:50   ` Jan Kara
2011-11-22  6:41     ` Wu Fengguang
2011-11-22 21:04       ` Jan Kara
2011-11-23 13:17         ` Wu Fengguang
2011-11-21 13:03 ` [PATCH 2/5] writeback: charge leaked page dirties to active tasks Wu Fengguang
2011-11-21 21:49   ` Andrew Morton
2011-11-21 23:46     ` Jan Kara
2011-11-22 13:35     ` Wu Fengguang
2011-11-21 13:03 ` [PATCH 3/5] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-11-22  0:11   ` Jan Kara
2011-11-22  9:21     ` Wu Fengguang
2011-11-22 12:21       ` Jan Kara
2011-11-22 12:30         ` Wu Fengguang
2011-11-22 12:48           ` Jan Kara
2011-11-22 13:02             ` Wu Fengguang
2011-11-22 12:57         ` Peter Zijlstra
2011-11-22 13:07           ` Wu Fengguang
2011-11-22 13:41             ` Wu Fengguang
2011-11-22 13:53               ` Peter Zijlstra
2011-11-22 14:11                 ` Wu Fengguang
2011-11-28 13:51         ` Wu Fengguang
2011-11-21 13:03 ` [PATCH 4/5] writeback: fix dirtied pages accounting on redirty Wu Fengguang
2011-11-21 21:51   ` Andrew Morton
2011-11-22 13:59     ` Wu Fengguang
2011-11-21 13:03 ` [PATCH 5/5] writeback: dirty ratelimit - think time compensation Wu Fengguang
2011-11-23 12:44 ` [PATCH 0/5] dirty throttling bits for 3.3 Peter Zijlstra
2011-11-28 13:56   ` Wu Fengguang
