linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] dirty throttling bits for 3.3 (v2)
@ 2011-11-28 13:53 Wu Fengguang
  2011-11-28 13:53 ` [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Christoph Hellwig, Andrew Morton,
	Wu Fengguang, LKML

Hi,

There are now 7 dirty throttling improvements targeting Linux 3.3.

Changes since v1:
- added many comments and enriched the changelogs
- renamed dirty_leaks to dirty_throttle_leaks
- added the btrfs accounting fix for sub-page writes
- accounted bdp_ratelimits at page dirty time (using this_cpu_inc)

 [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth
 [PATCH 2/7] writeback: charge leaked page dirties to active tasks
 [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes
 [PATCH 4/7] writeback: fix dirtied pages accounting on redirty
 [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes
 [PATCH 6/7] writeback: dirty ratelimit - think time compensation
 [PATCH 7/7] writeback: comment on the bdi dirty threshold

 fs/btrfs/file.c                  |    3 
 include/linux/sched.h            |    1 
 include/linux/writeback.h        |    4 +
 include/trace/events/writeback.h |   14 ++-
 kernel/exit.c                    |    2 
 kernel/fork.c                    |    1 
 mm/page-writeback.c              |  116 +++++++++++++++++++++++++----
 7 files changed, 123 insertions(+), 18 deletions(-)

Thanks,
Fengguang



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-12-07 10:21   ` Jan Kara
  2011-11-28 13:53 ` [PATCH 2/7] writeback: charge leaked page dirties to active tasks Wu Fengguang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: ref-bw-up-bound --]
[-- Type: text/plain, Size: 2138 bytes --]

Add an upper limit to balanced_rate according to the inequality below.
This filters out some rare but huge singular points, which at least
makes the gnuplot figures more readable.

When there are N dd dirtiers,

	balanced_dirty_ratelimit = write_bw / N

So it holds that

	balanced_dirty_ratelimit <= write_bw

The singular points originate from dirty_rate in the formula below:

        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
where
	dirty_rate = (number of page dirties in the past 200ms) / 200ms

In the extreme case, if all dd tasks suddenly get blocked on something
else and hence no pages are dirtied at all, dirty_rate will be 0 and
balanced_dirty_ratelimit will be infinite. This can happen in reality.

Note that these huge singular points are not a real threat, since they
are _guaranteed_ to be filtered out by the
	min(balanced_dirty_ratelimit, task_ratelimit)
line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
number of dirty pages, which will never _suddenly_ fly away like
balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
will be cut down to the level of task_ratelimit.

There won't be tiny singular points, though, as long as the dirty pages
stay inside the dirty throttling region (above the freerun region):
there, the dd tasks are throttled by balance_dirty_pages() and cannot
suddenly dirty many more pages than average.
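
As a sanity check, the computation (including the `dirty_rate | 1` trick
used in the patch) can be modeled in plain Python; the numbers below are
purely illustrative:

```python
def balanced_rate(task_ratelimit, write_bw, dirty_rate):
    # Mirror of bdi_update_dirty_ratelimit(): "| 1" avoids division by
    # zero, and the result is clamped to write_bw to cut off the rare
    # huge singular points described above.
    rate = task_ratelimit * write_bw // (dirty_rate | 1)
    return min(rate, write_bw)

# Normal case: 4 dd tasks on a 100 MB/s disk, each rated ~write_bw/4.
print(balanced_rate(25_000, 100_000, 100_000))   # ~ write_bw / 4

# Singular point: no pages dirtied in the last 200ms period. Without
# the clamp the ratio blows up; with it, the rate is capped at write_bw.
print(balanced_rate(25_000, 100_000, 0))         # 100000
```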

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++++
 1 file changed, 5 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:18:03.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:18:23.000000000 +0800
@@ -804,6 +804,11 @@ static void bdi_update_dirty_ratelimit(s
 	 */
 	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
 					   dirty_rate | 1);
+	/*
+	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
+	 */
+	if (unlikely(balanced_dirty_ratelimit > write_bw))
+		balanced_dirty_ratelimit = write_bw;
 
 	/*
 	 * We could safely do this and return immediately:




* [PATCH 2/7] writeback: charge leaked page dirties to active tasks
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
  2011-11-28 13:53 ` [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-12-07 10:23   ` Jan Kara
  2011-11-28 13:53 ` [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-save-leaks-at-exit.patch --]
[-- Type: text/plain, Size: 3320 bytes --]

It is a long-standing problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) can starve long-running dirtiers
(eg. dd) as well as push the dirty pages to the global hard limit.

The solution is to charge the pages dirtied by the exited gcc to the
other, randomly chosen, dirtying tasks. This is not perfect, but should
behave well enough in practice: the throttled tasks are not actually
running, so the tasks that are running are more likely to pick up the
leaked count and get throttled, promoting an even spread.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 kernel/exit.c             |    2 ++
 mm/page-writeback.c       |   27 +++++++++++++++++++++++++++
 3 files changed, 31 insertions(+)

--- linux-next.orig/include/linux/writeback.h	2011-11-28 21:23:19.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-11-28 21:23:20.000000000 +0800
@@ -7,6 +7,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 
+DECLARE_PER_CPU(int, dirty_throttle_leaks);
+
 /*
  * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
  *
--- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-28 21:23:20.000000000 +0800
@@ -1195,6 +1195,22 @@ void set_page_dirty_balance(struct page 
 
 static DEFINE_PER_CPU(int, bdp_ratelimits);
 
+/*
+ * Normal tasks are throttled by
+ *	loop {
+ *		dirty tsk->nr_dirtied_pause pages;
+ *		take a nap in balance_dirty_pages();
+ *	}
+ * However there is a worst case: if every task exits immediately after
+ * dirtying (tsk->nr_dirtied_pause - 1) pages, balance_dirty_pages() will
+ * never be called to throttle the page dirties. The solution is to save the
+ * not-yet-throttled page dirties in dirty_throttle_leaks on task exit and
+ * charge them randomly to the running tasks. This works well for the above
+ * worst case, as a new task will pick up and accumulate the old tasks'
+ * leaked dirty count and eventually get throttled.
+ */
+DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
+
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -1242,6 +1258,17 @@ void balance_dirty_pages_ratelimited_nr(
 			ratelimit = 0;
 		}
 	}
+	/*
+	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
+	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
+	 * the dirty throttling and livelock other long-run dirtiers.
+	 */
+	p = &__get_cpu_var(dirty_throttle_leaks);
+	if (*p > 0 && current->nr_dirtied < ratelimit) {
+		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
+		*p -= nr_pages_dirtied;
+		current->nr_dirtied += nr_pages_dirtied;
+	}
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
--- linux-next.orig/kernel/exit.c	2011-11-28 21:23:19.000000000 +0800
+++ linux-next/kernel/exit.c	2011-11-28 21:23:20.000000000 +0800
@@ -1037,6 +1037,8 @@ NORET_TYPE void do_exit(long code)
 	validate_creds_for_do_exit(tsk);
 
 	preempt_disable();
+	if (tsk->nr_dirtied)
+		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
 	exit_rcu();
 	/* causes final put_task_struct in finish_task_switch(). */
 	tsk->state = TASK_DEAD;




* [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
  2011-11-28 13:53 ` [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
  2011-11-28 13:53 ` [PATCH 2/7] writeback: charge leaked page dirties to active tasks Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-12-07 10:53   ` Jan Kara
  2011-11-28 13:53 ` [PATCH 4/7] writeback: fix dirtied pages accounting on redirty Wu Fengguang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-accurate-task-dirtied.patch --]
[-- Type: text/plain, Size: 1640 bytes --]

When dd writes in 512-byte chunks, generic_perform_write() calls
balance_dirty_pages_ratelimited() 8 times for the same page, even though
the page is only dirtied once.

Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
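
The over-accounting factor is easy to see with a little illustrative
arithmetic (assuming 4KB pages):

```python
PAGE_SIZE = 4096    # bytes per page (illustrative x86 value)
WRITE_CHUNK = 512   # dd bs=512

# Before the fix: balance_dirty_pages_ratelimited() bumped nr_dirtied on
# every write() call, i.e. 8 times for one page that is dirtied once.
calls_per_page = PAGE_SIZE // WRITE_CHUNK
print(calls_per_page)   # 8

# After the fix the increment lives in account_page_dirtied(), which runs
# only on the clean->dirty transition, so exactly once per page.
```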

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:20.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-28 21:23:23.000000000 +0800
@@ -1239,8 +1239,6 @@ void balance_dirty_pages_ratelimited_nr(
 	if (bdi->dirty_exceeded)
 		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
 
-	current->nr_dirtied += nr_pages_dirtied;
-
 	preempt_disable();
 	/*
 	 * This prevents one CPU to accumulate too many dirtied pages without
@@ -1251,12 +1249,9 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(bdp_ratelimits);
 	if (unlikely(current->nr_dirtied >= ratelimit))
 		*p = 0;
-	else {
-		*p += nr_pages_dirtied;
-		if (unlikely(*p >= ratelimit_pages)) {
-			*p = 0;
-			ratelimit = 0;
-		}
+	else if (unlikely(*p >= ratelimit_pages)) {
+		*p = 0;
+		ratelimit = 0;
 	}
 	/*
 	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
@@ -1749,6 +1744,8 @@ void account_page_dirtied(struct page *p
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
 		task_io_account_write(PAGE_CACHE_SIZE);
+		current->nr_dirtied++;
+		this_cpu_inc(bdp_ratelimits);
 	}
 }
 EXPORT_SYMBOL(account_page_dirtied);




* [PATCH 4/7] writeback: fix dirtied pages accounting on redirty
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-11-28 13:53 ` [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-12-07 16:09   ` Jan Kara
  2011-11-28 13:53 ` [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes Wu Fengguang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-account-redirty --]
[-- Type: text/plain, Size: 2551 bytes --]

De-account the accumulated dirty counters on page redirty.

Page redirties (very common in ext4) introduce a mismatch between the
counters in (a) and (b):

a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
b) NR_WRITTEN, BDI_WRITTEN

The mismatch introduces systematic errors into balanced_rate and results
in dirty page position errors (ie. the dirty pages are no longer balanced
around the global/bdi setpoints).
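
A toy ledger (illustrative Python) shows why the de-accounting keeps (a)
and (b) matched in the long term:

```python
# Without de-accounting, every redirty inflates the dirtied counters in
# (a) relative to the written counters in (b), skewing dirty_rate and
# hence balanced_dirty_ratelimit.
nr_dirtied = 0   # stands in for NR_DIRTIED / BDI_DIRTIED / tsk->nr_dirtied
nr_written = 0   # stands in for NR_WRITTEN / BDI_WRITTEN

def page_dirtied():
    global nr_dirtied
    nr_dirtied += 1          # account_page_dirtied()

def page_redirtied():
    global nr_dirtied
    page_dirtied()           # the page is marked dirty again...
    nr_dirtied -= 1          # ...and account_page_redirty() de-accounts it

def page_written():
    global nr_written
    nr_written += 1          # writeback completion

page_dirtied(); page_redirtied(); page_redirtied(); page_written()
print(nr_dirtied - nr_written)   # 0: counters match despite two redirties
```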

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/writeback.h |    2 ++
 mm/page-writeback.c       |   19 +++++++++++++++++++
 2 files changed, 21 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:23.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-28 21:23:24.000000000 +0800
@@ -1806,6 +1806,24 @@ int __set_page_dirty_nobuffers(struct pa
 EXPORT_SYMBOL(__set_page_dirty_nobuffers);
 
 /*
+ * Call this whenever redirtying a page, to de-account the dirty counters
+ * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written
+ * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to
+ * systematic errors in balanced_dirty_ratelimit and the dirty pages position
+ * control.
+ */
+void account_page_redirty(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		current->nr_dirtied--;
+		dec_zone_page_state(page, NR_DIRTIED);
+		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+	}
+}
+EXPORT_SYMBOL(account_page_redirty);
+
+/*
  * When a writepage implementation decides that it doesn't want to write this
  * page for some reason, it should redirty the locked page via
  * redirty_page_for_writepage() and it should then unlock the page and return 0
@@ -1813,6 +1831,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers
 int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
 {
 	wbc->pages_skipped++;
+	account_page_redirty(page);
 	return __set_page_dirty_nobuffers(page);
 }
 EXPORT_SYMBOL(redirty_page_for_writepage);
--- linux-next.orig/include/linux/writeback.h	2011-11-28 21:23:20.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-11-28 21:23:24.000000000 +0800
@@ -197,6 +197,8 @@ void writeback_set_ratelimit(void);
 void tag_pages_for_writeback(struct address_space *mapping,
 			     pgoff_t start, pgoff_t end);
 
+void account_page_redirty(struct page *page);
+
 /* pdflush.c */
 extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
 				   read-only. */




* [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-11-28 13:53 ` [PATCH 4/7] writeback: fix dirtied pages accounting on redirty Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-11-28 14:16   ` Wu Fengguang
  2011-11-28 13:53 ` [PATCH 6/7] writeback: dirty ratelimit - think time compensation Wu Fengguang
  2011-11-28 13:53 ` [PATCH 7/7] writeback: comment on the bdi dirty threshold Wu Fengguang
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Chris Mason, Wu Fengguang, Peter Zijlstra,
	Christoph Hellwig, Andrew Morton, LKML

[-- Attachment #1: btrfs-account-redirty --]
[-- Type: text/plain, Size: 857 bytes --]

When doing 1KB sequential writes to the same page,
balance_dirty_pages_ratelimited_nr() should be called once instead of 4
times; the latter throttles the dirtier tasks much too heavily.

Fix it with proper de-accounting on clear_page_dirty_for_io().

CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/btrfs/file.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-next.orig/fs/btrfs/file.c	2011-11-17 20:13:47.000000000 +0800
+++ linux-next/fs/btrfs/file.c	2011-11-17 20:18:51.000000000 +0800
@@ -1136,7 +1136,8 @@ again:
 				     GFP_NOFS);
 	}
 	for (i = 0; i < num_pages; i++) {
-		clear_page_dirty_for_io(pages[i]);
+		if (clear_page_dirty_for_io(pages[i]))
+			account_page_redirty(pages[i]);
 		set_page_extent_mapped(pages[i]);
 		WARN_ON(!PageLocked(pages[i]));
 	}




* [PATCH 6/7] writeback: dirty ratelimit - think time compensation
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
                   ` (4 preceding siblings ...)
  2011-11-28 13:53 ` [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-12-07 16:14   ` Jan Kara
  2011-11-28 13:53 ` [PATCH 7/7] writeback: comment on the bdi dirty threshold Wu Fengguang
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: think-time-compensation --]
[-- Type: text/plain, Size: 7832 bytes --]

Compensate for the task's think time when computing the final pause time,
so that the task's effective dirty rate accurately matches
->dirty_ratelimit.

        think time := time spent outside of balance_dirty_pages()

In the rare case that the task slept longer than the 200ms period time
(resulting in a negative pause time), the sleep time will be compensated
for in the following periods too, provided it is less than 1 second.

Accumulated errors are carefully avoided as long as the max pause area
is not hit.

Pseudo code:

        period = pages_dirtied / task_ratelimit;
        think = jiffies - dirty_paused_when;
        pause = period - think;

1) normal case: period > think

        pause = period - think
        dirty_paused_when = jiffies + pause
        nr_dirtied = 0

                             period time
              |===============================>|
                  think time      pause time
              |===============>|==============>|
        ------|----------------|---------------|------------------------
        dirty_paused_when   jiffies


2) no pause case: period <= think

        don't pause; reduce future pause time by:
        dirty_paused_when += period
        nr_dirtied = 0

                           period time
              |===============================>|
                                  think time
              |===================================================>|
        ------|--------------------------------+-------------------|----
        dirty_paused_when                                       jiffies
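
The two cases above can be modeled in a few lines of illustrative Python
(HZ and all numbers are made up; the function returns the sleep length
paired with the new dirty_paused_when):

```python
HZ = 1000  # jiffies per second, illustrative

def compute_pause(pages_dirtied, task_ratelimit, now, dirty_paused_when):
    """Model of the think time compensation in balance_dirty_pages()."""
    period = HZ * pages_dirtied // task_ratelimit
    think = now - dirty_paused_when
    pause = period - think
    if pause <= 0:
        # No sleep. Short oversleeps (< 1s) are paid back by advancing
        # the virtual time; longer ones reset it (likely a light dirtier).
        new_when = now if -pause > HZ else dirty_paused_when + period
        return 0, new_when
    return pause, now + pause

# 1) normal case: 200ms period, 50ms think time -> sleep the remaining 150ms
print(compute_pause(200, 1000, now=1050, dirty_paused_when=1000))  # (150, 1200)

# 2) no-pause case: think time exceeds the period -> no sleep, carry over
print(compute_pause(200, 1000, now=1500, dirty_paused_when=1000))  # (0, 1200)
```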

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h            |    1 
 include/trace/events/writeback.h |   14 ++++++++---
 kernel/fork.c                    |    1 
 mm/page-writeback.c              |   36 +++++++++++++++++++++++++----
 4 files changed, 45 insertions(+), 7 deletions(-)

--- linux-next.orig/include/linux/sched.h	2011-11-17 20:12:22.000000000 +0800
+++ linux-next/include/linux/sched.h	2011-11-17 20:12:35.000000000 +0800
@@ -1527,6 +1527,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
--- linux-next.orig/mm/page-writeback.c	2011-11-17 20:12:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-17 20:12:38.000000000 +0800
@@ -1010,6 +1010,7 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	long period;
 	long pause = 0;
 	long uninitialized_var(max_pause);
 	bool dirty_exceeded = false;
@@ -1020,6 +1021,8 @@ static void balance_dirty_pages(struct a
 	unsigned long start_time = jiffies;
 
 	for (;;) {
+		unsigned long now = jiffies;
+
 		/*
 		 * Unstable writes are a feature of certain networked
 		 * filesystems (i.e. NFS) in which data may have been
@@ -1039,8 +1042,11 @@ static void balance_dirty_pages(struct a
 		 */
 		freerun = dirty_freerun_ceiling(dirty_thresh,
 						background_thresh);
-		if (nr_dirty <= freerun)
+		if (nr_dirty <= freerun) {
+			current->dirty_paused_when = now;
+			current->nr_dirtied = 0;
 			break;
+		}
 
 		if (unlikely(!writeback_in_progress(bdi)))
 			bdi_start_background_writeback(bdi);
@@ -1098,10 +1104,21 @@ static void balance_dirty_pages(struct a
 		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
 							RATELIMIT_CALC_SHIFT;
 		if (unlikely(task_ratelimit == 0)) {
+			period = max_pause;
 			pause = max_pause;
 			goto pause;
 		}
-		pause = HZ * pages_dirtied / task_ratelimit;
+		period = HZ * pages_dirtied / task_ratelimit;
+		pause = period;
+		if (current->dirty_paused_when)
+			pause -= now - current->dirty_paused_when;
+		/*
+		 * For less than 1s think time (ext3/4 may block the dirtier
+		 * for up to 800ms from time to time on 1-HDD; so does xfs,
+		 * however at much less frequency), try to compensate it in
+		 * future periods by updating the virtual time; otherwise just
+		 * do a reset, as it may be a light dirtier.
+		 */
 		if (unlikely(pause <= 0)) {
 			trace_balance_dirty_pages(bdi,
 						  dirty_thresh,
@@ -1112,8 +1129,16 @@ static void balance_dirty_pages(struct a
 						  dirty_ratelimit,
 						  task_ratelimit,
 						  pages_dirtied,
+						  period,
 						  pause,
 						  start_time);
+			if (pause < -HZ) {
+				current->dirty_paused_when = now;
+				current->nr_dirtied = 0;
+			} else if (period) {
+				current->dirty_paused_when += period;
+				current->nr_dirtied = 0;
+			}
 			pause = 1; /* avoid resetting nr_dirtied_pause below */
 			break;
 		}
@@ -1129,11 +1154,15 @@ pause:
 					  dirty_ratelimit,
 					  task_ratelimit,
 					  pages_dirtied,
+					  period,
 					  pause,
 					  start_time);
 		__set_current_state(TASK_KILLABLE);
 		io_schedule_timeout(pause);
 
+		current->dirty_paused_when = now + pause;
+		current->nr_dirtied = 0;
+
 		/*
 		 * This is typically equal to (nr_dirty < dirty_thresh) and can
 		 * also keep "1000+ dd on a slow USB stick" under control.
@@ -1148,11 +1177,10 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	current->nr_dirtied = 0;
 	if (pause == 0) { /* in freerun area */
 		current->nr_dirtied_pause =
 				dirty_poll_interval(nr_dirty, dirty_thresh);
-	} else if (pause <= max_pause / 4 &&
+	} else if (period <= max_pause / 4 &&
 		   pages_dirtied >= current->nr_dirtied_pause) {
 		current->nr_dirtied_pause = clamp_val(
 					dirty_ratelimit * (max_pause / 2) / HZ,
--- linux-next.orig/kernel/fork.c	2011-11-17 20:12:23.000000000 +0800
+++ linux-next/kernel/fork.c	2011-11-17 20:12:35.000000000 +0800
@@ -1296,6 +1296,7 @@ static struct task_struct *copy_process(
 
 	p->nr_dirtied = 0;
 	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
+	p->dirty_paused_when = 0;
 
 	/*
 	 * Ok, make it visible to the rest of the system.
--- linux-next.orig/include/trace/events/writeback.h	2011-11-17 19:13:41.000000000 +0800
+++ linux-next/include/trace/events/writeback.h	2011-11-17 20:12:35.000000000 +0800
@@ -289,12 +289,13 @@ TRACE_EVENT(balance_dirty_pages,
 		 unsigned long dirty_ratelimit,
 		 unsigned long task_ratelimit,
 		 unsigned long dirtied,
+		 unsigned long period,
 		 long pause,
 		 unsigned long start_time),
 
 	TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
 		dirty_ratelimit, task_ratelimit,
-		dirtied, pause, start_time),
+		dirtied, period, pause, start_time),
 
 	TP_STRUCT__entry(
 		__array(	 char,	bdi, 32)
@@ -309,6 +310,8 @@ TRACE_EVENT(balance_dirty_pages,
 		__field(unsigned int,	dirtied_pause)
 		__field(unsigned long,	paused)
 		__field(	 long,	pause)
+		__field(unsigned long,	period)
+		__field(	 long,	think)
 	),
 
 	TP_fast_assign(
@@ -325,6 +328,9 @@ TRACE_EVENT(balance_dirty_pages,
 		__entry->task_ratelimit	= KBps(task_ratelimit);
 		__entry->dirtied	= dirtied;
 		__entry->dirtied_pause	= current->nr_dirtied_pause;
+		__entry->think		= current->dirty_paused_when == 0 ? 0 :
+			 (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
+		__entry->period		= period * 1000 / HZ;
 		__entry->pause		= pause * 1000 / HZ;
 		__entry->paused		= (jiffies - start_time) * 1000 / HZ;
 	),
@@ -335,7 +341,7 @@ TRACE_EVENT(balance_dirty_pages,
 		  "bdi_setpoint=%lu bdi_dirty=%lu "
 		  "dirty_ratelimit=%lu task_ratelimit=%lu "
 		  "dirtied=%u dirtied_pause=%u "
-		  "paused=%lu pause=%ld",
+		  "paused=%lu pause=%ld period=%lu think=%ld",
 		  __entry->bdi,
 		  __entry->limit,
 		  __entry->setpoint,
@@ -347,7 +353,9 @@ TRACE_EVENT(balance_dirty_pages,
 		  __entry->dirtied,
 		  __entry->dirtied_pause,
 		  __entry->paused,	/* ms */
-		  __entry->pause	/* ms */
+		  __entry->pause,	/* ms */
+		  __entry->period,	/* ms */
+		  __entry->think	/* ms */
 	  )
 );
 




* [PATCH 7/7] writeback: comment on the bdi dirty threshold
  2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
                   ` (5 preceding siblings ...)
  2011-11-28 13:53 ` [PATCH 6/7] writeback: dirty ratelimit - think time compensation Wu Fengguang
@ 2011-11-28 13:53 ` Wu Fengguang
  2011-12-07 10:57   ` Jan Kara
  6 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 13:53 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
	Andrew Morton, LKML

[-- Attachment #1: writeback-comment-bdi_thresh.patch --]
[-- Type: text/plain, Size: 2676 bytes --]

We use "floating proportions" to let active devices grow their target
share of dirty pages over time, and stalled/inactive devices shrink
theirs.

It works well except in the case of "an inactive disk suddenly goes
busy", where the initial target share may be too small. To mitigate
this, bdi_position_ratio() has the below line to raise a small
bdi_thresh when it's safe to do so, so that the disk is fed enough dirty
pages for efficient IO and, in turn, a fast ramp-up of bdi_thresh:

        bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

balance_dirty_pages() normally does negative feedback control, adjusting
the ratelimit to balance the bdi dirty pages around the target. In some
extreme cases when that is not enough, it has to block the tasks
completely until the bdi dirty pages drop below bdi_thresh.
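
The effect of that line is easy to see in a trivial Python model
(arbitrary page-count units, purely illustrative):

```python
def bdi_dirty_limit(bdi_thresh, limit, dirty):
    # Floor for inactive-but-suddenly-busy devices: grant at least 1/8 of
    # the remaining global headroom, so IO can start efficiently and the
    # floating proportions can then ramp the real share up quickly.
    return max(bdi_thresh, (limit - dirty) // 8)

# A long-idle disk has earned ~0 share, but there is plenty of headroom:
print(bdi_dirty_limit(10, limit=1000, dirty=200))   # 100

# An active disk keeps its earned share untouched:
print(bdi_dirty_limit(500, limit=1000, dirty=200))  # 500
```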

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-11-23 10:57:41.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-23 11:44:39.000000000 +0800
@@ -411,8 +411,13 @@ void global_dirty_limits(unsigned long *
  *
  * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
  * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
- * And the "limit" in the name is not seriously taken as hard limit in
- * balance_dirty_pages().
+ *
+ * Note that balance_dirty_pages() will only seriously take it as a hard limit
+ * when sleeping max_pause per page is not enough to keep the dirty pages under
 * control. For example, when the device is completely stalled due to some
 * error condition, or when there are 1000 dd tasks writing to a slow
 * 10MB/s USB key. In other, normal situations it acts more gently,
 * throttling the tasks more (rather than blocking them completely).
  *
  * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
@@ -594,6 +599,13 @@ static unsigned long bdi_position_ratio(
 	 */
 	if (unlikely(bdi_thresh > thresh))
 		bdi_thresh = thresh;
+	/*
+	 * It's very possible that bdi_thresh is close to 0, not because the
+	 * device is slow, but because it has remained inactive for a long time.
+	 * Honour such devices with a reasonably good (hopefully IO efficient)
+	 * threshold, so that occasional writes won't be blocked and active
+	 * writes can ramp up the threshold quickly.
+	 */
 	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
 	/*
 	 * scale global setpoint to bdi's:




* Re: [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes
  2011-11-28 13:53 ` [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes Wu Fengguang
@ 2011-11-28 14:16   ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2011-11-28 14:16 UTC (permalink / raw)
  To: Chris Mason
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

Hi Chris,

This btrfs patch depends on the account_page_redirty() introduced
earlier in this series.

Shall I carry it together with this patchset?

Thanks,
Fengguang

On Mon, Nov 28, 2011 at 09:53:43PM +0800, Wu, Fengguang wrote:
> When doing 1KB sequential writes to the same page,
> balance_dirty_pages_ratelimited_nr() should be called once instead of 4
> times; the latter throttles the dirtier tasks much too heavily.
> 
> Fix it with proper de-accounting on clear_page_dirty_for_io().
> 
> CC: Chris Mason <chris.mason@oracle.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/btrfs/file.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- linux-next.orig/fs/btrfs/file.c	2011-11-17 20:13:47.000000000 +0800
> +++ linux-next/fs/btrfs/file.c	2011-11-17 20:18:51.000000000 +0800
> @@ -1136,7 +1136,8 @@ again:
>  				     GFP_NOFS);
>  	}
>  	for (i = 0; i < num_pages; i++) {
> -		clear_page_dirty_for_io(pages[i]);
> +		if (clear_page_dirty_for_io(pages[i]))
> +			account_page_redirty(pages[i]);
>  		set_page_extent_mapped(pages[i]);
>  		WARN_ON(!PageLocked(pages[i]));
>  	}
> 


* Re: [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth
  2011-11-28 13:53 ` [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
@ 2011-12-07 10:21   ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2011-12-07 10:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 28-11-11 21:53:39, Wu Fengguang wrote:
> Add an upper limit to balanced_rate according to the below inequality.
> This filters out some rare but huge singular points, which at least
> enables more readable gnuplot figures.
> 
> When there are N dd dirtiers,
> 
> 	balanced_dirty_ratelimit = write_bw / N
> 
> So it holds that
> 
> 	balanced_dirty_ratelimit <= write_bw
> 
> The singular points originate from dirty_rate in the formula below:
> 
>         balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate
> where
> 	dirty_rate = (number of page dirties in the past 200ms) / 200ms
> 
> In the extreme case, if all dd tasks suddenly get blocked on something
> else and hence no pages are dirtied at all, dirty_rate will be 0 and
> balanced_dirty_ratelimit will be inf. This could happen in reality.
> 
> Note that these huge singular points are not a real threat, since they
> are _guaranteed_ to be filtered out by the
> 	min(balanced_dirty_ratelimit, task_ratelimit)
> line in bdi_update_dirty_ratelimit(). task_ratelimit is based on the
> number of dirty pages, which will never _suddenly_ fly away like
> balanced_dirty_ratelimit. So any weirdly large balanced_dirty_ratelimit
> will be cut down to the level of task_ratelimit.
> 
> There won't be tiny singular points, though, as long as the dirty pages
> stay inside the dirty throttling region (above the freerun region):
> there, the dd tasks are throttled by balance_dirty_pages() and cannot
> suddenly dirty many more pages than average.
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  After your explanation I agree as well. So you can add
  Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:18:03.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-17 20:18:23.000000000 +0800
> @@ -804,6 +804,11 @@ static void bdi_update_dirty_ratelimit(s
>  	 */
>  	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
>  					   dirty_rate | 1);
> +	/*
> +	 * balanced_dirty_ratelimit ~= (write_bw / N) <= write_bw
> +	 */
> +	if (unlikely(balanced_dirty_ratelimit > write_bw))
> +		balanced_dirty_ratelimit = write_bw;
>  
>  	/*
>  	 * We could safely do this and return immediately:
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/7] writeback: charge leaked page dirties to active tasks
  2011-11-28 13:53 ` [PATCH 2/7] writeback: charge leaked page dirties to active tasks Wu Fengguang
@ 2011-12-07 10:23   ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2011-12-07 10:23 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 28-11-11 21:53:40, Wu Fengguang wrote:
> It's a years long problem that a large number of short-lived dirtiers
> (eg. gcc instances in a fast kernel build) may starve long-run dirtiers
> (eg. dd) as well as pushing the dirty pages to the global hard limit.
> 
> The solution is to charge the pages dirtied by the exited gcc to the
> other random dirtying tasks. It's not perfect, however it should
> behave well enough in practice: throttled tasks aren't actually
> running, so the tasks that are running are more likely to pick up the
> leaked count and get throttled, which promotes an equal spread.
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  Since I don't see a better solution here :)
Acked-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  include/linux/writeback.h |    2 ++
>  kernel/exit.c             |    2 ++
>  mm/page-writeback.c       |   27 +++++++++++++++++++++++++++
>  3 files changed, 31 insertions(+)
> 
> --- linux-next.orig/include/linux/writeback.h	2011-11-28 21:23:19.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-11-28 21:23:20.000000000 +0800
> @@ -7,6 +7,8 @@
>  #include <linux/sched.h>
>  #include <linux/fs.h>
>  
> +DECLARE_PER_CPU(int, dirty_throttle_leaks);
> +
>  /*
>   * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
>   *
> --- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:19.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-28 21:23:20.000000000 +0800
> @@ -1195,6 +1195,22 @@ void set_page_dirty_balance(struct page 
>  
>  static DEFINE_PER_CPU(int, bdp_ratelimits);
>  
> +/*
> + * Normal tasks are throttled by
> + *	loop {
> + *		dirty tsk->nr_dirtied_pause pages;
> + *		take a snap in balance_dirty_pages();
> + *	}
> + * However there is a worst case. If every task exits right after dirtying
> + * (tsk->nr_dirtied_pause - 1) pages, balance_dirty_pages() will never be
> + * called to throttle the page dirties. The solution is to save the not yet
> + * throttled page dirties in dirty_throttle_leaks on task exit and charge them
> + * randomly into the running tasks. This works well for the above worst case,
> + * as the new task will pick up and accumulate the old task's leaked dirty
> + * count and eventually get throttled.
> + */
> +DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
> +
>  /**
>   * balance_dirty_pages_ratelimited_nr - balance dirty memory state
>   * @mapping: address_space which was dirtied
> @@ -1242,6 +1258,17 @@ void balance_dirty_pages_ratelimited_nr(
>  			ratelimit = 0;
>  		}
>  	}
> +	/*
> +	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> +	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
> +	 * the dirty throttling and livelock other long-run dirtiers.
> +	 */
> +	p = &__get_cpu_var(dirty_throttle_leaks);
> +	if (*p > 0 && current->nr_dirtied < ratelimit) {
> +		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> +		*p -= nr_pages_dirtied;
> +		current->nr_dirtied += nr_pages_dirtied;
> +	}
>  	preempt_enable();
>  
>  	if (unlikely(current->nr_dirtied >= ratelimit))
> --- linux-next.orig/kernel/exit.c	2011-11-28 21:23:19.000000000 +0800
> +++ linux-next/kernel/exit.c	2011-11-28 21:23:20.000000000 +0800
> @@ -1037,6 +1037,8 @@ NORET_TYPE void do_exit(long code)
>  	validate_creds_for_do_exit(tsk);
>  
>  	preempt_disable();
> +	if (tsk->nr_dirtied)
> +		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
>  	exit_rcu();
>  	/* causes final put_task_struct in finish_task_switch(). */
>  	tsk->state = TASK_DEAD;
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes
  2011-11-28 13:53 ` [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
@ 2011-12-07 10:53   ` Jan Kara
  2011-12-07 12:08     ` Wu Fengguang
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2011-12-07 10:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 28-11-11 21:53:41, Wu Fengguang wrote:
> When dd in 512bytes, generic_perform_write() calls
> balance_dirty_pages_ratelimited() 8 times for the same page, but
> obviously the page is only dirtied once.
  Actually, for ppc where pages can be 64 KB the problem is even worse.

> Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
  I was wondering about one more thing - couldn't we rather check in
generic_perform_write() whether the page was dirty before calling
->write_end and call balance_dirty_pages_ratelimited() only if it wasn't?

  For generic_perform_write() it doesn't really matter that much since we
do things page-by-page anyway but other callers could be more efficient...

								Honza

> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:20.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-28 21:23:23.000000000 +0800
> @@ -1239,8 +1239,6 @@ void balance_dirty_pages_ratelimited_nr(
>  	if (bdi->dirty_exceeded)
>  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>  
> -	current->nr_dirtied += nr_pages_dirtied;
> -
>  	preempt_disable();
>  	/*
>  	 * This prevents one CPU to accumulate too many dirtied pages without
> @@ -1251,12 +1249,9 @@ void balance_dirty_pages_ratelimited_nr(
>  	p =  &__get_cpu_var(bdp_ratelimits);
>  	if (unlikely(current->nr_dirtied >= ratelimit))
>  		*p = 0;
> -	else {
> -		*p += nr_pages_dirtied;
> -		if (unlikely(*p >= ratelimit_pages)) {
> -			*p = 0;
> -			ratelimit = 0;
> -		}
> +	else if (unlikely(*p >= ratelimit_pages)) {
> +		*p = 0;
> +		ratelimit = 0;
>  	}
>  	/*
>  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> @@ -1749,6 +1744,8 @@ void account_page_dirtied(struct page *p
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
>  		task_io_account_write(PAGE_CACHE_SIZE);
> +		current->nr_dirtied++;
> +		this_cpu_inc(bdp_ratelimits);
>  	}
>  }
>  EXPORT_SYMBOL(account_page_dirtied);
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 7/7] writeback: comment on the bdi dirty threshold
  2011-11-28 13:53 ` [PATCH 7/7] writeback: comment on the bdi dirty threshold Wu Fengguang
@ 2011-12-07 10:57   ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2011-12-07 10:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 28-11-11 21:53:45, Wu Fengguang wrote:
> We do "floating proportions" to let active devices grow their target
> share of dirty pages and stalled/inactive devices decrease their target
> share over time.
> 
> It works well except in the case of "an inactive disk suddenly goes
> busy", where the initial target share may be too small. To mitigate
> this, bdi_position_ratio() has the line below to raise a small
> bdi_thresh when it's safe to do so, so that the disk is fed enough
> dirty pages for efficient IO and, in turn, a fast ramp-up of bdi_thresh:
> 
>         bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
> 
> balance_dirty_pages() normally does negative feedback control which
> adjusts ratelimit to balance the bdi dirty pages around the target.
> In some extreme cases when that is not enough, it will have to block 
> the tasks completely until the bdi dirty pages drop below bdi_thresh.
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  Looks good.

  Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c |   16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-23 10:57:41.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-23 11:44:39.000000000 +0800
> @@ -411,8 +411,13 @@ void global_dirty_limits(unsigned long *
>   *
>   * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
>   * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
> - * And the "limit" in the name is not seriously taken as hard limit in
> - * balance_dirty_pages().
> + *
> + * Note that balance_dirty_pages() will only seriously take it as a hard limit
> + * when sleeping max_pause per page is not enough to keep the dirty pages under
> + * control. For example, when the device is completely stalled due to some error
> + * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
> + * In other, normal situations it acts more gently, throttling the tasks
> + * more (rather than completely blocking them) when the bdi dirty pages go high.
>   *
>   * It allocates high/low dirty limits to fast/slow devices, in order to prevent
>   * - starving fast devices
> @@ -594,6 +599,13 @@ static unsigned long bdi_position_ratio(
>  	 */
>  	if (unlikely(bdi_thresh > thresh))
>  		bdi_thresh = thresh;
> +	/*
> +	 * It's very possible that bdi_thresh is close to 0 not because the
> + * device is slow, but because it has remained inactive for a long time.
> + * Honour such devices with a reasonably good (hopefully IO-efficient)
> + * threshold, so that occasional writes won't be blocked and active
> + * writes can ramp up the threshold quickly.
> +	 */
>  	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
>  	/*
>  	 * scale global setpoint to bdi's:
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes
  2011-12-07 10:53   ` Jan Kara
@ 2011-12-07 12:08     ` Wu Fengguang
  2011-12-07 16:07       ` Jan Kara
  0 siblings, 1 reply; 18+ messages in thread
From: Wu Fengguang @ 2011-12-07 12:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

On Wed, Dec 07, 2011 at 06:53:40PM +0800, Jan Kara wrote:
> On Mon 28-11-11 21:53:41, Wu Fengguang wrote:
> > When dd in 512bytes, generic_perform_write() calls
> > balance_dirty_pages_ratelimited() 8 times for the same page, but
> > obviously the page is only dirtied once.
>   Actually, for ppc where pages can be 64 KB the problem is even worse.

Ah yes.

> > Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
>   I was wondering about one more thing - couldn't we rather check in
> generic_perform_write() whether the page was dirty before calling
> ->write_end and call balance_dirty_pages_ratelimited() only if it wasn't?

Cough.. the very first version did exactly that, but then you raised
some concern here:

        https://lkml.org/lkml/2011/4/13/554

The discussion went on and I eventually arrived at the current version,
which looked the most acceptable of the three options.

>   For generic_perform_write() it doesn't really matter that much since we
> do things page-by-page anyway but other callers could be more efficient...

That's right.

Thanks,
Fengguang

> > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |   13 +++++--------
> >  1 file changed, 5 insertions(+), 8 deletions(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:20.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-11-28 21:23:23.000000000 +0800
> > @@ -1239,8 +1239,6 @@ void balance_dirty_pages_ratelimited_nr(
> >  	if (bdi->dirty_exceeded)
> >  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> >  
> > -	current->nr_dirtied += nr_pages_dirtied;
> > -
> >  	preempt_disable();
> >  	/*
> >  	 * This prevents one CPU to accumulate too many dirtied pages without
> > @@ -1251,12 +1249,9 @@ void balance_dirty_pages_ratelimited_nr(
> >  	p =  &__get_cpu_var(bdp_ratelimits);
> >  	if (unlikely(current->nr_dirtied >= ratelimit))
> >  		*p = 0;
> > -	else {
> > -		*p += nr_pages_dirtied;
> > -		if (unlikely(*p >= ratelimit_pages)) {
> > -			*p = 0;
> > -			ratelimit = 0;
> > -		}
> > +	else if (unlikely(*p >= ratelimit_pages)) {
> > +		*p = 0;
> > +		ratelimit = 0;
> >  	}
> >  	/*
> >  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> > @@ -1749,6 +1744,8 @@ void account_page_dirtied(struct page *p
> >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> >  		task_io_account_write(PAGE_CACHE_SIZE);
> > +		current->nr_dirtied++;
> > +		this_cpu_inc(bdp_ratelimits);
> >  	}
> >  }
> >  EXPORT_SYMBOL(account_page_dirtied);
> > 
> > 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes
  2011-12-07 12:08     ` Wu Fengguang
@ 2011-12-07 16:07       ` Jan Kara
  2011-12-08  2:44         ` Wu Fengguang
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2011-12-07 16:07 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Wed 07-12-11 20:08:18, Wu Fengguang wrote:
> On Wed, Dec 07, 2011 at 06:53:40PM +0800, Jan Kara wrote:
> > On Mon 28-11-11 21:53:41, Wu Fengguang wrote:
> > > When dd in 512bytes, generic_perform_write() calls
> > > balance_dirty_pages_ratelimited() 8 times for the same page, but
> > > obviously the page is only dirtied once.
> >   Actually, for ppc where pages can be 64 KB the problem is even worse.
> 
> Ah yes.
> 
> > > Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
> >   I was wondering about one more thing - couldn't we rather check in
> > generic_perform_write() whether the page was dirty before calling
> > ->write_end and call balance_dirty_pages_ratelimited() only if it wasn't?
> 
> Cough.. the very first version did exactly that, but then you raised
> some concern here:
> 
>         https://lkml.org/lkml/2011/4/13/554
> 
> The discussion went on and I eventually arrived at the current version,
> which looked the most acceptable of the three options.
  Good point. I should have researched the web (or my memory) more closely ;)
Thanks for the pointer - it has reminded me why using PageDirty isn't quite
perfect.

> >   For generic_perform_write() it doesn't really matter that much since we
> > do things page-by-page anyway but other callers could be more efficient...
> 
> That's right.
  You can add:
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> > > Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/page-writeback.c |   13 +++++--------
> > >  1 file changed, 5 insertions(+), 8 deletions(-)
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:20.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2011-11-28 21:23:23.000000000 +0800
> > > @@ -1239,8 +1239,6 @@ void balance_dirty_pages_ratelimited_nr(
> > >  	if (bdi->dirty_exceeded)
> > >  		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> > >  
> > > -	current->nr_dirtied += nr_pages_dirtied;
> > > -
> > >  	preempt_disable();
> > >  	/*
> > >  	 * This prevents one CPU to accumulate too many dirtied pages without
> > > @@ -1251,12 +1249,9 @@ void balance_dirty_pages_ratelimited_nr(
> > >  	p =  &__get_cpu_var(bdp_ratelimits);
> > >  	if (unlikely(current->nr_dirtied >= ratelimit))
> > >  		*p = 0;
> > > -	else {
> > > -		*p += nr_pages_dirtied;
> > > -		if (unlikely(*p >= ratelimit_pages)) {
> > > -			*p = 0;
> > > -			ratelimit = 0;
> > > -		}
> > > +	else if (unlikely(*p >= ratelimit_pages)) {
> > > +		*p = 0;
> > > +		ratelimit = 0;
> > >  	}
> > >  	/*
> > >  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
> > > @@ -1749,6 +1744,8 @@ void account_page_dirtied(struct page *p
> > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > >  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> > >  		task_io_account_write(PAGE_CACHE_SIZE);
> > > +		current->nr_dirtied++;
> > > +		this_cpu_inc(bdp_ratelimits);
> > >  	}
> > >  }
> > >  EXPORT_SYMBOL(account_page_dirtied);
> > > 
> > > 
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/7] writeback: fix dirtied pages accounting on redirty
  2011-11-28 13:53 ` [PATCH 4/7] writeback: fix dirtied pages accounting on redirty Wu Fengguang
@ 2011-12-07 16:09   ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2011-12-07 16:09 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 28-11-11 21:53:42, Wu Fengguang wrote:
> De-account the cumulative dirty counters on page redirty.
> 
> Page redirties (very common in ext4) will introduce mismatch between
> counters (a) and (b)
> 
> a) NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied
> b) NR_WRITTEN, BDI_WRITTEN
> 
> This will introduce systematic errors in balanced_rate and result in
> dirty page position errors (ie. the dirty pages are no longer balanced
> around the global/bdi setpoints).
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  You can add:

  Acked-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  include/linux/writeback.h |    2 ++
>  mm/page-writeback.c       |   19 +++++++++++++++++++
>  2 files changed, 21 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-28 21:23:23.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-28 21:23:24.000000000 +0800
> @@ -1806,6 +1806,24 @@ int __set_page_dirty_nobuffers(struct pa
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
>  /*
> + * Call this whenever redirtying a page, to de-account the dirty counters
> + * (NR_DIRTIED, BDI_DIRTIED, tsk->nr_dirtied), so that they match the written
> + * counters (NR_WRITTEN, BDI_WRITTEN) in long term. The mismatches will lead to
> + * systematic errors in balanced_dirty_ratelimit and the dirty pages position
> + * control.
> + */
> +void account_page_redirty(struct page *page)
> +{
> +	struct address_space *mapping = page->mapping;
> +	if (mapping && mapping_cap_account_dirty(mapping)) {
> +		current->nr_dirtied--;
> +		dec_zone_page_state(page, NR_DIRTIED);
> +		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> +	}
> +}
> +EXPORT_SYMBOL(account_page_redirty);
> +
> +/*
>   * When a writepage implementation decides that it doesn't want to write this
>   * page for some reason, it should redirty the locked page via
>   * redirty_page_for_writepage() and it should then unlock the page and return 0
> @@ -1813,6 +1831,7 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers
>  int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page)
>  {
>  	wbc->pages_skipped++;
> +	account_page_redirty(page);
>  	return __set_page_dirty_nobuffers(page);
>  }
>  EXPORT_SYMBOL(redirty_page_for_writepage);
> --- linux-next.orig/include/linux/writeback.h	2011-11-28 21:23:20.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-11-28 21:23:24.000000000 +0800
> @@ -197,6 +197,8 @@ void writeback_set_ratelimit(void);
>  void tag_pages_for_writeback(struct address_space *mapping,
>  			     pgoff_t start, pgoff_t end);
>  
> +void account_page_redirty(struct page *page);
> +
>  /* pdflush.c */
>  extern int nr_pdflush_threads;	/* Global so it can be exported to sysctl
>  				   read-only. */
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/7] writeback: dirty ratelimit - think time compensation
  2011-11-28 13:53 ` [PATCH 6/7] writeback: dirty ratelimit - think time compensation Wu Fengguang
@ 2011-12-07 16:14   ` Jan Kara
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Kara @ 2011-12-07 16:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Jan Kara, Peter Zijlstra, Christoph Hellwig,
	Andrew Morton, LKML

On Mon 28-11-11 21:53:44, Wu Fengguang wrote:
> Compensate for the task's think time when computing the final pause
> time, so that ->dirty_ratelimit can be enforced accurately.
> 
>         think time := time spend outside of balance_dirty_pages()
> 
> In the rare case that the task slept longer than the 200ms period time
> (resulting in a negative pause time), the sleep time will be compensated
> for in the following periods, too, as long as it's less than 1 second.
> 
> Accumulated errors are carefully avoided as long as the max pause area
> is not hit.
> 
> Pseudo code:
> 
>         period = pages_dirtied / task_ratelimit;
>         think = jiffies - dirty_paused_when;
>         pause = period - think;
> 
> 1) normal case: period > think
> 
>         pause = period - think
>         dirty_paused_when = jiffies + pause
>         nr_dirtied = 0
> 
>                              period time
>               |===============================>|
>                   think time      pause time
>               |===============>|==============>|
>         ------|----------------|---------------|------------------------
>         dirty_paused_when   jiffies
> 
> 
> 2) no pause case: period <= think
> 
>         don't pause; reduce future pause time by:
>         dirty_paused_when += period
>         nr_dirtied = 0
> 
>                            period time
>               |===============================>|
>                                   think time
>               |===================================================>|
>         ------|--------------------------------+-------------------|----
>         dirty_paused_when                                       jiffies
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  Looks good. You can add:

  Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/sched.h            |    1 
>  include/trace/events/writeback.h |   14 ++++++++---
>  kernel/fork.c                    |    1 
>  mm/page-writeback.c              |   36 +++++++++++++++++++++++++----
>  4 files changed, 45 insertions(+), 7 deletions(-)
> 
> --- linux-next.orig/include/linux/sched.h	2011-11-17 20:12:22.000000000 +0800
> +++ linux-next/include/linux/sched.h	2011-11-17 20:12:35.000000000 +0800
> @@ -1527,6 +1527,7 @@ struct task_struct {
>  	 */
>  	int nr_dirtied;
>  	int nr_dirtied_pause;
> +	unsigned long dirty_paused_when; /* start of a write-and-pause period */
>  
>  #ifdef CONFIG_LATENCYTOP
>  	int latency_record_count;
> --- linux-next.orig/mm/page-writeback.c	2011-11-17 20:12:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-17 20:12:38.000000000 +0800
> @@ -1010,6 +1010,7 @@ static void balance_dirty_pages(struct a
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> +	long period;
>  	long pause = 0;
>  	long uninitialized_var(max_pause);
>  	bool dirty_exceeded = false;
> @@ -1020,6 +1021,8 @@ static void balance_dirty_pages(struct a
>  	unsigned long start_time = jiffies;
>  
>  	for (;;) {
> +		unsigned long now = jiffies;
> +
>  		/*
>  		 * Unstable writes are a feature of certain networked
>  		 * filesystems (i.e. NFS) in which data may have been
> @@ -1039,8 +1042,11 @@ static void balance_dirty_pages(struct a
>  		 */
>  		freerun = dirty_freerun_ceiling(dirty_thresh,
>  						background_thresh);
> -		if (nr_dirty <= freerun)
> +		if (nr_dirty <= freerun) {
> +			current->dirty_paused_when = now;
> +			current->nr_dirtied = 0;
>  			break;
> +		}
>  
>  		if (unlikely(!writeback_in_progress(bdi)))
>  			bdi_start_background_writeback(bdi);
> @@ -1098,10 +1104,21 @@ static void balance_dirty_pages(struct a
>  		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
>  							RATELIMIT_CALC_SHIFT;
>  		if (unlikely(task_ratelimit == 0)) {
> +			period = max_pause;
>  			pause = max_pause;
>  			goto pause;
>  		}
> -		pause = HZ * pages_dirtied / task_ratelimit;
> +		period = HZ * pages_dirtied / task_ratelimit;
> +		pause = period;
> +		if (current->dirty_paused_when)
> +			pause -= now - current->dirty_paused_when;
> +		/*
> +		 * For less than 1s think time (ext3/4 may block the dirtier
> +		 * for up to 800ms from time to time on 1-HDD; so does xfs,
> +		 * however at much less frequency), try to compensate it in
> +		 * future periods by updating the virtual time; otherwise just
> +		 * do a reset, as it may be a light dirtier.
> +		 */
>  		if (unlikely(pause <= 0)) {
>  			trace_balance_dirty_pages(bdi,
>  						  dirty_thresh,
> @@ -1112,8 +1129,16 @@ static void balance_dirty_pages(struct a
>  						  dirty_ratelimit,
>  						  task_ratelimit,
>  						  pages_dirtied,
> +						  period,
>  						  pause,
>  						  start_time);
> +			if (pause < -HZ) {
> +				current->dirty_paused_when = now;
> +				current->nr_dirtied = 0;
> +			} else if (period) {
> +				current->dirty_paused_when += period;
> +				current->nr_dirtied = 0;
> +			}
>  			pause = 1; /* avoid resetting nr_dirtied_pause below */
>  			break;
>  		}
> @@ -1129,11 +1154,15 @@ pause:
>  					  dirty_ratelimit,
>  					  task_ratelimit,
>  					  pages_dirtied,
> +					  period,
>  					  pause,
>  					  start_time);
>  		__set_current_state(TASK_KILLABLE);
>  		io_schedule_timeout(pause);
>  
> +		current->dirty_paused_when = now + pause;
> +		current->nr_dirtied = 0;
> +
>  		/*
>  		 * This is typically equal to (nr_dirty < dirty_thresh) and can
>  		 * also keep "1000+ dd on a slow USB stick" under control.
> @@ -1148,11 +1177,10 @@ pause:
>  	if (!dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
> -	current->nr_dirtied = 0;
>  	if (pause == 0) { /* in freerun area */
>  		current->nr_dirtied_pause =
>  				dirty_poll_interval(nr_dirty, dirty_thresh);
> -	} else if (pause <= max_pause / 4 &&
> +	} else if (period <= max_pause / 4 &&
>  		   pages_dirtied >= current->nr_dirtied_pause) {
>  		current->nr_dirtied_pause = clamp_val(
>  					dirty_ratelimit * (max_pause / 2) / HZ,
> --- linux-next.orig/kernel/fork.c	2011-11-17 20:12:23.000000000 +0800
> +++ linux-next/kernel/fork.c	2011-11-17 20:12:35.000000000 +0800
> @@ -1296,6 +1296,7 @@ static struct task_struct *copy_process(
>  
>  	p->nr_dirtied = 0;
>  	p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
> +	p->dirty_paused_when = 0;
>  
>  	/*
>  	 * Ok, make it visible to the rest of the system.
> --- linux-next.orig/include/trace/events/writeback.h	2011-11-17 19:13:41.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h	2011-11-17 20:12:35.000000000 +0800
> @@ -289,12 +289,13 @@ TRACE_EVENT(balance_dirty_pages,
>  		 unsigned long dirty_ratelimit,
>  		 unsigned long task_ratelimit,
>  		 unsigned long dirtied,
> +		 unsigned long period,
>  		 long pause,
>  		 unsigned long start_time),
>  
>  	TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
>  		dirty_ratelimit, task_ratelimit,
> -		dirtied, pause, start_time),
> +		dirtied, period, pause, start_time),
>  
>  	TP_STRUCT__entry(
>  		__array(	 char,	bdi, 32)
> @@ -309,6 +310,8 @@ TRACE_EVENT(balance_dirty_pages,
>  		__field(unsigned int,	dirtied_pause)
>  		__field(unsigned long,	paused)
>  		__field(	 long,	pause)
> +		__field(unsigned long,	period)
> +		__field(	 long,	think)
>  	),
>  
>  	TP_fast_assign(
> @@ -325,6 +328,9 @@ TRACE_EVENT(balance_dirty_pages,
>  		__entry->task_ratelimit	= KBps(task_ratelimit);
>  		__entry->dirtied	= dirtied;
>  		__entry->dirtied_pause	= current->nr_dirtied_pause;
> +		__entry->think		= current->dirty_paused_when == 0 ? 0 :
> +			 (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
> +		__entry->period		= period * 1000 / HZ;
>  		__entry->pause		= pause * 1000 / HZ;
>  		__entry->paused		= (jiffies - start_time) * 1000 / HZ;
>  	),
> @@ -335,7 +341,7 @@ TRACE_EVENT(balance_dirty_pages,
>  		  "bdi_setpoint=%lu bdi_dirty=%lu "
>  		  "dirty_ratelimit=%lu task_ratelimit=%lu "
>  		  "dirtied=%u dirtied_pause=%u "
> -		  "paused=%lu pause=%ld",
> +		  "paused=%lu pause=%ld period=%lu think=%ld",
>  		  __entry->bdi,
>  		  __entry->limit,
>  		  __entry->setpoint,
> @@ -347,7 +353,9 @@ TRACE_EVENT(balance_dirty_pages,
>  		  __entry->dirtied,
>  		  __entry->dirtied_pause,
>  		  __entry->paused,	/* ms */
> -		  __entry->pause	/* ms */
> +		  __entry->pause,	/* ms */
> +		  __entry->period,	/* ms */
> +		  __entry->think	/* ms */
>  	  )
>  );
>  
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes
  2011-12-07 16:07       ` Jan Kara
@ 2011-12-08  2:44         ` Wu Fengguang
  0 siblings, 0 replies; 18+ messages in thread
From: Wu Fengguang @ 2011-12-08  2:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Christoph Hellwig, Andrew Morton, LKML

On Thu, Dec 08, 2011 at 12:07:53AM +0800, Jan Kara wrote:
> On Wed 07-12-11 20:08:18, Wu Fengguang wrote:
> > On Wed, Dec 07, 2011 at 06:53:40PM +0800, Jan Kara wrote:
> > > On Mon 28-11-11 21:53:41, Wu Fengguang wrote:
> > > > When dd in 512bytes, generic_perform_write() calls
> > > > balance_dirty_pages_ratelimited() 8 times for the same page, but
> > > > obviously the page is only dirtied once.
> > >   Actually, for ppc where pages can be 64 KB the problem is even worse.
> > 
> > Ah yes.
> > 
> > > > Fix it by accounting tsk->nr_dirtied and bdp_ratelimits at page dirty time.
> > >   I was wondering about one more thing - couldn't we rather check in
> > > generic_perform_write() whether the page was dirty before calling
> > > ->write_end and call balance_dirty_pages_ratelimited() only if it wasn't?
> > 
> > Cough.. the very first version did exactly that, but then you raised
> > some concern here:
> > 
> >         https://lkml.org/lkml/2011/4/13/554
> > 
> > The discussion went on and I eventually arrived at the current version,
> > which looked the most acceptable of the three options.
>   Good point. I should have researched the web (or my memory) more closely ;)
> Thanks for the pointer - it has reminded me why using PageDirty isn't quite
> perfect.
> 
> > >   For generic_perform_write() it doesn't really matter that much since we
> > > do things page-by-page anyway but other callers could be more efficient...
> > 
> > That's right.
>   You can add:
> Acked-by: Jan Kara <jack@suse.cz>

OK, thanks!

Fengguang

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-12-08  2:54 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
2011-11-28 13:53 ` [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
2011-12-07 10:21   ` Jan Kara
2011-11-28 13:53 ` [PATCH 2/7] writeback: charge leaked page dirties to active tasks Wu Fengguang
2011-12-07 10:23   ` Jan Kara
2011-11-28 13:53 ` [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-12-07 10:53   ` Jan Kara
2011-12-07 12:08     ` Wu Fengguang
2011-12-07 16:07       ` Jan Kara
2011-12-08  2:44         ` Wu Fengguang
2011-11-28 13:53 ` [PATCH 4/7] writeback: fix dirtied pages accounting on redirty Wu Fengguang
2011-12-07 16:09   ` Jan Kara
2011-11-28 13:53 ` [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-11-28 14:16   ` Wu Fengguang
2011-11-28 13:53 ` [PATCH 6/7] writeback: dirty ratelimit - think time compensation Wu Fengguang
2011-12-07 16:14   ` Jan Kara
2011-11-28 13:53 ` [PATCH 7/7] writeback: comment on the bdi dirty threshold Wu Fengguang
2011-12-07 10:57   ` Jan Kara
