* [PATCH 00/17] per device dirty throttling -v7
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

Latest version of the per bdi dirty throttling patches.

Most of the changes since last time are small cleanups and more detail
in splitting out the floating proportions into their own little lib.

Patches are against 2.6.22-rc4-mm2

A rollup of all this against 2.6.21 is available here:
  http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/2.6.21-per_bdi_dirty_pages.patch

This patch set passes the "starve a USB stick" test.
-- 



* [PATCH 01/17] nfs: remove congestion_end()
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: nfs_congestion_fixup.patch --]
[-- Type: text/plain, Size: 1978 bytes --]

It's redundant; clear_bdi_congested() already wakes the waiters.
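
For reference, clear_bdi_congested() at the time already ended with the
waitqueue_active()/wake_up() pair that congestion_end() performs,
roughly (paraphrased from memory, not part of this patch):

  void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
  {
      enum bdi_state bit;
      wait_queue_head_t *wqh = &congestion_wqh[rw];

      bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
      clear_bit(bit, &bdi->state);
      smp_mb__after_clear_bit();
      if (waitqueue_active(wqh))
          wake_up(wqh);
  }

so the congestion_end() call right after it woke the same waitqueue a
second time.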

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/write.c              |    5 ++---
 include/linux/backing-dev.h |    1 -
 mm/backing-dev.c            |   13 -------------
 3 files changed, 2 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -235,10 +235,9 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+	if (atomic_long_dec_return(&nfss->writeback) <
+			NFS_CONGESTION_OFF_THRESH)
 		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-		congestion_end(WRITE);
-	}
 }
 
 /*
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -94,7 +94,6 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
 long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -70,16 +70,3 @@ long congestion_wait_interruptible(int r
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);

-- 



* [PATCH 02/17] lib: percpu_counter variable batch
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: percpu_counter_batch.patch --]
[-- Type: text/plain, Size: 2504 bytes --]

Because the current batch setup has a quadratic error bound on the counter,
allow for an alternative setup with a caller-supplied batch size.
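
To make the error bound concrete, here is a minimal userspace C model
of the batching scheme (my illustration, not kernel code). Each CPU may
hold back up to batch-1 events before folding them into the global
count, so a cheap read can be off by almost nr_cpus * batch; and since
the default FBC_BATCH itself scales with the number of CPUs, that worst
case grows quadratically with the CPU count. A per-call batch lets a
counter pick a different accuracy/contention trade-off.

  #include <stdio.h>

  #define NR_CPUS 4

  static long count;            /* the cheap-to-read global count */
  static long pcount[NR_CPUS];  /* per-"cpu" deferred deltas */

  static void counter_mod(int cpu, long amount, long batch)
  {
      long c = pcount[cpu] + amount;

      if (c >= batch || c <= -batch) {  /* fold into global count */
          count += c;
          pcount[cpu] = 0;
      } else {
          pcount[cpu] = c;              /* defer, bounded by batch */
      }
  }

  int main(void)
  {
      long batch = 32;
      int cpu;

      /* every cpu holds back batch-1 events ... */
      for (cpu = 0; cpu < NR_CPUS; cpu++)
          counter_mod(cpu, batch - 1, batch);

      /* ... so the cheap read lags the true value */
      printf("read=%ld true=%ld error bound=%ld\n",
             count, count + NR_CPUS * (batch - 1),
             (long)NR_CPUS * batch);
      return 0;
  }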

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   10 +++++++++-
 lib/percpu_counter.c           |    6 +++---
 2 files changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:34:12.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:36:06.000000000 +0200
@@ -32,9 +32,14 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
+static inline void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+{
+	__percpu_counter_mod(fbc, amount, FBC_BATCH);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -70,6 +75,9 @@ static inline void percpu_counter_destro
 {
 }
 
+#define __percpu_counter_mod(fbc, amount, batch) \
+	percpu_counter_mod(fbc, amount)
+
 static inline void
 percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
 {
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2007-05-23 20:34:12.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-05-23 20:36:21.000000000 +0200
@@ -14,7 +14,7 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch)
 {
 	long count;
 	s32 *pcount;
@@ -22,7 +22,7 @@ void percpu_counter_mod(struct percpu_co
 
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
 		*pcount = 0;
@@ -32,7 +32,7 @@ void percpu_counter_mod(struct percpu_co
 	}
 	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(__percpu_counter_mod);
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate

-- 



* [PATCH 03/17] lib: percpu_counter_mod64
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: percpu_counter_mod.patch --]
[-- Type: text/plain, Size: 2725 bytes --]

Add percpu_counter_mod64() to allow large modifications.
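
A hypothetical call site (my example, not from this series): a counter
accounting in bytes rather than pages can exceed the s32 amount that
percpu_counter_mod() takes, e.g.

  /* 3 GiB overflows an s32 amount */
  percpu_counter_mod64(&bytes_written, 3LL << 30);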

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   17 +++++++++++++++++
 lib/percpu_counter.c           |   20 ++++++++++++++++++++
 2 files changed, 37 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:36:06.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:37:41.000000000 +0200
@@ -33,6 +33,7 @@ struct percpu_counter {
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
+void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
@@ -40,6 +41,11 @@ static inline void percpu_counter_mod(st
 	__percpu_counter_mod(fbc, amount, FBC_BATCH);
 }
 
+static inline void percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	__percpu_counter_mod64(fbc, amount, FBC_BATCH);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -86,6 +92,17 @@ percpu_counter_mod(struct percpu_counter
 	preempt_enable();
 }
 
+#define __percpu_counter_mod64(fbc, amount, batch) \
+	percpu_counter_mod64(fbc, amount)
+
+static inline void
+percpu_counter_mod64(struct percpu_counter *fbc, s64 amount)
+{
+	preempt_disable();
+	fbc->count += amount;
+	preempt_enable();
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2007-05-23 20:36:21.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-05-23 20:37:34.000000000 +0200
@@ -34,6 +34,26 @@ void __percpu_counter_mod(struct percpu_
 }
 EXPORT_SYMBOL(__percpu_counter_mod);
 
+void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+	s64 count;
+	s32 *pcount;
+	int cpu = get_cpu();
+
+	pcount = per_cpu_ptr(fbc->counters, cpu);
+	count = *pcount + amount;
+	if (count >= batch || count <= -batch) {
+		spin_lock(&fbc->lock);
+		fbc->count += count;
+		*pcount = 0;
+		spin_unlock(&fbc->lock);
+	} else {
+		*pcount = count;
+	}
+	put_cpu();
+}
+EXPORT_SYMBOL(__percpu_counter_mod64);
+
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()

-- 



* [PATCH 04/17] lib: percpu_counter_set
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: percpu_counter_set.patch --]
[-- Type: text/plain, Size: 1979 bytes --]

Provide a method to set a percpu counter to a specified value.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    6 ++++++
 lib/percpu_counter.c           |   13 +++++++++++++
 2 files changed, 19 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:37:41.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:37:54.000000000 +0200
@@ -32,6 +32,7 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
 void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
@@ -81,6 +82,11 @@ static inline void percpu_counter_destro
 {
 }
 
+static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	fbc->count = amount;
+}
+
 #define __percpu_counter_mod(fbc, amount, batch) \
 	percpu_counter_mod(fbc, amount)
 
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2007-05-23 20:37:34.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-05-23 20:38:03.000000000 +0200
@@ -14,6 +14,19 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	int cpu;
+
+	spin_lock(&fbc->lock);
+	for_each_possible_cpu(cpu) {
+		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		*pcount = 0;
+	}
+	fbc->count = amount;
+	spin_unlock(&fbc->lock);
+}
+
 void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch)
 {
 	long count;

-- 



* [PATCH 05/17] lib: percpu_counter_sum_signed()
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: percpu_counter_sum.patch --]
[-- Type: text/plain, Size: 2646 bytes --]

Provide an accurate version of percpu_counter_read().

Should we go and replace the current use of percpu_counter_sum()
with percpu_counter_sum_positive(), and call this new primitive
percpu_counter_sum() instead?
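
A tiny userspace model of the three read flavours (my illustration, not
kernel code) shows why the signed sum matters while deltas are still
pending in the per-cpu slots:

  #include <stdio.h>

  #define NR_CPUS 4

  static long count;            /* approximate global count */
  static long pcount[NR_CPUS];  /* per-"cpu" deltas */

  static long counter_read(void)          /* cheap, fuzzy */
  {
      return count;
  }

  static long counter_sum_signed(void)    /* slow, accurate */
  {
      long ret = count;
      int cpu;

      for (cpu = 0; cpu < NR_CPUS; cpu++)
          ret += pcount[cpu];
      return ret;
  }

  static long counter_sum(void)           /* old clamped semantics */
  {
      long ret = counter_sum_signed();
      return ret < 0 ? 0 : ret;
  }

  int main(void)
  {
      pcount[0] = -3;   /* deltas not yet folded back */

      printf("read=%ld sum=%ld sum_signed=%ld\n",
             counter_read(), counter_sum(), counter_sum_signed());
      /* prints: read=0 sum=0 sum_signed=-3 */
      return 0;
  }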

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   18 +++++++++++++++++-
 lib/percpu_counter.c           |    6 +++---
 2 files changed, 20 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:37:54.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:38:09.000000000 +0200
@@ -35,7 +35,18 @@ void percpu_counter_destroy(struct percp
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
 void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 percpu_counter_sum(struct percpu_counter *fbc);
+s64 __percpu_counter_sum(struct percpu_counter *fbc);
+
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	s64 ret = __percpu_counter_sum(fbc);
+	return ret < 0 ? 0 : ret;
+}
+
+static inline s64 percpu_counter_sum_signed(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum(fbc);
+}
 
 static inline void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
 {
@@ -124,6 +135,11 @@ static inline s64 percpu_counter_sum(str
 	return percpu_counter_read_positive(fbc);
 }
 
+static inline s64 percpu_counter_sum_signed(struct percpu_counter *fbc)
+{
+	return fbc->count;
+}
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2007-05-23 20:38:03.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-05-23 20:38:18.000000000 +0200
@@ -71,7 +71,7 @@ EXPORT_SYMBOL(__percpu_counter_mod64);
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
  */
-s64 percpu_counter_sum(struct percpu_counter *fbc)
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
@@ -83,9 +83,9 @@ s64 percpu_counter_sum(struct percpu_cou
 		ret += *pcount;
 	}
 	spin_unlock(&fbc->lock);
-	return ret < 0 ? 0 : ret;
+	return ret;
 }
-EXPORT_SYMBOL(percpu_counter_sum);
+EXPORT_SYMBOL(__percpu_counter_sum);
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

-- 



* [PATCH 06/17] lib: percpu_counter_init_irq
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: percpu_counter_init_irq.patch --]
[-- Type: text/plain, Size: 1914 bytes --]

Provide a way to init percpu_counters that are supposed to be used from
IRQ-safe contexts.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    4 ++++
 lib/percpu_counter.c           |    8 ++++++++
 2 files changed, 12 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -31,6 +31,8 @@ struct percpu_counter {
 #endif
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
+void percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount);
+
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
@@ -89,6 +91,8 @@ static inline void percpu_counter_init(s
 	fbc->count = amount;
 }
 
+#define percpu_counter_init_irq percpu_counter_init
+
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 }
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -87,6 +87,8 @@ s64 __percpu_counter_sum(struct percpu_c
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+static struct lock_class_key percpu_counter_irqsafe;
+
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
@@ -100,6 +102,12 @@ void percpu_counter_init(struct percpu_c
 }
 EXPORT_SYMBOL(percpu_counter_init);
 
+void percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount)
+{
+	percpu_counter_init(fbc, amount);
+	lockdep_set_class(&fbc->lock, &percpu_counter_irqsafe);
+}
+
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 	free_percpu(fbc->counters);

-- 



* [PATCH 07/17] mm: bdi init hooks
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: bdi_init.patch --]
[-- Type: text/plain, Size: 13028 bytes --]

Provide BDI constructor/destructor hooks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c               |    2 ++
 drivers/block/rd.c              |    6 ++++++
 drivers/char/mem.c              |    2 ++
 drivers/mtd/mtdcore.c           |    5 +++++
 fs/char_dev.c                   |    1 +
 fs/configfs/configfs_internal.h |    2 ++
 fs/configfs/inode.c             |    8 ++++++++
 fs/configfs/mount.c             |    2 ++
 fs/fuse/inode.c                 |    2 ++
 fs/hugetlbfs/inode.c            |    3 +++
 fs/nfs/client.c                 |    3 +++
 fs/ocfs2/dlm/dlmfs.c            |    6 +++++-
 fs/ramfs/inode.c                |    1 +
 fs/sysfs/inode.c                |    5 +++++
 fs/sysfs/mount.c                |    2 ++
 fs/sysfs/sysfs.h                |    1 +
 include/linux/backing-dev.h     |    7 +++++++
 mm/readahead.c                  |    7 +++++++
 mm/shmem.c                      |    1 +
 mm/swap.c                       |    4 ++++
 20 files changed, 69 insertions(+), 1 deletion(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c	2007-06-07 08:57:49.000000000 +0200
+++ linux-2.6/block/ll_rw_blk.c	2007-06-07 16:11:16.000000000 +0200
@@ -1774,6 +1774,7 @@ static void blk_release_queue(struct kob
 
 	blk_trace_shutdown(q);
 
+	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1841,6 +1842,7 @@ request_queue_t *blk_alloc_queue_node(gf
 
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
+	bdi_init(&q->backing_dev_info);
 
 	mutex_init(&q->sysfs_lock);
 
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c	2007-06-07 15:38:55.000000000 +0200
+++ linux-2.6/drivers/block/rd.c	2007-06-07 15:39:34.000000000 +0200
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
 		blk_cleanup_queue(rd_queue[i]);
 	}
 	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+	bdi_destroy(&rd_file_backing_dev_info);
+	bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -421,6 +424,9 @@ static int __init rd_init(void)
 	int i;
 	int err = -ENOMEM;
 
+	bdi_init(&rd_backing_dev_info);
+	bdi_init(&rd_file_backing_dev_info);
+
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
 		printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/drivers/char/mem.c	2007-06-07 15:39:34.000000000 +0200
@@ -987,6 +987,8 @@ static int __init chr_dev_init(void)
 			      MKDEV(MEM_MAJOR, devlist[i].minor),
 			      devlist[i].name);
 
+	bdi_init(&zero_bdi);
+
 	return 0;
 }
 
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/fs/char_dev.c	2007-06-07 15:39:34.000000000 +0200
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_init(&directly_mappable_cdev_bdi);
 }
 
 
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c	2007-06-07 08:57:55.000000000 +0200
+++ linux-2.6/fs/fuse/inode.c	2007-06-07 15:39:34.000000000 +0200
@@ -433,6 +433,7 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		bdi_init(&fc->bdi);
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
@@ -446,6 +447,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		mutex_destroy(&fc->inst_mutex);
+		bdi_destroy(&fc->bdi);
 		kfree(fc);
 	}
 }
Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c	2007-06-07 08:57:55.000000000 +0200
+++ linux-2.6/fs/nfs/client.c	2007-06-07 15:39:34.000000000 +0200
@@ -658,6 +658,8 @@ static void nfs_server_set_fsinfo(struct
 	if (server->rsize > NFS_MAX_FILE_IO_SIZE)
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+	bdi_init(&server->backing_dev_info);
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
@@ -787,6 +789,7 @@ void nfs_free_server(struct nfs_server *
 	nfs_put_client(server->nfs_client);
 
 	nfs_free_iostats(server->io_stats);
+	bdi_destroy(&server->backing_dev_info);
 	kfree(server);
 	nfs_release_automount_timer();
 	dprintk("<-- nfs_free_server()\n");
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-06-07 15:39:25.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-06-07 16:11:19.000000000 +0200
@@ -34,6 +34,13 @@ struct backing_dev_info {
 	void *unplug_io_data;
 };
 
+static inline void bdi_init(struct backing_dev_info *bdi)
+{
+}
+
+static inline void bdi_destroy(struct backing_dev_info *bdi)
+{
+}
 
 /*
  * Flags in backing_dev_info::capability
Index: linux-2.6/drivers/mtd/mtdcore.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdcore.c	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/drivers/mtd/mtdcore.c	2007-06-07 15:39:34.000000000 +0200
@@ -60,6 +60,7 @@ int add_mtd_device(struct mtd_info *mtd)
 			break;
 		}
 	}
+	bdi_init(mtd->backing_dev_info);
 
 	BUG_ON(mtd->writesize == 0);
 	mutex_lock(&mtd_table_mutex);
@@ -142,6 +143,10 @@ int del_mtd_device (struct mtd_info *mtd
 	}
 
 	mutex_unlock(&mtd_table_mutex);
+
+	if (mtd->backing_dev_info)
+		bdi_destroy(mtd->backing_dev_info);
+
 	return ret;
 }
 
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2007-06-07 15:38:56.000000000 +0200
+++ linux-2.6/fs/hugetlbfs/inode.c	2007-06-07 15:39:34.000000000 +0200
@@ -831,6 +831,8 @@ static int __init init_hugetlbfs_fs(void
  out:
 	if (error)
 		kmem_cache_destroy(hugetlbfs_inode_cachep);
+	else
+		bdi_init(&hugetlbfs_backing_dev_info);
 	return error;
 }
 
@@ -838,6 +840,7 @@ static void __exit exit_hugetlbfs_fs(voi
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
 	unregister_filesystem(&hugetlbfs_fs_type);
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
 
 module_init(init_hugetlbfs_fs)
Index: linux-2.6/fs/ocfs2/dlm/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlm/dlmfs.c	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/fs/ocfs2/dlm/dlmfs.c	2007-06-07 15:39:34.000000000 +0200
@@ -611,8 +611,10 @@ bail:
 			kmem_cache_destroy(dlmfs_inode_cache);
 		if (cleanup_worker)
 			destroy_workqueue(user_dlm_worker);
-	} else
+	} else {
+		bdi_init(&dlmfs_backing_dev_info);
 		printk("OCFS2 User DLM kernel interface loaded\n");
+	}
 	return status;
 }
 
@@ -624,6 +626,8 @@ static void __exit exit_dlmfs_fs(void)
 	destroy_workqueue(user_dlm_worker);
 
 	kmem_cache_destroy(dlmfs_inode_cache);
+
+	bdi_destroy(&dlmfs_backing_dev_info);
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/fs/configfs/configfs_internal.h	2007-06-07 15:39:34.000000000 +0200
@@ -55,6 +55,8 @@ extern int configfs_is_root(struct confi
 
 extern struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent *);
 extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+extern void configfs_inode_init(void);
+extern void configfs_inode_exit(void);
 
 extern int configfs_create_file(struct config_item *, const struct configfs_attribute *);
 extern int configfs_make_dirent(struct configfs_dirent *,
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c	2007-06-07 15:38:56.000000000 +0200
+++ linux-2.6/fs/configfs/inode.c	2007-06-07 15:39:34.000000000 +0200
@@ -256,4 +256,12 @@ void configfs_hash_and_remove(struct den
 	mutex_unlock(&dir->d_inode->i_mutex);
 }
 
+void __init configfs_inode_init(void)
+{
+	bdi_init(&configfs_backing_dev_info);
+}
 
+void __exit configfs_inode_exit(void)
+{
+	bdi_destroy(&configfs_backing_dev_info);
+}
Index: linux-2.6/fs/configfs/mount.c
===================================================================
--- linux-2.6.orig/fs/configfs/mount.c	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/fs/configfs/mount.c	2007-06-07 15:39:34.000000000 +0200
@@ -156,6 +156,7 @@ static int __init configfs_init(void)
 		configfs_dir_cachep = NULL;
 	}
 
+	configfs_inode_init();
 out:
 	return err;
 }
@@ -166,6 +167,7 @@ static void __exit configfs_exit(void)
 	subsystem_unregister(&config_subsys);
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
+	configfs_inode_exit();
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c	2007-06-06 15:16:25.000000000 +0200
+++ linux-2.6/fs/ramfs/inode.c	2007-06-07 15:39:34.000000000 +0200
@@ -223,6 +223,7 @@ module_exit(exit_ramfs_fs)
 
 int __init init_rootfs(void)
 {
+	bdi_init(&ramfs_backing_dev_info);
 	return register_filesystem(&rootfs_fs_type);
 }
 
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c	2007-06-07 15:38:56.000000000 +0200
+++ linux-2.6/fs/sysfs/inode.c	2007-06-07 15:39:34.000000000 +0200
@@ -34,6 +34,11 @@ static const struct inode_operations sys
 	.setattr	= sysfs_setattr,
 };
 
+void __init sysfs_inode_init(void)
+{
+	bdi_init(&sysfs_backing_dev_info);
+}
+
 void sysfs_delete_inode(struct inode *inode)
 {
 	/* Free the shadowed directory inode operations */
Index: linux-2.6/fs/sysfs/mount.c
===================================================================
--- linux-2.6.orig/fs/sysfs/mount.c	2007-06-06 15:22:15.000000000 +0200
+++ linux-2.6/fs/sysfs/mount.c	2007-06-07 15:39:34.000000000 +0200
@@ -103,6 +103,8 @@ int __init sysfs_init(void)
 	} else
 		goto out_err;
 out:
+	if (!err)
+		sysfs_inode_init();
 	return err;
 out_err:
 	kmem_cache_destroy(sysfs_dir_cachep);
Index: linux-2.6/fs/sysfs/sysfs.h
===================================================================
--- linux-2.6.orig/fs/sysfs/sysfs.h	2007-06-06 15:22:15.000000000 +0200
+++ linux-2.6/fs/sysfs/sysfs.h	2007-06-07 15:39:43.000000000 +0200
@@ -57,6 +57,7 @@ extern void sysfs_delete_inode(struct in
 extern void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode);
 extern struct inode * sysfs_get_inode(struct sysfs_dirent *sd);
 extern void sysfs_instantiate(struct dentry *dentry, struct inode *inode);
+extern void sysfs_inode_init(void);
 
 extern void release_sysfs_dirent(struct sysfs_dirent * sd);
 extern int sysfs_dirent_exist(struct sysfs_dirent *, const unsigned char *);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2007-06-07 15:38:59.000000000 +0200
+++ linux-2.6/mm/shmem.c	2007-06-07 15:39:34.000000000 +0200
@@ -2444,6 +2444,7 @@ static int __init init_tmpfs(void)
 		printk(KERN_ERR "Could not kern_mount tmpfs\n");
 		goto out1;
 	}
+	bdi_init(&shmem_backing_dev_info);
 	return 0;
 
 out1:
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c	2007-06-07 08:57:57.000000000 +0200
+++ linux-2.6/mm/swap.c	2007-06-07 15:39:34.000000000 +0200
@@ -550,6 +550,10 @@ void __init swap_setup(void)
 {
 	unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
 
+#ifdef CONFIG_SWAP
+	bdi_init(swapper_space.backing_dev_info);
+#endif
+
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
 		page_cluster = 2;
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c	2007-06-07 15:38:59.000000000 +0200
+++ linux-2.6/mm/readahead.c	2007-06-07 15:39:34.000000000 +0200
@@ -242,6 +242,13 @@ unsigned long max_sane_readahead(unsigne
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
+static int __init readahead_init(void)
+{
+	bdi_init(&default_backing_dev_info);
+	return 0;
+}
+subsys_initcall(readahead_init);
+
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */

-- 



* [PATCH 08/17] containers: bdi init hooks
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: bdi_init_container.patch --]
[-- Type: text/plain, Size: 1299 bytes --]

Split off from the large bdi_init patch because containers are not slated
for mainline any time soon.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/container.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/container.c
===================================================================
--- linux-2.6.orig/kernel/container.c
+++ linux-2.6/kernel/container.c
@@ -554,12 +554,13 @@ static int container_populate_dir(struct
 static struct inode_operations container_dir_inode_operations;
 static struct file_operations proc_containerstats_operations;
 
+static struct backing_dev_info container_backing_dev_info = {
+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
+};
+
 static struct inode *container_new_inode(mode_t mode, struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
-	static struct backing_dev_info container_backing_dev_info = {
-		.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
-	};
 
 	if (inode) {
 		inode->i_mode = mode;
@@ -2058,6 +2059,8 @@ int __init container_init(void)
 	if (err < 0)
 		goto out;
 
+	bdi_init(&container_backing_dev_info);
+
 	entry = create_proc_entry("containers", 0, NULL);
 	if (entry)
 		entry->proc_fops = &proc_containerstats_operations;

-- 



* [PATCH 09/17] mtd: give mtdconcat devices their own backing_dev_info
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea, Robert Kaiser

[-- Attachment #1: bdi_mtdconcat.patch --]
[-- Type: text/plain, Size: 3283 bytes --]

These are actual devices; give them their own BDI.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Kaiser <rkaiser@sysgo.de>
---
 drivers/mtd/mtdconcat.c |   28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/mtd/mtdconcat.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdconcat.c	2007-04-22 18:55:17.000000000 +0200
+++ linux-2.6/drivers/mtd/mtdconcat.c	2007-04-22 19:01:42.000000000 +0200
@@ -32,6 +32,7 @@ struct mtd_concat {
 	struct mtd_info mtd;
 	int num_subdev;
 	struct mtd_info **subdev;
+	struct backing_dev_info backing_dev_info;
 };
 
 /*
@@ -782,10 +783,9 @@ struct mtd_info *mtd_concat_create(struc
 
 	for (i = 1; i < num_devs; i++) {
 		if (concat->mtd.type != subdev[i]->type) {
-			kfree(concat);
 			printk("Incompatible device type on \"%s\"\n",
 			       subdev[i]->name);
-			return NULL;
+			goto error;
 		}
 		if (concat->mtd.flags != subdev[i]->flags) {
 			/*
@@ -794,10 +794,9 @@ struct mtd_info *mtd_concat_create(struc
 			 */
 			if ((concat->mtd.flags ^ subdev[i]->
 			     flags) & ~MTD_WRITEABLE) {
-				kfree(concat);
 				printk("Incompatible device flags on \"%s\"\n",
 				       subdev[i]->name);
-				return NULL;
+				goto error;
 			} else
 				/* if writeable attribute differs,
 				   make super device writeable */
@@ -809,9 +808,12 @@ struct mtd_info *mtd_concat_create(struc
 		 * - copy-mapping is still permitted
 		 */
 		if (concat->mtd.backing_dev_info !=
-		    subdev[i]->backing_dev_info)
+		    subdev[i]->backing_dev_info) {
+			concat->backing_dev_info = default_backing_dev_info;
+			bdi_init(&concat->backing_dev_info);
 			concat->mtd.backing_dev_info =
-				&default_backing_dev_info;
+				&concat->backing_dev_info;
+		}
 
 		concat->mtd.size += subdev[i]->size;
 		concat->mtd.ecc_stats.badblocks +=
@@ -821,10 +823,9 @@ struct mtd_info *mtd_concat_create(struc
 		    concat->mtd.oobsize    !=  subdev[i]->oobsize ||
 		    !concat->mtd.read_oob  != !subdev[i]->read_oob ||
 		    !concat->mtd.write_oob != !subdev[i]->write_oob) {
-			kfree(concat);
 			printk("Incompatible OOB or ECC data on \"%s\"\n",
 			       subdev[i]->name);
-			return NULL;
+			goto error;
 		}
 		concat->subdev[i] = subdev[i];
 
@@ -903,11 +904,10 @@ struct mtd_info *mtd_concat_create(struc
 		    kmalloc(num_erase_region *
 			    sizeof (struct mtd_erase_region_info), GFP_KERNEL);
 		if (!erase_region_p) {
-			kfree(concat);
 			printk
 			    ("memory allocation error while creating erase region list"
 			     " for device \"%s\"\n", name);
-			return NULL;
+			goto error;
 		}
 
 		/*
@@ -968,6 +968,12 @@ struct mtd_info *mtd_concat_create(struc
 	}
 
 	return &concat->mtd;
+
+error:
+	if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+		bdi_destroy(&concat->backing_dev_info);
+	kfree(concat);
+	return NULL;
 }
 
 /*
@@ -977,6 +983,8 @@ struct mtd_info *mtd_concat_create(struc
 void mtd_concat_destroy(struct mtd_info *mtd)
 {
 	struct mtd_concat *concat = CONCAT(mtd);
+	if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+		bdi_destroy(&concat->backing_dev_info);
 	if (concat->mtd.numeraseregions)
 		kfree(concat->mtd.eraseregions);
 	kfree(concat);

-- 



* [PATCH 10/17] mm: scalable bdi statistics counters.
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: bdi_stat.patch --]
[-- Type: text/plain, Size: 4038 bytes --]

Provide scalable per backing_dev_info statistics counters.
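
As a worked example of the error bound below (numbers mine): on a 4-CPU
machine BDI_STAT_BATCH = 8 * (1 + ilog2(4)) = 8 * 3 = 24, so
bdi_stat_error() reports nr_cpu_ids * 24 = 96 pages as the maximal
drift between the cheap bdi_stat() read and the exact, but slower,
bdi_stat_sum().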

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |   96 +++++++++++++++++++++++++++++++++++++++++++-
 mm/backing-dev.c            |   21 +++++++++
 2 files changed, 115 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h	2007-05-10 10:21:53.000000000 +0200
+++ linux-2.6/include/linux/backing-dev.h	2007-05-10 10:23:26.000000000 +0200
@@ -8,6 +8,8 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
+#include <linux/log2.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +26,12 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+	NR_BDI_STAT_ITEMS
+};
+
+#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
@@ -32,14 +40,86 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline void bdi_init(struct backing_dev_info *bdi)
+void bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s32 amount)
+{
+	__percpu_counter_mod(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	__mod_bdi_stat(bdi, item, -1);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	s64 sum;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sum = __bdi_stat_sum(bdi, item);
+	local_irq_restore(flags);
+
+	return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+{
+#ifdef CONFIG_SMP
+	return nr_cpu_ids * BDI_STAT_BATCH;
+#else
+	return 1;
+#endif
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c	2007-05-10 10:21:46.000000000 +0200
+++ linux-2.6/mm/backing-dev.c	2007-05-10 10:23:08.000000000 +0200
@@ -5,6 +5,24 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+void bdi_init(struct backing_dev_info *bdi)
+{
+	int i;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+	int i;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])

-- 



* [PATCH 11/17] mm: count reclaimable pages per BDI
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: bdi_stat_reclaimable.patch --]
[-- Type: text/plain, Size: 4077 bytes --]

Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                 |    2 ++
 fs/nfs/write.c              |    7 +++++++
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    4 ++++
 mm/truncate.c               |    2 ++
 5 files changed, 16 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -705,6 +705,8 @@ static int __set_page_dirty(struct page 
 
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -829,6 +829,8 @@ int __set_page_dirty_nobuffers(struct pa
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
@@ -963,6 +965,8 @@ int clear_page_dirty_for_io(struct page 
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -72,6 +72,8 @@ void cancel_dirty_page(struct page *page
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -463,6 +463,7 @@ nfs_mark_request_commit(struct nfs_page 
 	set_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 	spin_unlock(&nfsi->req_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -549,6 +550,8 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		nfs_inode_remove_request(req);
@@ -1211,6 +1214,8 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
@@ -1236,6 +1241,8 @@ static void nfs_commit_done(struct rpc_t
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 
 		dprintk("NFS: commit (%s/%Ld %d@%Ld)",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -27,6 +27,7 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
+	BDI_RECLAIMABLE,
 	NR_BDI_STAT_ITEMS
 };
 

-- 



* [PATCH 12/17] mm: count writeback pages per BDI
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: bdi_stat_writeback.patch --]
[-- Type: text/plain, Size: 1931 bytes --]

Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |   12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -981,14 +981,18 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -1004,14 +1008,18 @@ int test_set_page_writeback(struct page 
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -28,6 +28,7 @@ typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
+	BDI_WRITEBACK,
 	NR_BDI_STAT_ITEMS
 };
 

-- 



* [PATCH 13/17] mm: expose BDI statistics in sysfs.
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: bdi_stat_sysfs.patch --]
[-- Type: text/plain, Size: 1964 bytes --]

Expose the per BDI stats in /sys/block/<dev>/queue/*
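
Note the shift by (PAGE_CACHE_SHIFT - 10) in the show functions below:
it converts a page count to KB, matching the _kb suffix of the
attribute names. As a worked example (numbers mine), with 4 KB pages
that is a shift by 2, so 100 reclaimable pages read back as 400.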

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3977,6 +3977,23 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_reclaimable =
+		bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE);
+
+	return sprintf(page, "%llu\n",
+			nr_reclaimable >> (PAGE_CACHE_SHIFT - 10));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_writeback =
+		bdi_stat(&q->backing_dev_info, BDI_WRITEBACK);
+
+	return sprintf(page, "%llu\n",
+			nr_writeback >> (PAGE_CACHE_SHIFT - 10));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4001,6 +4018,16 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+	.attr = {.name = "reclaimable_kb", .mode = S_IRUGO },
+	.show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_kb", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4012,6 +4039,8 @@ static struct attribute *default_attrs[]
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
+	&queue_reclaimable_entry.attr,
+	&queue_writeback_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };

-- 



* [PATCH 14/17] lib: floating proportions
From: Peter Zijlstra @ 2007-06-14 21:58 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: proportions.patch --]
[-- Type: text/plain, Size: 10074 bytes --]

Given a set of objects, floating proportions aims to efficiently give the
proportional 'activity' of a single item as compared to the whole set,
where 'activity' is a measure of a temporal property of the items.

It is efficient in that it need not inspect any other item of the set in
order to provide the answer, nor does it even need to know how many other
items there are.

It has one parameter: the period of 'time' over which the 'activity' is
measured.
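
As a sanity check of that scheme, here is a minimal userspace C model
(my illustration, not the kernel code) of the iterative form described
in the comments below: every event bumps the item's count and the
global count, every elapsed period halves all counts, and p_j = x_j / t
then tracks recent activity:

  #include <stdio.h>

  #define PERIOD 16   /* events per decay period */

  static unsigned long x[2];  /* per-item event counts */
  static unsigned long t;     /* global event count */

  static void prop_event(int j)
  {
      x[j]++;
      if (++t > PERIOD) {     /* period expired: decay the history */
          t /= 2;
          x[0] /= 2;
          x[1] /= 2;
      }
  }

  static void report(void)
  {
      printf("p0=%.2f p1=%.2f\n", (double)x[0] / t, (double)x[1] / t);
  }

  int main(void)
  {
      int i;

      for (i = 0; i < 100; i++)   /* item 0 is busy ... */
          prop_event(0);
      report();                   /* p0 near 1, p1 = 0 */

      for (i = 0; i < 100; i++)   /* ... then item 1 takes over */
          prop_event(1);
      report();                   /* p1 converges towards 1 */
      return 0;
  }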

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/proportions.h |   81 +++++++++++++
 lib/Makefile                |    3 
 lib/proportions.c           |  258 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 341 insertions(+), 1 deletion(-)

Index: linux-2.6/lib/proportions.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/proportions.c
@@ -0,0 +1,258 @@
+/*
+ * FLoating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * The floating proportion is a time derivative with an exponentially decaying
+ * history:
+ *
+ *   p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
+ *
+ * Where j is an element from {prop_local}, x_{j} is j's number of events,
+ * and i the time period over which the differential is taken. So d/dt_{-i} is
+ * the differential over the i-th last period.
+ *
+ * The decaying history gives smooth transitions. The time differential carries
+ * the notion of speed.
+ *
+ * The denominator is 2^(1+i) because we want the series to be normalised, ie.
+ *
+ *   \Sum_{i=0} 1/2^(1+i) = 1
+ *
+ * Furthermore, if we measure time (t) in the same events as x, so that:
+ *
+ *   t = \Sum_{j} x_{j}
+ *
+ * we get that:
+ *
+ *   \Sum_{j} p_{j} = 1
+ *
+ * Writing this in an iterative fashion we get (dropping the 'd's):
+ *
+ *   if (++x_{j}, ++t > period)
+ *     t /= 2;
+ *     for_each (j)
+ *       x_{j} /= 2;
+ *
+ * so that:
+ *
+ *   p_{j} = x_{j} / t;
+ *
+ * We optimize away the '/= 2' for the global time delta by noting that:
+ *
+ *   if (++t > period) t /= 2;
+ *
+ * can be approximated by:
+ *
+ *   period/2 + (++t % period/2)
+ *
+ * [ Furthermore, when we choose period to be 2^n it can be written in terms of
+ *   binary operations and wraparound artefacts disappear. ]
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   c = t / (period/2)
+ *
+ * [ Its monotonically increasing property can be used to mitigate the
+ *   wraparound issue. ]
+ *
+ * This allows us to do away with the loop over all prop_locals on each period
+ * expiration. By remembering the period count under which it was last accessed
+ * as c_{j}, we can obtain the number of 'missed' cycles from:
+ *
+ *   c - c_{j}
+ *
+ * We can then lazily catch up to the global period count every time we are
+ * going to use x_{j}, by doing:
+ *
+ *   x_{j} /= 2^(c - c_{j}), c_{j} = c
+ */
+
+#include <linux/proportions.h>
+#include <linux/rcupdate.h>
+
+void prop_descriptor_init(struct prop_descriptor *pd, int shift)
+{
+	pd->index = 0;
+	pd->pg[0].shift = shift;
+	percpu_counter_init_irq(&pd->pg[0].events, 0);
+	percpu_counter_init_irq(&pd->pg[1].events, 0);
+	mutex_init(&pd->mutex);
+}
+
+/*
+ * We have two copies, and flip between them to make it seem like an atomic
+ * update. The update is not really atomic wrt the events counter, but
+ * it is internally consistent with the bit layout depending on shift.
+ *
+ * We copy the events count, move the bits around and flip the index.
+ */
+void prop_change_shift(struct prop_descriptor *pd, int shift)
+{
+	int index;
+	int offset;
+	u64 events;
+	unsigned long flags;
+
+	mutex_lock(&pd->mutex);
+
+	index = pd->index ^ 1;
+	offset = pd->pg[pd->index].shift - shift;
+	if (!offset)
+		goto out;
+
+	pd->pg[index].shift = shift;
+
+	local_irq_save(flags);
+	events = percpu_counter_sum_signed(
+			&pd->pg[pd->index].events);
+	if (offset < 0)
+		events <<= -offset;
+	else
+		events >>= offset;
+	percpu_counter_set(&pd->pg[index].events, events);
+
+	/*
+	 * ensure the new pg is fully written before the switch
+	 */
+	smp_wmb();
+	pd->index = index;
+	local_irq_restore(flags);
+
+	synchronize_rcu();
+
+out:
+	mutex_unlock(&pd->mutex);
+}
+
+/*
+ * wrap the access to the data in an rcu_read_lock() section;
+ * this is used to track the active references.
+ */
+struct prop_global *prop_get_global(struct prop_descriptor *pd)
+{
+	int index;
+
+	rcu_read_lock();
+	index = pd->index;
+	/*
+	 * match the wmb from prop_change_shift()
+	 */
+	smp_rmb();
+	return &pd->pg[index];
+}
+
+void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg)
+{
+	rcu_read_unlock();
+}
+
+static void prop_adjust_shift(struct prop_local *pl, int new_shift)
+{
+	int offset = pl->shift - new_shift;
+
+	if (!offset)
+		return;
+
+	if (offset < 0)
+		pl->period <<= -offset;
+	else
+		pl->period >>= offset;
+
+	pl->shift = new_shift;
+}
+
+void prop_local_init(struct prop_local *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	percpu_counter_init_irq(&pl->events, 0);
+}
+
+void prop_local_destroy(struct prop_local *pl)
+{
+	percpu_counter_destroy(&pl->events);
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ *   until (c_{j} == c)
+ *     x_{j} -= x_{j}/2;
+ *     c_{j}++;
+ */
+void prop_norm(struct prop_global *pg,
+		struct prop_local *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(pl, pg->shift);
+	/*
+	 * For each missed period, we halve the local counter;
+	 * basically:
+	 *   pl->events >> (global_period - pl->period);
+	 *
+	 * but since the distributed nature of percpu counters makes division
+	 * rather hard, use a regular subtraction loop. This is safe, because
+	 * the events will only ever be incremented, hence the subtraction
+	 * can never result in a negative number.
+	 */
+	while (pl->period != global_period) {
+		unsigned long val = percpu_counter_read(&pl->events);
+		unsigned long half = (val + 1) >> 1;
+
+		/*
+		 * Half of zero won't be much less, break out.
+		 * This limits the loop to shift iterations, even
+		 * if we missed a million.
+		 */
+		if (!val)
+			break;
+
+		/*
+		 * If shift > 32, half might exceed the limits of
+		 * the regular percpu_counter_mod().
+		 */
+		percpu_counter_mod64(&pl->events, -half);
+		pl->period += period;
+	}
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+		long *numerator, long *denominator)
+{
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm(pg, pl);
+	*numerator = percpu_counter_read_positive(&pl->events);
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+}
+
Index: linux-2.6/include/linux/proportions.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/proportions.h
@@ -0,0 +1,81 @@
+/*
+ * Floating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_PROPORTIONS_H
+#define _LINUX_PROPORTIONS_H
+
+#include <linux/percpu_counter.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+struct prop_global {
+	/*
+	 * The period over which we differentiate
+	 *
+	 *   period = 2^shift
+	 */
+	int shift;
+	/*
+	 * The total event counter aka 'time'.
+	 *
+	 * Treated as an unsigned long; the lower 'shift - 1' bits are the
+	 * counter bits, the remaining upper bits the period counter.
+	 */
+	struct percpu_counter events;
+};
+
+/*
+ * global proportion descriptor
+ *
+ * this is needed to consistently flip prop_global structures.
+ */
+struct prop_descriptor {
+	int index;
+	struct prop_global pg[2];
+	struct mutex mutex;		/* serialize the prop_global switch */
+};
+
+void prop_descriptor_init(struct prop_descriptor *pd, int shift);
+void prop_change_shift(struct prop_descriptor *pd, int new_shift);
+struct prop_global *prop_get_global(struct prop_descriptor *pd);
+void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg);
+
+struct prop_local {
+	/*
+	 * the local events counter
+	 */
+	struct percpu_counter events;
+
+	/*
+	 * snapshot of the last seen global state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+void prop_local_init(struct prop_local *pl);
+void prop_local_destroy(struct prop_local *pl);
+
+void prop_norm(struct prop_global *pg, struct prop_local *pl);
+
+/*
+ *   ++x_{j}, ++t
+ */
+static inline
+void __prop_inc(struct prop_global *pg, struct prop_local *pl)
+{
+	prop_norm(pg, pl);
+	percpu_counter_mod(&pl->events, 1);
+	percpu_counter_mod(&pg->events, 1);
+}
+
+void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+		long *numerator, long *denominator);
+
+#endif /* _LINUX_PROPORTIONS_H */
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -5,7 +5,8 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o \
 	 idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o reciprocal_div.o argv_split.o
+	 sha1.o irq_regs.o reciprocal_div.o argv_split.o \
+	 proportions.o
 
 lib-$(CONFIG_MMU) += ioremap.o pagewalk.o
 lib-$(CONFIG_SMP) += cpumask.o

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 15/17] lib: floating proportions _single
  2007-06-14 21:58 [PATCH 00/17] per device dirty throttling -v7 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2007-06-14 21:58 ` [PATCH 14/17] lib: floating proportions Peter Zijlstra
@ 2007-06-14 21:58 ` Peter Zijlstra
  2007-06-14 21:58 ` [PATCH 16/17] mm: per device dirty threshold Peter Zijlstra
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2007-06-14 21:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: proportions_single.patch --]
[-- Type: text/plain, Size: 9457 bytes --]

Provide a prop_local that does not use a percpu variable for its counter.
This is useful for items that are not (or only infrequently) accessed from
multiple contexts and/or are plentiful enough that the percpu_counter
overhead would hurt (e.g. tasks).
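
As a rough sketch (illustrative declarations only), the type-dispatching
glue below lets callers use one set of calls for both flavours:

  struct prop_local_percpu completions;	/* hot, updated from many CPUs */
  struct prop_local_single dirties;	/* plentiful, mostly single-context */

  void example_init(void)
  {
  	/* each expands to the matching variant via TYPE_EQUAL() */
  	prop_local_init(&completions);	/* -> prop_local_init_percpu() */
  	prop_local_init(&dirties);	/* -> prop_local_init_single() */
  }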

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/proportions.h |  112 +++++++++++++++++++++++++++++++++++++--
 lib/proportions.c           |  124 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 218 insertions(+), 18 deletions(-)

Index: linux-2.6/include/linux/proportions.h
===================================================================
--- linux-2.6.orig/include/linux/proportions.h
+++ linux-2.6/include/linux/proportions.h
@@ -45,7 +45,11 @@ void prop_change_shift(struct prop_descr
 struct prop_global *prop_get_global(struct prop_descriptor *pd);
 void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg);
 
-struct prop_local {
+/*
+ * ----- PERCPU ------
+ */
+
+struct prop_local_percpu {
 	/*
 	 * the local events counter
 	 */
@@ -59,23 +63,117 @@ struct prop_local {
 	spinlock_t lock;		/* protect the snapshot state */
 };
 
-void prop_local_init(struct prop_local *pl);
-void prop_local_destroy(struct prop_local *pl);
+void prop_local_init_percpu(struct prop_local_percpu *pl);
+void prop_local_destroy_percpu(struct prop_local_percpu *pl);
 
-void prop_norm(struct prop_global *pg, struct prop_local *pl);
+void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl);
 
 /*
  *   ++x_{j}, ++t
  */
 static inline
-void __prop_inc(struct prop_global *pg, struct prop_local *pl)
+void __prop_inc_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
 {
-	prop_norm(pg, pl);
+	prop_norm_percpu(pg, pl);
 	percpu_counter_mod(&pl->events, 1);
 	percpu_counter_mod(&pg->events, 1);
 }
 
-void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+void prop_fraction_percpu(struct prop_global *pg, struct prop_local_percpu *pl,
+		long *numerator, long *denominator);
+
+/*
+ * ----- SINGLE ------
+ */
+
+struct prop_local_single {
+	/*
+	 * the local events counter
+	 */
+	unsigned long events;
+
+	/*
+	 * snapshot of the last seen global state
+	 * and a lock protecting this state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+void prop_local_init_single(struct prop_local_single *pl);
+void prop_local_destroy_single(struct prop_local_single *pl);
+
+void prop_norm_single(struct prop_global *pg, struct prop_local_single *pl);
+
+/*
+ *   ++x_{j}, ++t
+ */
+static inline
+void __prop_inc_single(struct prop_global *pg, struct prop_local_single *pl)
+{
+	prop_norm_single(pg, pl);
+	pl->events++;
+	percpu_counter_mod(&pg->events, 1);
+}
+
+void prop_fraction_single(struct prop_global *pg, struct prop_local_single *pl,
 		long *numerator, long *denominator);
 
+/*
+ * ----- GLUE ------
+ */
+
+#undef TYPE_EQUAL
+#define TYPE_EQUAL(expr, type) \
+	__builtin_types_compatible_p(typeof(expr), type)
+
+extern int __bad_prop_local(void);
+
+#define prop_local_init(prop_local)					\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		prop_local_init_percpu(					\
+			(struct prop_local_percpu *)(prop_local));	\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		prop_local_init_single(					\
+			(struct prop_local_single *)(prop_local));	\
+	else __bad_prop_local();					\
+} while (0)
+
+#define prop_local_destroy(prop_local)					\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		prop_local_destroy_percpu(				\
+			(struct prop_local_percpu *)(prop_local));	\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		prop_local_destroy_single(				\
+			(struct prop_local_single *)(prop_local));	\
+	else __bad_prop_local();					\
+} while (0)
+
+#define __prop_inc(prop_global, prop_local)				\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		__prop_inc_percpu(prop_global,				\
+			(struct prop_local_percpu *)(prop_local)); 	\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		__prop_inc_single(prop_global,				\
+			(struct prop_local_single *)(prop_local)); 	\
+	else __bad_prop_local();					\
+} while (0)
+
+#define prop_fraction(prop_global, prop_local, num, denom)		\
+do {									\
+	if (TYPE_EQUAL(*(prop_local), struct prop_local_percpu))	\
+		prop_fraction_percpu(prop_global,			\
+			(struct prop_local_percpu *)(prop_local),	\
+			num, denom);					\
+	else if (TYPE_EQUAL(*(prop_local), struct prop_local_single))	\
+		prop_fraction_single(prop_global,			\
+			(struct prop_local_single *)(prop_local),	\
+			num, denom);					\
+	else __bad_prop_local();					\
+} while (0)
+
 #endif /* _LINUX_PROPORTIONS_H */
Index: linux-2.6/lib/proportions.c
===================================================================
--- linux-2.6.orig/lib/proportions.c
+++ linux-2.6/lib/proportions.c
@@ -149,22 +149,31 @@ void prop_put_global(struct prop_descrip
 	rcu_read_unlock();
 }
 
-static void prop_adjust_shift(struct prop_local *pl, int new_shift)
+static void
+__prop_adjust_shift(int *pl_shift, unsigned long *pl_period, int new_shift)
 {
-	int offset = pl->shift - new_shift;
+	int offset = *pl_shift - new_shift;
 
 	if (!offset)
 		return;
 
 	if (offset < 0)
-		pl->period <<= -offset;
+		*pl_period <<= -offset;
 	else
-		pl->period >>= offset;
+		*pl_period >>= offset;
 
-	pl->shift = new_shift;
+	*pl_shift = new_shift;
 }
 
-void prop_local_init(struct prop_local *pl)
+#define prop_adjust_shift(prop_local, pg_shift)			\
+	__prop_adjust_shift(&(prop_local)->shift,		\
+			    &(prop_local)->period, pg_shift)
+
+/*
+ * PERCPU
+ */
+
+void prop_local_init_percpu(struct prop_local_percpu *pl)
 {
 	spin_lock_init(&pl->lock);
 	pl->shift = 0;
@@ -172,7 +181,7 @@ void prop_local_init(struct prop_local *
 	percpu_counter_init_irq(&pl->events, 0);
 }
 
-void prop_local_destroy(struct prop_local *pl)
+void prop_local_destroy_percpu(struct prop_local_percpu *pl)
 {
 	percpu_counter_destroy(&pl->events);
 }
@@ -184,8 +193,7 @@ void prop_local_destroy(struct prop_loca
  *     x_{j} -= x_{j}/2;
  *     c_{j}++;
  */
-void prop_norm(struct prop_global *pg,
-		struct prop_local *pl)
+void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
 {
 	unsigned long period = 1UL << (pg->shift - 1);
 	unsigned long period_mask = ~(period - 1);
@@ -242,17 +250,111 @@ void prop_norm(struct prop_global *pg,
  *
  *   p_{j} = x_{j} / (period/2 + t % period/2)
  */
-void prop_fraction(struct prop_global *pg, struct prop_local *pl,
+void prop_fraction_percpu(struct prop_global *pg, struct prop_local_percpu *pl,
 		long *numerator, long *denominator)
 {
 	unsigned long period_2 = 1UL << (pg->shift - 1);
 	unsigned long counter_mask = period_2 - 1;
 	unsigned long global_count;
 
-	prop_norm(pg, pl);
+	prop_norm_percpu(pg, pl);
 	*numerator = percpu_counter_read_positive(&pl->events);
 
 	global_count = percpu_counter_read(&pg->events);
 	*denominator = period_2 + (global_count & counter_mask);
 }
 
+/*
+ * SINGLE
+ */
+
+void prop_local_init_single(struct prop_local_single *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	pl->events = 0;
+}
+
+void prop_local_destroy_single(struct prop_local_single *pl)
+{
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ *   until (c_{j} == c)
+ *     x_{j} -= x_{j}/2;
+ *     c_{j}++;
+ */
+void prop_norm_single(struct prop_global *pg, struct prop_local_single *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(pl, pg->shift);
+	/*
+	 * For each missed period, we halve the local counter;
+	 * basically:
+	 *   pl->events >> (global_period - pl->period);
+	 *
+	 * here the counter is a plain unsigned long, but keep the same
+	 * subtraction loop for symmetry with the percpu variant. This is
+	 * safe, because the events will only ever be incremented, hence
+	 * the subtraction can never result in a negative number.
+	 */
+	while (pl->period != global_period) {
+		unsigned long val = pl->events;
+		unsigned long half = (val + 1) >> 1;
+
+		/*
+		 * Half of zero won't be much less, break out.
+		 * This limits the loop to shift iterations, even
+		 * if we missed a million.
+		 */
+		if (!val)
+			break;
+
+		/*
+		 * Unlike the percpu variant there is no counter-mod
+		 * width limit here; events is a plain unsigned long.
+		 */
+		pl->events -= half;
+		pl->period += period;
+	}
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction_single(struct prop_global *pg, struct prop_local_single *pl,
+		long *numerator, long *denominator)
+{
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm_single(pg, pl);
+	*numerator = pl->events;
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+}
+

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 16/17] mm: per device dirty threshold
  2007-06-14 21:58 [PATCH 00/17] per device dirty throttling -v7 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2007-06-14 21:58 ` [PATCH 15/17] lib: floating proportions _single Peter Zijlstra
@ 2007-06-14 21:58 ` Peter Zijlstra
  2007-06-14 21:58 ` [PATCH 17/17] mm: dirty balancing for tasks Peter Zijlstra
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2007-06-14 21:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 13883 bytes --]

Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds, a number of problems we currently
have will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit
across the BDIs fairly and efficiently. A BDI that has a large dirty limit
but does not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the BDIs based on
writeback completions. This way faster/more active devices get a larger
share than slower/idle devices.
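
As a back-of-the-envelope illustration (standalone userspace sketch, the
numbers are invented), the per-BDI threshold is just the global limit
scaled by that BDI's completion fraction:

  #include <stdio.h>

  /* mirrors the bdi_dirty computation in get_dirty_limits() below */
  static long bdi_share(long dirty, long numerator, long denominator)
  {
  	long long bdi_dirty = dirty;

  	bdi_dirty *= numerator;
  	bdi_dirty /= denominator;	/* stands in for do_div() */
  	return (long)bdi_dirty;
  }

  int main(void)
  {
  	/* 1000 page global limit; the disk does ~90% of the completions */
  	printf("disk: %ld pages\n", bdi_share(1000, 9, 10));	/* 900 */
  	printf("usb:  %ld pages\n", bdi_share(1000, 1, 10));	/* 100 */
  	return 0;
  }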

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    4 
 kernel/sysctl.c             |    5 -
 mm/backing-dev.c            |    5 +
 mm/page-writeback.c         |  215 +++++++++++++++++++++++++++++++++++++-------
 4 files changed, 195 insertions(+), 34 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -10,6 +10,7 @@
 
 #include <linux/percpu_counter.h>
 #include <linux/log2.h>
+#include <linux/proportions.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -44,6 +45,9 @@ struct backing_dev_info {
 	void *unplug_io_data;
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	struct prop_local_percpu completions;
+	int dirty_exceeded;
 };
 
 void bdi_init(struct backing_dev_info *bdi);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -2,6 +2,7 @@
  * mm/page-writeback.c
  *
  * Copyright (C) 2002, Linus Torvalds.
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
  *
  * Contains functions related to writing back dirty pages at the
  * address_space level.
@@ -49,8 +50,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +102,106 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by keeping a floating proportion between BDIs, based on page
+ * writeback completions [end_page_writeback()]. Those devices that write out
+ * pages fastest will get the larger share, while the slower will get a smaller
+ * share.
+ *
+ * We use page writeout completions because we are interested in getting rid of
+ * dirty pages. Having them written out is the primary goal.
+ *
+ * We introduce a concept of time, a period over which we measure these events,
+ * because demand can/will vary over time. The length of this period itself is
+ * measured in page writeback completions.
+ *
+ */
+static struct prop_descriptor vm_completions;
+
+static unsigned long determine_dirtyable_memory(void);
+
+/*
+ * couple the period to the dirty_ratio:
+ *
+ *   period/2 ~ roundup_pow_of_two(dirty limit)
+ */
+static int calc_period_shift(void)
+{
+	unsigned long dirty_total;
+
+	dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+	return 2 + ilog2(dirty_total - 1);
+}
+
+/*
+ * update the period when the dirty ratio changes.
+ */
+int dirty_ratio_handler(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_ratio = vm_dirty_ratio;
+	int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
+		int shift = calc_period_shift();
+		prop_change_shift(&vm_completions, shift);
+	}
+	return ret;
+}
+
+/*
+ * Increment the BDI's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static void __bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	struct prop_global *pg = prop_get_global(&vm_completions);
+	__prop_inc(pg, &bdi->completions);
+	prop_put_global(&vm_completions, pg);
+}
+
+/*
+ * Obtain an accurate fraction of the BDI's portion.
+ */
+static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator)
+{
+	if (bdi_cap_writeback_dirty(bdi)) {
+		struct prop_global *pg = prop_get_global(&vm_completions);
+		prop_fraction(pg, &bdi->completions, numerator, denominator);
+		prop_put_global(&vm_completions, pg);
+	} else {
+		*numerator = 0;
+		*denominator = 1;
+	}
+}
+
+/*
+ * Clip the earned share of dirty pages to that which is actually available.
+ * This avoids exceeding the total dirty_limit when the floating averages
+ * fluctuate too quickly.
+ */
+static void
+clip_bdi_dirty_limit(struct backing_dev_info *bdi, long dirty, long *pbdi_dirty)
+{
+	long avail_dirty;
+
+	avail_dirty = dirty -
+		(global_page_state(NR_FILE_DIRTY) +
+		 global_page_state(NR_WRITEBACK) +
+		 global_page_state(NR_UNSTABLE_NFS));
+
+	if (avail_dirty < 0)
+		avail_dirty = 0;
+
+	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
+		bdi_stat(bdi, BDI_WRITEBACK);
+
+	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
+}
+
+/*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
  *
@@ -158,8 +257,8 @@ static unsigned long determine_dirtyable
 }
 
 static void
-get_dirty_limits(long *pbackground, long *pdirty,
-					struct address_space *mapping)
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		 struct backing_dev_info *bdi)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
@@ -193,6 +292,22 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (bdi) {
+		long long bdi_dirty = dirty;
+		long numerator, denominator;
+
+		/*
+		 * Calculate this BDI's share of the dirty ratio.
+		 */
+		bdi_writeout_fraction(bdi, &numerator, &denominator);
+
+		bdi_dirty *= numerator;
+		do_div(bdi_dirty, denominator);
+
+		*pbdi_dirty = bdi_dirty;
+		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+	}
 }
 
 /*
@@ -204,9 +319,11 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long nr_reclaimable;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -221,15 +338,15 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, bdi);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 				break;
 
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
+		if (!bdi->dirty_exceeded)
+			bdi->dirty_exceeded = 1;
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
@@ -237,16 +354,37 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, bdi);
+
+			/*
+			 * In order to avoid the stacked BDI deadlock we need
+			 * to ensure we accurately count the 'dirty' pages when
+			 * the threshold is low.
+			 *
+			 * Otherwise it would be possible to get thresh+n pages
+			 * reported dirty, even though there are thresh-m pages
+			 * actually dirty; with m+n sitting in the percpu
+			 * deltas.
+			 */
+			if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+				bdi_nr_reclaimable =
+					bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat_sum(bdi, BDI_WRITEBACK);
+			} else {
+				bdi_nr_reclaimable =
+					bdi_stat(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat(bdi, BDI_WRITEBACK);
+			}
+
+			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -254,9 +392,9 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
-			dirty_exceeded = 0;
+	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
+			bdi->dirty_exceeded)
+		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
 		return;		/* pdflush is already working this queue */
@@ -270,7 +408,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS)
+					  > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -306,7 +446,7 @@ void balance_dirty_pages_ratelimited_nr(
 	unsigned long *p;
 
 	ratelimit = ratelimit_pages;
-	if (dirty_exceeded)
+	if (mapping->backing_dev_info->dirty_exceeded)
 		ratelimit = 8;
 
 	/*
@@ -342,7 +482,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -377,7 +517,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -479,11 +619,13 @@ int dirty_writeback_centisecs_handler(ct
 		struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
 {
 	proc_dointvec_userhz_jiffies(table, write, file, buffer, length, ppos);
-	if (dirty_writeback_interval) {
-		mod_timer(&wb_timer,
-			jiffies + dirty_writeback_interval);
+	if (write) {
+		if (dirty_writeback_interval) {
+			mod_timer(&wb_timer,
+					jiffies + dirty_writeback_interval);
 		} else {
-		del_timer(&wb_timer);
+			del_timer(&wb_timer);
+		}
 	}
 	return 0;
 }
@@ -582,9 +724,14 @@ static struct notifier_block __cpuinitda
  */
 void __init page_writeback_init(void)
 {
+	int shift;
+
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+
+	shift = calc_period_shift();
+	prop_descriptor_init(&vm_completions, shift);
 }
 
 /**
@@ -990,8 +1137,10 @@ int test_clear_page_writeback(struct pag
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_writeback_dirty(bdi))
+			if (bdi_cap_writeback_dirty(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+				__bdi_writeout_inc(bdi);
+			}
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -11,6 +11,9 @@ void bdi_init(struct backing_dev_info *b
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
+
+	bdi->dirty_exceeded = 0;
+	prop_local_init(&bdi->completions);
 }
 EXPORT_SYMBOL(bdi_init);
 
@@ -20,6 +23,8 @@ void bdi_destroy(struct backing_dev_info
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+	prop_local_destroy(&bdi->completions);
 }
 EXPORT_SYMBOL(bdi_destroy);
 
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -163,6 +163,9 @@ extern ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+extern int dirty_ratio_handler(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos);
 
 #ifdef CONFIG_PROVE_LOCKING
 extern int prove_locking;
@@ -772,7 +775,7 @@ static ctl_table vm_table[] = {
 		.data		= &vm_dirty_ratio,
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
+		.proc_handler	= &dirty_ratio_handler,
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 17/17] mm: dirty balancing for tasks
  2007-06-14 21:58 [PATCH 00/17] per device dirty throttling -v7 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2007-06-14 21:58 ` [PATCH 16/17] mm: per device dirty threshold Peter Zijlstra
@ 2007-06-14 21:58 ` Peter Zijlstra
  2007-06-14 23:14 ` [PATCH 00/17] per device dirty throttling -v7 Andrew Morton
  2007-07-17 10:10 ` Miklos Szeredi
  18 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2007-06-14 21:58 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, andrea

[-- Attachment #1: dirty_pages2.patch --]
[-- Type: text/plain, Size: 5109 bytes --]

Based on ideas of Andrew:
  http://marc.info/?l=linux-kernel&m=102912915020543&w=2

Scale the bdi dirty limit inversely with the task's dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer.

Andrea proposed something similar:
  http://lwn.net/Articles/152277/

The main disadvantage of his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload-dependent tunable. Other than
that, the two approaches appear quite similar.
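
For a feel of the numbers (standalone userspace sketch, values invented):
the scaling used here, dirty -= (dirty/2) * p_task, never drops a task
below half the unscaled limit:

  #include <stdio.h>

  /* mirrors task_dirty_limit() in the patch, floor at dirty/2 */
  static long task_limit(long dirty, long numerator, long denominator)
  {
  	long long inv = dirty >> 1;		/* dirty/2 */
  	long scaled;

  	inv = inv * numerator / denominator;	/* stands in for do_div() */
  	scaled = dirty - (long)inv;
  	return scaled < dirty / 2 ? dirty / 2 : scaled;
  }

  int main(void)
  {
  	/* 1000 page bdi limit: heavy writer (80%) vs occasional (5%) */
  	printf("heavy:      %ld pages\n", task_limit(1000, 4, 5));	/* 600 */
  	printf("occasional: %ld pages\n", task_limit(1000, 1, 20));	/* 975 */
  	return 0;
  }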

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |    2 +
 kernel/exit.c         |    1 
 kernel/fork.c         |    1 
 mm/page-writeback.c   |   56 +++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 59 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -84,6 +84,7 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/proportions.h>
 
 #include <asm/processor.h>
 
@@ -1151,6 +1152,7 @@ struct task_struct {
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
+	struct prop_local_single dirties;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6/kernel/exit.c
===================================================================
--- linux-2.6.orig/kernel/exit.c
+++ linux-2.6/kernel/exit.c
@@ -162,6 +162,7 @@ repeat:
 	ptrace_unlink(p);
 	BUG_ON(!list_empty(&p->ptrace_list) || !list_empty(&p->ptrace_children));
 	__exit_signal(p);
+	prop_local_destroy(&p->dirties);
 
 	/*
 	 * If we are the last non-leader member of the thread
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -192,6 +192,7 @@ static struct task_struct *dup_task_stru
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
+	prop_local_init(&tsk->dirties);
 	return tsk;
 }
 
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -118,6 +118,7 @@ static void background_writeout(unsigned
  *
  */
 static struct prop_descriptor vm_completions;
+static struct prop_descriptor vm_dirties;
 
 static unsigned long determine_dirtyable_memory(void);
 
@@ -146,6 +147,7 @@ int dirty_ratio_handler(ctl_table *table
 	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
 		int shift = calc_period_shift();
 		prop_change_shift(&vm_completions, shift);
+		prop_change_shift(&vm_dirties, shift);
 	}
 	return ret;
 }
@@ -161,6 +163,16 @@ static void __bdi_writeout_inc(struct ba
 	prop_put_global(&vm_completions, pg);
 }
 
+static void task_dirty_inc(struct task_struct *tsk)
+{
+	unsigned long flags;
+	struct prop_global *pg = prop_get_global(&vm_dirties);
+	local_irq_save(flags);
+	__prop_inc(pg, &tsk->dirties);
+	local_irq_restore(flags);
+	prop_put_global(&vm_dirties, pg);
+}
+
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */
@@ -201,6 +213,38 @@ clip_bdi_dirty_limit(struct backing_dev_
 	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
 }
 
+void task_dirties_fraction(struct task_struct *tsk,
+		long *numerator, long *denominator)
+{
+	struct prop_global *pg = prop_get_global(&vm_dirties);
+	prop_fraction(pg, &tsk->dirties, numerator, denominator);
+	prop_put_global(&vm_dirties, pg);
+}
+
+/*
+ * scale the dirty limit
+ *
+ * task specific dirty limit:
+ *
+ *   dirty -= (dirty/2) * p_{t}
+ */
+void task_dirty_limit(struct task_struct *tsk, long *pdirty)
+{
+	long numerator, denominator;
+	long dirty = *pdirty;
+	long long inv = dirty >> 1;
+
+	task_dirties_fraction(tsk, &numerator, &denominator);
+	inv *= numerator;
+	do_div(inv, denominator);
+
+	dirty -= inv;
+	if (dirty < *pdirty/2)
+		dirty = *pdirty/2;
+
+	*pdirty = dirty;
+}
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -307,6 +351,7 @@ get_dirty_limits(long *pbackground, long
 
 		*pbdi_dirty = bdi_dirty;
 		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+		task_dirty_limit(current, pbdi_dirty);
 	}
 }
 
@@ -732,6 +777,7 @@ void __init page_writeback_init(void)
 
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
+	prop_descriptor_init(&vm_dirties, shift);
 }
 
 /**
@@ -1010,7 +1056,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
  */
-int fastcall set_page_dirty(struct page *page)
+static int __set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
@@ -1028,6 +1074,14 @@ int fastcall set_page_dirty(struct page 
 	}
 	return 0;
 }
+
+int fastcall set_page_dirty(struct page *page)
+{
+	int ret = __set_page_dirty(page);
+	if (ret)
+		task_dirty_inc(current);
+	return ret;
+}
 EXPORT_SYMBOL(set_page_dirty);
 
 /*

-- 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/17] per device dirty throttling -v7
  2007-06-14 21:58 [PATCH 00/17] per device dirty throttling -v7 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2007-06-14 21:58 ` [PATCH 17/17] mm: dirty balancing for tasks Peter Zijlstra
@ 2007-06-14 23:14 ` Andrew Morton
  2007-07-17 10:10 ` Miklos Szeredi
  18 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2007-06-14 23:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, neilb, dgc, tomoki.sekiyama.qu,
	a.p.zijlstra, nikita, trond.myklebust, yingchao.zhou, andrea

> On Thu, 14 Jun 2007 23:58:17 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Latest version of the per bdi dirty throttling patches.

Thanks.  I've got some travel coming up and will be rather intermittent and
laggy for a week or two.  I'll save this patchset for when the in-flight
movies get dull ;)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 00/17] per device dirty throttling -v7
  2007-06-14 21:58 [PATCH 00/17] per device dirty throttling -v7 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2007-06-14 23:14 ` [PATCH 00/17] per device dirty throttling -v7 Andrew Morton
@ 2007-07-17 10:10 ` Miklos Szeredi
  18 siblings, 0 replies; 23+ messages in thread
From: Miklos Szeredi @ 2007-07-17 10:10 UTC (permalink / raw)
  To: a.p.zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, a.p.zijlstra, nikita, trond.myklebust,
	yingchao.zhou, andrea

> Latest version of the per bdi dirty throttling patches.
> 
> Most of the changes since last time are little cleanups and more
> detail in the split out of the floating proportion into their
> own little lib.
> 
> Patches are against 2.6.22-rc4-mm2
> 
> A rollup of all this against 2.6.21 is available here:
>   http://programming.kicks-ass.net/kernel-patches/balance_dirty_pages/2.6.21-per_bdi_dirty_pages.patch
> 
> This patch-set passes the starve an USB stick test..

I've done some testing of several problem cases.

1) fuse writable mmap patches + bash_shared_mapping
2) writes in a setup involving a loop dev
  a) ext3 over loop over ext3
  b) ext3 over loop over fuse-passthrough over ext3
  c) ext3 over loop over ntfs-3g

Without the patch, in all the cases I've seen deadlocks or long
stalls.  With the patch, I could not reproduce this in any of the
cases.  As predicted, the patch is performing well in this respect :)

2a is the simplest to reproduce (2.6.22, dual core, 1GB ram)

  dd if=/dev/zero of=/tmp/p5 bs=1M seek=4999 count=1
  mkfs.ext3 -F /tmp/p5
  mkdir /tmp/m5
  mount -oloop /tmp/p5 /tmp/m5
  dd if=/dev/zero of=/tmp/m5/foo bs=1M count=4000

The second dd can stall for indefinite amounts of time.  Kicking it
with sync can get it moving, but it relapses after some time.

Even with the per-device-throttling patch, case 2 shows nr_dirty
elevated far above the 10% limit, reaching 40% or higher.  I believe
this is due to a missing balance_dirty_pages() call in the loop
device.  And indeed the anomaly can be solved by adding this patch:

  http://lkml.org/lkml/2007/3/24/101

Miklos

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/17] lib: percpu_count_sum_signed()
  2007-06-14 21:58 ` [PATCH 05/17] lib: percpu_count_sum_signed() Peter Zijlstra
@ 2007-07-17 16:32   ` Josef Sipek
  2007-07-17 16:35     ` Josef Sipek
  0 siblings, 1 reply; 23+ messages in thread
From: Josef Sipek @ 2007-07-17 16:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	andrea

On Thu, Jun 14, 2007 at 11:58:22PM +0200, Peter Zijlstra wrote:
> Provide an accurate version of percpu_counter_read.
> 
> Should we go and replace the current use of percpu_counter_sum()
> with percpu_counter_sum_positive(), and call this new primitive
> percpu_counter_sum() instead?
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/percpu_counter.h |   18 +++++++++++++++++-
>  lib/percpu_counter.c           |    6 +++---
>  2 files changed, 20 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/include/linux/percpu_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:37:54.000000000 +0200
> +++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:38:09.000000000 +0200
> @@ -35,7 +35,18 @@ void percpu_counter_destroy(struct percp
>  void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>  void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
>  void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
> -s64 percpu_counter_sum(struct percpu_counter *fbc);
> +s64 __percpu_counter_sum(struct percpu_counter *fbc);
> +
> +static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
> +{
> +	s64 ret = __percpu_counter_sum(fbc);
> +	return ret < 0 ? 0 : ret;

max(0, ret) maybe?

Josef 'Jeff' Sipek.

-- 
Real Programmers consider "what you see is what you get" to be just as bad a
concept in Text Editors as it is in women. No, the Real Programmer wants a
"you asked for it, you got it" text editor -- complicated, cryptic,
powerful, unforgiving, dangerous.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 06/17] lib: percpu_counter_init_irq
  2007-06-14 21:58 ` [PATCH 06/17] lib: percpu_counter_init_irq Peter Zijlstra
@ 2007-07-17 16:35   ` Josef Sipek
  0 siblings, 0 replies; 23+ messages in thread
From: Josef Sipek @ 2007-07-17 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	andrea

On Thu, Jun 14, 2007 at 11:58:23PM +0200, Peter Zijlstra wrote:
> provide a way to init percpu_counters that are supposed to be used from irq
> safe contexts.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  include/linux/percpu_counter.h |    4 ++++
>  lib/percpu_counter.c           |    8 ++++++++
>  2 files changed, 12 insertions(+)
> 
> Index: linux-2.6/include/linux/percpu_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu_counter.h
> +++ linux-2.6/include/linux/percpu_counter.h
> @@ -31,6 +31,8 @@ struct percpu_counter {
>  #endif
>  
>  void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
> +void percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount);
> +
>  void percpu_counter_destroy(struct percpu_counter *fbc);
>  void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
>  void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
> @@ -89,6 +91,8 @@ static inline void percpu_counter_init(s
>  	fbc->count = amount;
>  }
>  
> +#define percpu_counter_init_irq percpu_counter_init

Huh? I'm confused. You have prototypes for both, and now a #define?

Josef 'Jeff' Sipek.

-- 
Hegh QaQ law'
quvHa'ghach QaQ puS

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 05/17] lib: percpu_count_sum_signed()
  2007-07-17 16:32   ` Josef Sipek
@ 2007-07-17 16:35     ` Josef Sipek
  0 siblings, 0 replies; 23+ messages in thread
From: Josef Sipek @ 2007-07-17 16:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	andrea

On Tue, Jul 17, 2007 at 12:32:43PM -0400, Josef Sipek wrote:
> On Thu, Jun 14, 2007 at 11:58:22PM +0200, Peter Zijlstra wrote:
> > Provide an accurate version of percpu_counter_read.
> > 
> > Should we go and replace the current use of percpu_counter_sum()
> > with percpu_counter_sum_positive(), and call this new primitive
> > percpu_counter_sum() instead?
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  include/linux/percpu_counter.h |   18 +++++++++++++++++-
> >  lib/percpu_counter.c           |    6 +++---
> >  2 files changed, 20 insertions(+), 4 deletions(-)
> > 
> > Index: linux-2.6/include/linux/percpu_counter.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:37:54.000000000 +0200
> > +++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:38:09.000000000 +0200
> > @@ -35,7 +35,18 @@ void percpu_counter_destroy(struct percp
> >  void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
> >  void __percpu_counter_mod(struct percpu_counter *fbc, s32 amount, s32 batch);
> >  void __percpu_counter_mod64(struct percpu_counter *fbc, s64 amount, s32 batch);
> > -s64 percpu_counter_sum(struct percpu_counter *fbc);
> > +s64 __percpu_counter_sum(struct percpu_counter *fbc);
> > +
> > +static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
> > +{
> > +	s64 ret = __percpu_counter_sum(fbc);
> > +	return ret < 0 ? 0 : ret;
> 
> max(0, ret) maybe?
> 
> Josef 'Jeff' Sipek.

Ok, replying to email that's more than a month old may not be the best idea
:)

Josef 'Jeff' Sipek.

-- 
We have joy, we have fun, we have Linux on a Sun...

^ permalink raw reply	[flat|nested] 23+ messages in thread

Thread overview: 23+ messages
2007-06-14 21:58 [PATCH 00/17] per device dirty throttling -v7 Peter Zijlstra
2007-06-14 21:58 ` [PATCH 01/17] nfs: remove congestion_end() Peter Zijlstra
2007-06-14 21:58 ` [PATCH 02/17] lib: percpu_counter variable batch Peter Zijlstra
2007-06-14 21:58 ` [PATCH 03/17] lib: percpu_counter_mod64 Peter Zijlstra
2007-06-14 21:58 ` [PATCH 04/17] lib: percpu_counter_set Peter Zijlstra
2007-06-14 21:58 ` [PATCH 05/17] lib: percpu_count_sum_signed() Peter Zijlstra
2007-07-17 16:32   ` Josef Sipek
2007-07-17 16:35     ` Josef Sipek
2007-06-14 21:58 ` [PATCH 06/17] lib: percpu_counter_init_irq Peter Zijlstra
2007-07-17 16:35   ` Josef Sipek
2007-06-14 21:58 ` [PATCH 07/17] mm: bdi init hooks Peter Zijlstra
2007-06-14 21:58 ` [PATCH 08/17] containers: " Peter Zijlstra
2007-06-14 21:58 ` [PATCH 09/17] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
2007-06-14 21:58 ` [PATCH 10/17] mm: scalable bdi statistics counters Peter Zijlstra
2007-06-14 21:58 ` [PATCH 11/17] mm: count reclaimable pages per BDI Peter Zijlstra
2007-06-14 21:58 ` [PATCH 12/17] mm: count writeback " Peter Zijlstra
2007-06-14 21:58 ` [PATCH 13/17] mm: expose BDI statistics in sysfs Peter Zijlstra
2007-06-14 21:58 ` [PATCH 14/17] lib: floating proportions Peter Zijlstra
2007-06-14 21:58 ` [PATCH 15/17] lib: floating proportions _single Peter Zijlstra
2007-06-14 21:58 ` [PATCH 16/17] mm: per device dirty threshold Peter Zijlstra
2007-06-14 21:58 ` [PATCH 17/17] mm: dirty balancing for tasks Peter Zijlstra
2007-06-14 23:14 ` [PATCH 00/17] per device dirty throttling -v7 Andrew Morton
2007-07-17 10:10 ` Miklos Szeredi
