* [PATCH 00/23] per device dirty throttling -v9
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

Per device dirty throttling patches

These patches aim to improve balance_dirty_pages() and directly address three
issues:
  1) inter device starvation
  2) stacked device deadlocks
  3) inter process starvation

Issues 1 and 2 are a direct result of removing the global dirty limit and
using per-device dirty limits instead. With each device given its own dirty
limit, one device can no longer starve another, and the cyclic dependency on
the dirty limit is broken.

In order to efficiently distribute the dirty limit across the independent
devices, a floating proportion is used: each device is allocated a share of
the total limit proportional to its recent writeback activity.

Issue 3 is addressed by additionally scaling the dirty limit in proportion to
the current task's recent dirty rate.
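
As a rough sketch of the idea (not the patch code; bdi_writeout_fraction()
stands in for the floating-proportion machinery introduced later in this
series):

  static unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
                                       unsigned long dirty_total)
  {
          long numerator, denominator;

          /* this device's share of recently completed writeback */
          bdi_writeout_fraction(bdi, &numerator, &denominator);

          /* give it the same share of the global dirty limit */
          return dirty_total * numerator / denominator;
  }

A device doing most of the writeback thus gets most of the limit, while an
idle device's share decays towards zero.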

Changes since -v8:
 - cleanup of the proportion code
 - fix percpu_counter_add(&counter, -(unsigned long))
 - fix per task dirty rate code
 - fwd port to .23-rc2-mm2

--



* [PATCH 01/23] nfs: remove congestion_end()
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: nfs_congestion_fixup.patch --]
[-- Type: text/plain, Size: 1965 bytes --]

It's redundant; clear_bdi_congested() already wakes the waiters.
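
For reference, the tail of clear_bdi_congested() already does this wakeup
(paraphrased from memory of the 2.6.23-era mm/backing-dev.c, so treat the
details as approximate):

  void clear_bdi_congested(struct backing_dev_info *bdi, int rw)
  {
          enum bdi_state bit;
          wait_queue_head_t *wqh = &congestion_wqh[rw];

          bit = (rw == WRITE) ? BDI_write_congested : BDI_read_congested;
          clear_bit(bit, &bdi->state);
          smp_mb__after_clear_bit();
          if (waitqueue_active(wqh))
                  wake_up(wqh);        /* what congestion_end() duplicated */
  }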

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/nfs/write.c              |    5 ++---
 include/linux/backing-dev.h |    1 -
 mm/backing-dev.c            |   13 -------------
 3 files changed, 2 insertions(+), 17 deletions(-)

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -235,10 +235,8 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
 		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-		congestion_end(WRITE);
-	}
 }
 
 /*
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -93,7 +93,6 @@ static inline int bdi_rw_congested(struc
 void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -54,16 +54,3 @@ long congestion_wait(int rw, long timeou
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);

--



* [PATCH 02/23] lib: percpu_counter_add
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_add.patch --]
[-- Type: text/plain, Size: 6927 bytes --]

 s/percpu_counter_mod/percpu_counter_add/

Because it's a better name; _mod suggests a modulo operation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext2/balloc.c               |    6 +++---
 fs/ext2/ialloc.c               |    2 +-
 fs/ext3/balloc.c               |    4 ++--
 fs/ext3/resize.c               |    4 ++--
 fs/ext4/balloc.c               |    4 ++--
 fs/ext4/resize.c               |    4 ++--
 include/linux/percpu_counter.h |    8 ++++----
 lib/percpu_counter.c           |    4 ++--
 8 files changed, 18 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/ext2/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/balloc.c
+++ linux-2.6/fs/ext2/balloc.c
@@ -163,7 +163,7 @@ static int reserve_blocks(struct super_b
 			return 0;
 	}
 
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -count);
 	sb->s_dirt = 1;
 	return count;
 }
@@ -173,7 +173,7 @@ static void release_blocks(struct super_
 	if (count) {
 		struct ext2_sb_info *sbi = EXT2_SB(sb);
 
-		percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+		percpu_counter_add(&sbi->s_freeblocks_counter, count);
 		sb->s_dirt = 1;
 	}
 }
@@ -1402,7 +1402,7 @@ allocated:
 	}
 
 	group_adjust_blocks(sb, group_no, gdp, gdp_bh, -num);
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
 
 	mark_buffer_dirty(bitmap_bh);
 	if (sb->s_flags & MS_SYNCHRONOUS)
Index: linux-2.6/fs/ext2/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/ialloc.c
+++ linux-2.6/fs/ext2/ialloc.c
@@ -534,7 +534,7 @@ got:
 		goto fail;
 	}
 
-	percpu_counter_mod(&sbi->s_freeinodes_counter, -1);
+	percpu_counter_add(&sbi->s_freeinodes_counter, -1);
 	if (S_ISDIR(mode))
 		percpu_counter_inc(&sbi->s_dirs_counter);
 
Index: linux-2.6/fs/ext3/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/balloc.c
+++ linux-2.6/fs/ext3/balloc.c
@@ -609,7 +609,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1672,7 +1672,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext3/resize.c
===================================================================
--- linux-2.6.orig/fs/ext3/resize.c
+++ linux-2.6/fs/ext3/resize.c
@@ -884,9 +884,9 @@ int ext3_group_add(struct super_block *s
 		input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT3_INODES_PER_GROUP(sb));
 
 	ext3_journal_dirty_metadata(handle, sbi->s_sbh);
Index: linux-2.6/fs/ext4/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/balloc.c
+++ linux-2.6/fs/ext4/balloc.c
@@ -628,7 +628,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1697,7 +1697,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext4_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext4/resize.c
===================================================================
--- linux-2.6.orig/fs/ext4/resize.c
+++ linux-2.6/fs/ext4/resize.c
@@ -893,9 +893,9 @@ int ext4_group_add(struct super_block *s
 		input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT4_INODES_PER_GROUP(sb));
 
 	ext4_journal_dirty_metadata(handle, sbi->s_sbh);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -32,7 +32,7 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
+void percpu_counter_add(struct percpu_counter *fbc, s32 amount);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
@@ -71,7 +71,7 @@ static inline void percpu_counter_destro
 }
 
 static inline void
-percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+percpu_counter_add(struct percpu_counter *fbc, s32 amount)
 {
 	preempt_disable();
 	fbc->count += amount;
@@ -97,12 +97,12 @@ static inline s64 percpu_counter_sum(str
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
 {
-	percpu_counter_mod(fbc, 1);
+	percpu_counter_add(fbc, 1);
 }
 
 static inline void percpu_counter_dec(struct percpu_counter *fbc)
 {
-	percpu_counter_mod(fbc, -1);
+	percpu_counter_add(fbc, -1);
 }
 
 #endif /* _LINUX_PERCPU_COUNTER_H */
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -14,7 +14,7 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
 {
 	long count;
 	s32 *pcount;
@@ -32,7 +32,7 @@ void percpu_counter_mod(struct percpu_co
 	}
 	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(percpu_counter_add);
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate

--



* [PATCH 03/23] lib: percpu_counter_sub
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds,
	Hugh Dickins

[-- Attachment #1: percpu_counter_sub.patch --]
[-- Type: text/plain, Size: 3075 bytes --]

Hugh spotted that some code does:
  percpu_counter_add(&counter, -unsignedlong)

which, while the amount argument is of type s32, sort-of works thanks to
two's complement. However, once we change the type to s64 this breaks on
32-bit machines, because the integer promotion rules zero-extend the unsigned
number.

Provide percpu_counter_sub() to hide the s64 cast. That is:
  percpu_counter_sub(&counter, foo)
is equivalent to:
  percpu_counter_add(&counter, -(s64)foo);
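
A standalone illustration of the promotion hazard on a 32-bit machine
(made-up example using the kernel's s32/s64 types, not code from the tree):

  unsigned long foo = 5;

  s32 a = -foo;       /* -foo is 0xfffffffb; reinterpreted as s32: -5, "works" */
  s64 b = -foo;       /* -foo is still the 32-bit value 0xfffffffb, which
                         zero-extends to +4294967291 instead of -5 */
  s64 c = -(s64)foo;  /* cast before negating: -5 as intended */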

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hugh@veritas.com>
---
 fs/ext2/balloc.c               |    4 ++--
 fs/ext3/balloc.c               |    2 +-
 fs/ext4/balloc.c               |    2 +-
 include/linux/percpu_counter.h |    5 +++++
 4 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6/fs/ext2/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/balloc.c
+++ linux-2.6/fs/ext2/balloc.c
@@ -163,7 +163,7 @@ static int reserve_blocks(struct super_b
 			return 0;
 	}
 
-	percpu_counter_add(&sbi->s_freeblocks_counter, -count);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, count);
 	sb->s_dirt = 1;
 	return count;
 }
@@ -1402,7 +1402,7 @@ allocated:
 	}
 
 	group_adjust_blocks(sb, group_no, gdp, gdp_bh, -num);
-	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
 
 	mark_buffer_dirty(bitmap_bh);
 	if (sb->s_flags & MS_SYNCHRONOUS)
Index: linux-2.6/fs/ext3/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/balloc.c
+++ linux-2.6/fs/ext3/balloc.c
@@ -1672,7 +1672,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext4/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/balloc.c
+++ linux-2.6/fs/ext4/balloc.c
@@ -1697,7 +1697,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_add(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext4_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -105,4 +105,9 @@ static inline void percpu_counter_dec(st
 	percpu_counter_add(fbc, -1);
 }
 
+static inline void percpu_counter_sub(struct percpu_counter *fbc, s64 amount)
+{
+	percpu_counter_add(fbc, -amount);
+}
+
 #endif /* _LINUX_PERCPU_COUNTER_H */

--



* [PATCH 04/23] lib: percpu_counter variable batch
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_batch.patch --]
[-- Type: text/plain, Size: 2503 bytes --]

Because the current batch setup has a quadratic error bound on the counter
(each CPU may locally hold up to batch-1, and the default FBC_BATCH scales
with NR_CPUS, giving a worst-case drift of roughly 4 * NR_CPUS^2), allow for
an alternative setup with a caller-supplied batch.
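
A hypothetical caller that needs a tighter bound can then pick its own batch
(my_counter and MY_STAT_BATCH are made up for illustration; later patches in
this series use this hook for the per-BDI statistics):

  /* worst-case drift becomes nr_cpus * 7 instead of ~4 * NR_CPUS^2 */
  #define MY_STAT_BATCH 8

  __percpu_counter_add(&my_counter, amount, MY_STAT_BATCH);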

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   10 +++++++++-
 lib/percpu_counter.c           |    6 +++---
 2 files changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h	2007-05-23 20:34:12.000000000 +0200
+++ linux-2.6/include/linux/percpu_counter.h	2007-05-23 20:36:06.000000000 +0200
@@ -32,9 +32,14 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void percpu_counter_add(struct percpu_counter *fbc, s32 amount);
+void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
+static inline void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+{
+	__percpu_counter_add(fbc, amount, FBC_BATCH);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -70,6 +75,9 @@ static inline void percpu_counter_destro
 {
 }
 
+#define __percpu_counter_add(fbc, amount, batch) \
+	percpu_counter_add(fbc, amount)
+
 static inline void
 percpu_counter_add(struct percpu_counter *fbc, s32 amount)
 {
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c	2007-05-23 20:34:12.000000000 +0200
+++ linux-2.6/lib/percpu_counter.c	2007-05-23 20:36:21.000000000 +0200
@@ -14,7 +14,7 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch)
 {
 	long count;
 	s32 *pcount;
@@ -22,7 +22,7 @@ void percpu_counter_add(struct percpu_co
 
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
 		*pcount = 0;
@@ -32,7 +32,7 @@ void percpu_counter_add(struct percpu_co
 	}
 	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_add);
+EXPORT_SYMBOL(__percpu_counter_add);
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate

--



* [PATCH 05/23] lib: make percpu_counter_add take s64
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_add64.patch --]
[-- Type: text/plain, Size: 1864 bytes --]

percpu_counter is an s64 counter; make _add consistent with that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    6 +++---
 lib/percpu_counter.c           |    4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -32,10 +32,10 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
-void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch);
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
-static inline void percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	__percpu_counter_add(fbc, amount, FBC_BATCH);
 }
@@ -79,7 +79,7 @@ static inline void percpu_counter_destro
 	percpu_counter_add(fbc, amount)
 
 static inline void
-percpu_counter_add(struct percpu_counter *fbc, s32 amount)
+percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	preempt_disable();
 	fbc->count += amount;
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -14,9 +14,9 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
-void __percpu_counter_add(struct percpu_counter *fbc, s32 amount, s32 batch)
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
-	long count;
+	s64 count;
 	s32 *pcount;
 	int cpu = get_cpu();
 

--



* [PATCH 06/23] lib: percpu_counter_set
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_set.patch --]
[-- Type: text/plain, Size: 1791 bytes --]

Provide a method to set a percpu counter to a specified value.
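
Since the per-cpu deltas are zeroed under fbc->lock before the new value is
written, a subsequent sum sees exactly the value that was set. A hypothetical
caller re-syncing a counter to a freshly computed value:

  percpu_counter_set(&sbi->s_freeblocks_counter,
                     ext2_count_free_blocks(sb));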

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    6 ++++++
 lib/percpu_counter.c           |   14 ++++++++++++++
 2 files changed, 20 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -32,6 +32,7 @@ struct percpu_counter {
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
 s64 percpu_counter_sum(struct percpu_counter *fbc);
 
@@ -75,6 +76,11 @@ static inline void percpu_counter_destro
 {
 }
 
+static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	fbc->count = amount;
+}
+
 #define __percpu_counter_add(fbc, amount, batch) \
 	percpu_counter_add(fbc, amount)
 
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -14,6 +14,20 @@ static LIST_HEAD(percpu_counters);
 static DEFINE_MUTEX(percpu_counters_lock);
 #endif
 
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	int cpu;
+
+	spin_lock(&fbc->lock);
+	for_each_possible_cpu(cpu) {
+		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		*pcount = 0;
+	}
+	fbc->count = amount;
+	spin_unlock(&fbc->lock);
+}
+EXPORT_SYMBOL(percpu_counter_set);
+
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;

--



* [PATCH 07/23] lib: percpu_counter_sum_positive
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_sum_positive.patch --]
[-- Type: text/plain, Size: 4665 bytes --]

 s/percpu_counter_sum/&_positive/

Because it's consistent with the percpu_counter_read* naming.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext3/super.c                |    4 ++--
 fs/ext4/super.c                |    4 ++--
 fs/file_table.c                |    2 +-
 include/linux/percpu_counter.h |    4 ++--
 lib/percpu_counter.c           |    4 ++--
 5 files changed, 9 insertions(+), 9 deletions(-)

Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -2472,13 +2472,13 @@ static int ext3_statfs (struct dentry * 
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
-	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
 	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
 	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
-	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
 	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT3_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -2592,13 +2592,13 @@ static int ext4_statfs (struct dentry * 
 	buf->f_type = EXT4_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = ext4_blocks_count(es) - sbi->s_overhead_last;
-	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
 	es->s_free_blocks_count = cpu_to_le32(buf->f_bfree);
 	buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
 	if (buf->f_bfree < ext4_r_blocks_count(es))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
-	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
 	es->s_free_inodes_count = cpu_to_le32(buf->f_ffree);
 	buf->f_namelen = EXT4_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -98,7 +98,7 @@ struct file *get_empty_filp(void)
 		 * percpu_counters are inaccurate.  Do an expensive check before
 		 * we go and fail.
 		 */
-		if (percpu_counter_sum(&nr_files) >= files_stat.max_files)
+		if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
 			goto over;
 	}
 
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -34,7 +34,7 @@ void percpu_counter_init(struct percpu_c
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 percpu_counter_sum(struct percpu_counter *fbc);
+s64 percpu_counter_sum_positive(struct percpu_counter *fbc);
 
 static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
@@ -102,7 +102,7 @@ static inline s64 percpu_counter_read_po
 	return fbc->count;
 }
 
-static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
 {
 	return percpu_counter_read_positive(fbc);
 }
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(__percpu_counter_add);
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
  */
-s64 percpu_counter_sum(struct percpu_counter *fbc)
+s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
@@ -66,7 +66,7 @@ s64 percpu_counter_sum(struct percpu_cou
 	spin_unlock(&fbc->lock);
 	return ret < 0 ? 0 : ret;
 }
-EXPORT_SYMBOL(percpu_counter_sum);
+EXPORT_SYMBOL(percpu_counter_sum_positive);
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

--



* [PATCH 08/23] lib: percpu_count_sum()
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_sum.patch --]
[-- Type: text/plain, Size: 2497 bytes --]

Provide percpu_counter_sum(): an accurate version of percpu_counter_read()
that, unlike percpu_counter_sum_positive(), does not clamp negative results.
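
That is, percpu_counter_sum() returns the raw (possibly negative) sum, while
percpu_counter_sum_positive() keeps the old clamped behaviour:

  s64 raw     = percpu_counter_sum(&counter);           /* may be < 0 */
  s64 clamped = percpu_counter_sum_positive(&counter);  /* never < 0  */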

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |   18 +++++++++++++++++-
 lib/percpu_counter.c           |    6 +++---
 2 files changed, 20 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -34,13 +34,24 @@ void percpu_counter_init(struct percpu_c
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
-s64 percpu_counter_sum_positive(struct percpu_counter *fbc);
+s64 __percpu_counter_sum(struct percpu_counter *fbc);
 
 static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	__percpu_counter_add(fbc, amount, FBC_BATCH);
 }
 
+static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
+{
+	s64 ret = __percpu_counter_sum(fbc);
+	return ret < 0 ? 0 : ret;
+}
+
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum(fbc);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;
@@ -107,6 +118,11 @@ static inline s64 percpu_counter_sum_pos
 	return percpu_counter_read_positive(fbc);
 }
 
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return percpu_counter_read(fbc);
+}
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -52,7 +52,7 @@ EXPORT_SYMBOL(__percpu_counter_add);
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
  */
-s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
@@ -64,9 +64,9 @@ s64 percpu_counter_sum_positive(struct p
 		ret += *pcount;
 	}
 	spin_unlock(&fbc->lock);
-	return ret < 0 ? 0 : ret;
+	return ret;
 }
-EXPORT_SYMBOL(percpu_counter_sum_positive);
+EXPORT_SYMBOL(__percpu_counter_sum);
 
 void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {

--



* [PATCH 09/23] lib: percpu_counter_init error handling
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_init.patch --]
[-- Type: text/plain, Size: 5592 bytes --]

alloc_percpu() can fail; propagate that error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext2/super.c                |   11 ++++++++---
 fs/ext3/super.c                |   11 ++++++++---
 fs/ext4/super.c                |   11 ++++++++---
 include/linux/percpu_counter.h |    5 +++--
 lib/percpu_counter.c           |    8 +++++++-
 5 files changed, 34 insertions(+), 12 deletions(-)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -725,6 +725,7 @@ static int ext2_fill_super(struct super_
 	int db_count;
 	int i, j;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -996,12 +997,16 @@ static int ext2_fill_super(struct super_
 	sbi->s_rsv_window_head.rsv_goal_size = 0;
 	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 				ext2_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 				ext2_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 				ext2_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 	/*
 	 * set up enough so that it can read an inode
 	 */
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -1485,6 +1485,7 @@ static int ext3_fill_super (struct super
 	int i;
 	int needs_recovery;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1745,12 +1746,16 @@ static int ext3_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 		ext3_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 		ext3_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 		ext3_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT3-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -1576,6 +1576,7 @@ static int ext4_fill_super (struct super
 	int needs_recovery;
 	__le32 features;
 	__u64 blocks_count;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1857,12 +1858,16 @@ static int ext4_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 		ext4_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 		ext4_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 		ext4_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT4-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -30,7 +30,7 @@ struct percpu_counter {
 #define FBC_BATCH	(NR_CPUS*4)
 #endif
 
-void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -78,9 +78,10 @@ struct percpu_counter {
 	s64 count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+static inline int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	fbc->count = amount;
+	return 0;
 }
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -68,21 +68,27 @@ s64 __percpu_counter_sum(struct percpu_c
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
-void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
 	fbc->count = amount;
 	fbc->counters = alloc_percpu(s32);
+	if (!fbc->counters)
+		return -ENOMEM;
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
 	list_add(&fbc->list, &percpu_counters);
 	mutex_unlock(&percpu_counters_lock);
 #endif
+	return 0;
 }
 EXPORT_SYMBOL(percpu_counter_init);
 
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
+	if (!fbc->counters)
+		return;
+
 	free_percpu(fbc->counters);
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);

--



* [PATCH 10/23] lib: percpu_counter_init_irq
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: percpu_counter_init_irq.patch --]
[-- Type: text/plain, Size: 1949 bytes --]

Provide a way to tell lockdep about percpu_counters that are supposed to be
used from IRQ-safe contexts.
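
All percpu_counter spinlocks share a single lockdep class by default (they
are initialized from one spin_lock_init() site), so one counter taken with
IRQs disabled would make lockdep flag every other counter as inconsistently
used. A hypothetical IRQ-safe user:

  static struct percpu_counter nr_foo_events;  /* made-up name */

  static int __init foo_init(void)
  {
          /* place the lock in its own, IRQ-safe lockdep class */
          return percpu_counter_init_irq(&nr_foo_events, 0);
  }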

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/percpu_counter.h |    3 +++
 lib/percpu_counter.c           |   12 ++++++++++++
 2 files changed, 15 insertions(+)

Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -31,6 +31,7 @@ struct percpu_counter {
 #endif
 
 int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
+int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -84,6 +85,8 @@ static inline int percpu_counter_init(st
 	return 0;
 }
 
+#define percpu_counter_init_irq percpu_counter_init
+
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 }
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -68,6 +68,8 @@ s64 __percpu_counter_sum(struct percpu_c
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
+static struct lock_class_key percpu_counter_irqsafe;
+
 int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
@@ -84,6 +86,16 @@ int percpu_counter_init(struct percpu_co
 }
 EXPORT_SYMBOL(percpu_counter_init);
 
+int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount)
+{
+	int err;
+
+	err = percpu_counter_init(fbc, amount);
+	if (!err)
+		lockdep_set_class(&fbc->lock, &percpu_counter_irqsafe);
+	return err;
+}
+
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 	if (!fbc->counters)

--



* [PATCH 11/23] mm: bdi init hooks
From: Peter Zijlstra @ 2007-08-16  7:45 UTC
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_init.patch --]
[-- Type: text/plain, Size: 14260 bytes --]

Provide BDI constructor/destructor hooks.

[akpm@linux-foundation.org: compile fix]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c               |   13 ++++++++++---
 drivers/block/rd.c              |   20 +++++++++++++++++++-
 drivers/char/mem.c              |    5 +++++
 fs/char_dev.c                   |    1 +
 fs/configfs/configfs_internal.h |    2 ++
 fs/configfs/inode.c             |    8 ++++++++
 fs/configfs/mount.c             |    9 +++++++++
 fs/fuse/inode.c                 |    9 +++++++++
 fs/hugetlbfs/inode.c            |    9 ++++++++-
 fs/nfs/client.c                 |    6 ++++++
 fs/ocfs2/dlm/dlmfs.c            |    9 ++++++++-
 fs/ramfs/inode.c                |   12 +++++++++++-
 fs/sysfs/inode.c                |    5 +++++
 fs/sysfs/mount.c                |    4 ++++
 fs/sysfs/sysfs.h                |    1 +
 include/linux/backing-dev.h     |    8 ++++++++
 mm/readahead.c                  |    6 ++++++
 mm/shmem.c                      |    6 ++++++
 mm/swap.c                       |    4 ++++
 19 files changed, 130 insertions(+), 7 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -1780,6 +1780,7 @@ static void blk_release_queue(struct kob
 
 	blk_trace_shutdown(q);
 
+	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1833,21 +1834,27 @@ static struct kobj_type queue_ktype;
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
+	int err;
 
 	q = kmem_cache_alloc_node(requestq_cachep,
 				gfp_mask | __GFP_ZERO, node_id);
 	if (!q)
 		return NULL;
 
+	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
+	q->backing_dev_info.unplug_io_data = q;
+	err = bdi_init(&q->backing_dev_info);
+	if (err) {
+		kmem_cache_free(requestq_cachep, q);
+		return NULL;
+	}
+
 	init_timer(&q->unplug_timer);
 
 	snprintf(q->kobj.name, KOBJ_NAME_LEN, "%s", "queue");
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
-	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
-	q->backing_dev_info.unplug_io_data = q;
-
 	mutex_init(&q->sysfs_lock);
 
 	return q;
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
 		blk_cleanup_queue(rd_queue[i]);
 	}
 	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+	bdi_destroy(&rd_file_backing_dev_info);
+	bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -419,7 +422,19 @@ static void __exit rd_cleanup(void)
 static int __init rd_init(void)
 {
 	int i;
-	int err = -ENOMEM;
+	int err;
+
+	err = bdi_init(&rd_backing_dev_info);
+	if (err)
+		goto out2;
+
+	err = bdi_init(&rd_file_backing_dev_info);
+	if (err) {
+		bdi_destroy(&rd_backing_dev_info);
+		goto out2;
+	}
+
+	err = -ENOMEM;
 
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
@@ -473,6 +488,9 @@ out:
 		put_disk(rd_disks[i]);
 		blk_cleanup_queue(rd_queue[i]);
 	}
+	bdi_destroy(&rd_backing_dev_info);
+	bdi_destroy(&rd_file_backing_dev_info);
+out2:
 	return err;
 }
 
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c
+++ linux-2.6/drivers/char/mem.c
@@ -984,6 +984,11 @@ static struct class *mem_class;
 static int __init chr_dev_init(void)
 {
 	int i;
+	int err;
+
+	err = bdi_init(&zero_bdi);
+	if (err)
+		return err;
 
 	if (register_chrdev(MEM_MAJOR,"mem",&memory_fops))
 		printk("unable to get major %d for memory devs\n", MEM_MAJOR);
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c
+++ linux-2.6/fs/char_dev.c
@@ -545,6 +545,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_init(&directly_mappable_cdev_bdi);
 }
 
 
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -418,6 +418,7 @@ static int fuse_show_options(struct seq_
 static struct fuse_conn *new_conn(void)
 {
 	struct fuse_conn *fc;
+	int err;
 
 	fc = kzalloc(sizeof(*fc), GFP_KERNEL);
 	if (fc) {
@@ -433,10 +434,17 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		err = bdi_init(&fc->bdi);
+		if (err) {
+			kfree(fc);
+			fc = NULL;
+			goto out;
+		}
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	}
+out:
 	return fc;
 }
 
@@ -446,6 +454,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		mutex_destroy(&fc->inst_mutex);
+		bdi_destroy(&fc->bdi);
 		kfree(fc);
 	}
 }
Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c
+++ linux-2.6/fs/nfs/client.c
@@ -632,6 +632,7 @@ static void nfs_server_set_fsinfo(struct
 	if (server->rsize > NFS_MAX_FILE_IO_SIZE)
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
@@ -682,6 +683,10 @@ static int nfs_probe_fsinfo(struct nfs_s
 		goto out_error;
 
 	nfs_server_set_fsinfo(server, &fsinfo);
+	error = bdi_init(&server->backing_dev_info);
+	if (error)
+		goto out_error;
+
 
 	/* Get some general file system info */
 	if (server->namelen == 0) {
@@ -761,6 +766,7 @@ void nfs_free_server(struct nfs_server *
 	nfs_put_client(server->nfs_client);
 
 	nfs_free_iostats(server->io_stats);
+	bdi_destroy(&server->backing_dev_info);
 	kfree(server);
 	nfs_release_automount_timer();
 	dprintk("<-- nfs_free_server()\n");
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -34,6 +34,14 @@ struct backing_dev_info {
 	void *unplug_io_data;
 };
 
+static inline int bdi_init(struct backing_dev_info *bdi)
+{
+	return 0;
+}
+
+static inline void bdi_destroy(struct backing_dev_info *bdi)
+{
+}
 
 /*
  * Flags in backing_dev_info::capability
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -965,11 +965,15 @@ static int __init init_hugetlbfs_fs(void
 	int error;
 	struct vfsmount *vfsmount;
 
+	error = bdi_init(&hugetlbfs_backing_dev_info);
+	if (error)
+		return error;
+
 	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
 					sizeof(struct hugetlbfs_inode_info),
 					0, 0, init_once);
 	if (hugetlbfs_inode_cachep == NULL)
-		return -ENOMEM;
+		goto out2;
 
 	error = register_filesystem(&hugetlbfs_fs_type);
 	if (error)
@@ -987,6 +991,8 @@ static int __init init_hugetlbfs_fs(void
  out:
 	if (error)
 		kmem_cache_destroy(hugetlbfs_inode_cachep);
+ out2:
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 	return error;
 }
 
@@ -994,6 +1000,7 @@ static void __exit exit_hugetlbfs_fs(voi
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
 	unregister_filesystem(&hugetlbfs_fs_type);
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
 
 module_init(init_hugetlbfs_fs)
Index: linux-2.6/fs/ocfs2/dlm/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlm/dlmfs.c
+++ linux-2.6/fs/ocfs2/dlm/dlmfs.c
@@ -588,13 +588,17 @@ static int __init init_dlmfs_fs(void)
 
 	dlmfs_print_version();
 
+	status = bdi_init(&dlmfs_backing_dev_info);
+	if (status)
+		return status;
+
 	dlmfs_inode_cache = kmem_cache_create("dlmfs_inode_cache",
 				sizeof(struct dlmfs_inode_private),
 				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
 					SLAB_MEM_SPREAD),
 				dlmfs_init_once);
 	if (!dlmfs_inode_cache)
-		return -ENOMEM;
+		goto bail;
 	cleanup_inode = 1;
 
 	user_dlm_worker = create_singlethread_workqueue("user_dlm");
@@ -611,6 +615,7 @@ bail:
 			kmem_cache_destroy(dlmfs_inode_cache);
 		if (cleanup_worker)
 			destroy_workqueue(user_dlm_worker);
+		bdi_destroy(&dlmfs_backing_dev_info);
 	} else
 		printk("OCFS2 User DLM kernel interface loaded\n");
 	return status;
@@ -624,6 +629,8 @@ static void __exit exit_dlmfs_fs(void)
 	destroy_workqueue(user_dlm_worker);
 
 	kmem_cache_destroy(dlmfs_inode_cache);
+
+	bdi_destroy(&dlmfs_backing_dev_info);
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -56,6 +56,8 @@ extern int configfs_is_root(struct confi
 
 extern struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent *);
 extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+extern int configfs_inode_init(void);
+extern void configfs_inode_exit(void);
 
 extern int configfs_create_file(struct config_item *, const struct configfs_attribute *);
 extern int configfs_make_dirent(struct configfs_dirent *,
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -256,4 +256,12 @@ void configfs_hash_and_remove(struct den
 	mutex_unlock(&dir->d_inode->i_mutex);
 }
 
+int __init configfs_inode_init(void)
+{
+	return bdi_init(&configfs_backing_dev_info);
+}
 
+void __exit configfs_inode_exit(void)
+{
+	bdi_destroy(&configfs_backing_dev_info);
+}
Index: linux-2.6/fs/configfs/mount.c
===================================================================
--- linux-2.6.orig/fs/configfs/mount.c
+++ linux-2.6/fs/configfs/mount.c
@@ -154,8 +154,16 @@ static int __init configfs_init(void)
 		subsystem_unregister(&config_subsys);
 		kmem_cache_destroy(configfs_dir_cachep);
 		configfs_dir_cachep = NULL;
+		goto out;
 	}
 
+	err = configfs_inode_init();
+	if (err) {
+		unregister_filesystem(&configfs_fs_type);
+		subsystem_unregister(&config_subsys);
+		kmem_cache_destroy(configfs_dir_cachep);
+		configfs_dir_cachep = NULL;
+	}
 out:
 	return err;
 }
@@ -166,6 +174,7 @@ static void __exit configfs_exit(void)
 	subsystem_unregister(&config_subsys);
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
+	configfs_inode_exit();
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c
+++ linux-2.6/fs/ramfs/inode.c
@@ -223,7 +223,17 @@ module_exit(exit_ramfs_fs)
 
 int __init init_rootfs(void)
 {
-	return register_filesystem(&rootfs_fs_type);
+	int err;
+
+	err = bdi_init(&ramfs_backing_dev_info);
+	if (err)
+		return err;
+
+	err = register_filesystem(&rootfs_fs_type);
+	if (err)
+		bdi_destroy(&ramfs_backing_dev_info);
+
+	return err;
 }
 
 MODULE_LICENSE("GPL");
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -34,6 +34,11 @@ static const struct inode_operations sys
 	.setattr	= sysfs_setattr,
 };
 
+int __init sysfs_inode_init(void)
+{
+	return bdi_init(&sysfs_backing_dev_info);
+}
+
 int sysfs_setattr(struct dentry * dentry, struct iattr * iattr)
 {
 	struct inode * inode = dentry->d_inode;
Index: linux-2.6/fs/sysfs/mount.c
===================================================================
--- linux-2.6.orig/fs/sysfs/mount.c
+++ linux-2.6/fs/sysfs/mount.c
@@ -90,6 +90,10 @@ int __init sysfs_init(void)
 	if (!sysfs_dir_cachep)
 		goto out;
 
+	err = sysfs_inode_init();
+	if (err)
+		goto out_err;
+
 	err = register_filesystem(&sysfs_fs_type);
 	if (!err) {
 		sysfs_mount = kern_mount(&sysfs_fs_type);
Index: linux-2.6/fs/sysfs/sysfs.h
===================================================================
--- linux-2.6.orig/fs/sysfs/sysfs.h
+++ linux-2.6/fs/sysfs/sysfs.h
@@ -78,6 +78,7 @@ extern int sysfs_addrm_finish(struct sys
 
 extern struct inode * sysfs_get_inode(struct sysfs_dirent *sd);
 extern void sysfs_instantiate(struct dentry *dentry, struct inode *inode);
+extern int sysfs_inode_init(void);
 
 extern void release_sysfs_dirent(struct sysfs_dirent * sd);
 extern struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -2460,6 +2460,10 @@ static int __init init_tmpfs(void)
 {
 	int error;
 
+	error = bdi_init(&shmem_backing_dev_info);
+	if (error)
+		goto out4;
+
 	error = init_inodecache();
 	if (error)
 		goto out3;
@@ -2484,6 +2488,8 @@ out1:
 out2:
 	destroy_inodecache();
 out3:
+	bdi_destroy(&shmem_backing_dev_info);
+out4:
 	shm_mnt = ERR_PTR(error);
 	return error;
 }
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -548,6 +548,10 @@ void __init swap_setup(void)
 {
 	unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
 
+#ifdef CONFIG_SWAP
+	bdi_init(swapper_space.backing_dev_info);
+#endif
+
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
 		page_cluster = 2;
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -234,6 +234,12 @@ unsigned long max_sane_readahead(unsigne
 		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
 }
 
+static int __init readahead_init(void)
+{
+	return bdi_init(&default_backing_dev_info);
+}
+subsys_initcall(readahead_init);
+
 /*
  * Submit IO for the read-ahead request in file_ra_state.
  */

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 12/23] containers: bdi init hooks
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (10 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 11/23] mm: bdi init hooks Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 13/23] mtd: " Peter Zijlstra
                   ` (11 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_init_container.patch --]
[-- Type: text/plain, Size: 1541 bytes --]

Split off from the large bdi_init patch because containers are not slated
for mainline any time soon.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/container.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/container.c
===================================================================
--- linux-2.6.orig/kernel/container.c
+++ linux-2.6/kernel/container.c
@@ -567,12 +567,13 @@ static int container_populate_dir(struct
 static struct inode_operations container_dir_inode_operations;
 static struct file_operations proc_containerstats_operations;
 
+static struct backing_dev_info container_backing_dev_info = {
+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
+};
+
 static struct inode *container_new_inode(mode_t mode, struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
-	static struct backing_dev_info container_backing_dev_info = {
-		.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
-	};
 
 	if (inode) {
 		inode->i_mode = mode;
@@ -2261,6 +2262,10 @@ int __init container_init(void)
 	int i;
 	struct proc_dir_entry *entry;
 
+	err = bdi_init(&container_backing_dev_info);
+	if (err)
+		return err;
+
 	for (i = 0; i < CONTAINER_SUBSYS_COUNT; i++) {
 		struct container_subsys *ss = subsys[i];
 		if (!ss->early_init)
@@ -2276,6 +2281,9 @@ int __init container_init(void)
 		entry->proc_fops = &proc_containerstats_operations;
 
 out:
+	if (err)
+		bdi_destroy(&container_backing_dev_info);
+
 	return err;
 }
 

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 13/23] mtd: bdi init hooks
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (11 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 12/23] containers: " Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 14/23] mtd: clean up the backing_dev_info usage Peter Zijlstra
                   ` (10 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds,
	David Woodhouse

[-- Attachment #1: bdi_init_mtd.patch --]
[-- Type: text/plain, Size: 1164 bytes --]

Split off because the relevant mtd changes seem particular to -mm.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Woodhouse <dwmw2@infradead.org>
---
 drivers/mtd/mtdcore.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6/drivers/mtd/mtdcore.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdcore.c
+++ linux-2.6/drivers/mtd/mtdcore.c
@@ -48,6 +48,7 @@ static LIST_HEAD(mtd_notifiers);
 int add_mtd_device(struct mtd_info *mtd)
 {
 	int i;
+	int err;
 
 	if (!mtd->backing_dev_info) {
 		switch (mtd->type) {
@@ -62,6 +63,9 @@ int add_mtd_device(struct mtd_info *mtd)
 			break;
 		}
 	}
+	err = bdi_init(mtd->backing_dev_info);
+	if (err)
+		return 1;
 
 	BUG_ON(mtd->writesize == 0);
 	mutex_lock(&mtd_table_mutex);
@@ -102,6 +106,7 @@ int add_mtd_device(struct mtd_info *mtd)
 		}
 
 	mutex_unlock(&mtd_table_mutex);
+	bdi_destroy(mtd->backing_dev_info);
 	return 1;
 }
 
@@ -144,6 +149,10 @@ int del_mtd_device (struct mtd_info *mtd
 	}
 
 	mutex_unlock(&mtd_table_mutex);
+
+	if (mtd->backing_dev_info)
+		bdi_destroy(mtd->backing_dev_info);
+
 	return ret;
 }
 

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 14/23] mtd: clean up the backing_dev_info usage
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (12 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 13/23] mtd: " Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 15/23] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
                   ` (9 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds,
	David Woodhouse

[-- Attachment #1: mtd-bdi-fixups.patch --]
[-- Type: text/plain, Size: 1932 bytes --]

Give each mtd device its own backing_dev_info instance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Woodhouse <dwmw2@infradead.org>
---
 drivers/mtd/mtdcore.c   |    8 +++++---
 include/linux/mtd/mtd.h |    2 ++
 2 files changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/mtd/mtdcore.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdcore.c
+++ linux-2.6/drivers/mtd/mtdcore.c
@@ -19,6 +19,7 @@
 #include <linux/init.h>
 #include <linux/mtd/compatmac.h>
 #include <linux/proc_fs.h>
+#include <linux/backing-dev.h>
 
 #include <linux/mtd/mtd.h>
 #include "internal.h"
@@ -53,15 +54,16 @@ int add_mtd_device(struct mtd_info *mtd)
 	if (!mtd->backing_dev_info) {
 		switch (mtd->type) {
 		case MTD_RAM:
-			mtd->backing_dev_info = &mtd_bdi_rw_mappable;
+			mtd->mtd_backing_dev_info = mtd_bdi_rw_mappable;
 			break;
 		case MTD_ROM:
-			mtd->backing_dev_info = &mtd_bdi_ro_mappable;
+			mtd->mtd_backing_dev_info = mtd_bdi_ro_mappable;
 			break;
 		default:
-			mtd->backing_dev_info = &mtd_bdi_unmappable;
+			mtd->mtd_backing_dev_info = mtd_bdi_unmappable;
 			break;
 		}
+		mtd->backing_dev_info = &mtd->mtd_backing_dev_info;
 	}
 	err = bdi_init(mtd->backing_dev_info);
 	if (err)
Index: linux-2.6/include/linux/mtd/mtd.h
===================================================================
--- linux-2.6.orig/include/linux/mtd/mtd.h
+++ linux-2.6/include/linux/mtd/mtd.h
@@ -13,6 +13,7 @@
 #include <linux/module.h>
 #include <linux/uio.h>
 #include <linux/notifier.h>
+#include <linux/backing-dev.h>
 
 #include <linux/mtd/compatmac.h>
 #include <mtd/mtd-abi.h>
@@ -154,6 +155,7 @@ struct mtd_info {
 	 * - provides mmap capabilities
 	 */
 	struct backing_dev_info *backing_dev_info;
+	struct backing_dev_info mtd_backing_dev_info;
 
 
 	int (*read) (struct mtd_info *mtd, loff_t from, size_t len, size_t *retlen, u_char *buf);

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 15/23] mtd: give mtdconcat devices their own backing_dev_info
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (13 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 14/23] mtd: clean up the backing_dev_info usage Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 16/23] mm: scalable bdi statistics counters Peter Zijlstra
                   ` (8 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds,
	David Woodhouse, Robert Kaiser

[-- Attachment #1: bdi_mtdconcat.patch --]
[-- Type: text/plain, Size: 3324 bytes --]

These are actual devices; give them their own BDI.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Robert Kaiser <rkaiser@sysgo.de>
---
 drivers/mtd/mtdconcat.c |   28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/mtd/mtdconcat.c
===================================================================
--- linux-2.6.orig/drivers/mtd/mtdconcat.c	2007-04-22 18:55:17.000000000 +0200
+++ linux-2.6/drivers/mtd/mtdconcat.c	2007-04-22 19:01:42.000000000 +0200
@@ -32,6 +32,7 @@ struct mtd_concat {
 	struct mtd_info mtd;
 	int num_subdev;
 	struct mtd_info **subdev;
+	struct backing_dev_info backing_dev_info;
 };
 
 /*
@@ -782,10 +783,9 @@ struct mtd_info *mtd_concat_create(struc
 
 	for (i = 1; i < num_devs; i++) {
 		if (concat->mtd.type != subdev[i]->type) {
-			kfree(concat);
 			printk("Incompatible device type on \"%s\"\n",
 			       subdev[i]->name);
-			return NULL;
+			goto error;
 		}
 		if (concat->mtd.flags != subdev[i]->flags) {
 			/*
@@ -794,10 +794,9 @@ struct mtd_info *mtd_concat_create(struc
 			 */
 			if ((concat->mtd.flags ^ subdev[i]->
 			     flags) & ~MTD_WRITEABLE) {
-				kfree(concat);
 				printk("Incompatible device flags on \"%s\"\n",
 				       subdev[i]->name);
-				return NULL;
+				goto error;
 			} else
 				/* if writeable attribute differs,
 				   make super device writeable */
@@ -809,9 +808,12 @@ struct mtd_info *mtd_concat_create(struc
 		 * - copy-mapping is still permitted
 		 */
 		if (concat->mtd.backing_dev_info !=
-		    subdev[i]->backing_dev_info)
+		    subdev[i]->backing_dev_info) {
+			concat->backing_dev_info = default_backing_dev_info;
+			bdi_init(&concat->backing_dev_info);
 			concat->mtd.backing_dev_info =
-				&default_backing_dev_info;
+				&concat->backing_dev_info;
+		}
 
 		concat->mtd.size += subdev[i]->size;
 		concat->mtd.ecc_stats.badblocks +=
@@ -821,10 +823,9 @@ struct mtd_info *mtd_concat_create(struc
 		    concat->mtd.oobsize    !=  subdev[i]->oobsize ||
 		    !concat->mtd.read_oob  != !subdev[i]->read_oob ||
 		    !concat->mtd.write_oob != !subdev[i]->write_oob) {
-			kfree(concat);
 			printk("Incompatible OOB or ECC data on \"%s\"\n",
 			       subdev[i]->name);
-			return NULL;
+			goto error;
 		}
 		concat->subdev[i] = subdev[i];
 
@@ -903,11 +904,10 @@ struct mtd_info *mtd_concat_create(struc
 		    kmalloc(num_erase_region *
 			    sizeof (struct mtd_erase_region_info), GFP_KERNEL);
 		if (!erase_region_p) {
-			kfree(concat);
 			printk
 			    ("memory allocation error while creating erase region list"
 			     " for device \"%s\"\n", name);
-			return NULL;
+			goto error;
 		}
 
 		/*
@@ -968,6 +968,12 @@ struct mtd_info *mtd_concat_create(struc
 	}
 
 	return &concat->mtd;
+
+error:
+	if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+		bdi_destroy(&concat->backing_dev_info);
+	kfree(concat);
+	return NULL;
 }
 
 /*
@@ -977,6 +983,8 @@ struct mtd_info *mtd_concat_create(struc
 void mtd_concat_destroy(struct mtd_info *mtd)
 {
 	struct mtd_concat *concat = CONCAT(mtd);
+	if (concat->mtd.backing_dev_info == &concat->backing_dev_info)
+		bdi_destroy(&concat->backing_dev_info);
 	if (concat->mtd.numeraseregions)
 		kfree(concat->mtd.eraseregions);
 	kfree(concat);

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 16/23] mm: scalable bdi statistics counters.
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (14 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 15/23] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-17 16:20   ` Josef Sipek
  2007-08-16  7:45 ` [PATCH 17/23] mm: count reclaimable pages per BDI Peter Zijlstra
                   ` (7 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat.patch --]
[-- Type: text/plain, Size: 4063 bytes --]

Provide scalable per backing_dev_info statistics counters.
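
A rough sketch of the intended usage (illustrative fragment only, not
part of the patch; BDI_RECLAIMABLE is the example stat item added by a
later patch in this series):

	inc_bdi_stat(bdi, BDI_RECLAIMABLE);	/* irq-safe increment */
	__dec_bdi_stat(bdi, BDI_RECLAIMABLE);	/* caller disabled irqs */

	/* cheap read; may be stale by up to bdi_stat_error() pages */
	nr = bdi_stat(bdi, BDI_RECLAIMABLE);

	/* exact but expensive; folds in all the percpu deltas */
	nr = bdi_stat_sum(bdi, BDI_RECLAIMABLE);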

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |   85 ++++++++++++++++++++++++++++++++++++++++++--
 mm/backing-dev.c            |   27 +++++++++++++
 2 files changed, 109 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -8,6 +8,8 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
+#include <linux/log2.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +26,12 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+	NR_BDI_STAT_ITEMS
+};
+
+#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
@@ -32,15 +40,86 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 };
 
-static inline int bdi_init(struct backing_dev_info *bdi)
+int bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __add_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
 {
-	return 0;
+	__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
 }
 
-static inline void bdi_destroy(struct backing_dev_info *bdi)
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
 {
+	__add_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__add_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	s64 sum;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sum = __bdi_stat_sum(bdi, item);
+	local_irq_restore(flags);
+
+	return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+{
+#ifdef CONFIG_SMP
+	return nr_cpu_ids * BDI_STAT_BATCH;
+#else
+	return 1;
+#endif
 }
 
 /*
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -5,6 +5,33 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+int bdi_init(struct backing_dev_info *bdi)
+{
+	int i, j;
+	int err = 0;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
+		err = percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
+		if (err) {
+			for (j = 0; j < i; j++)
+				percpu_counter_destroy(&bdi->bdi_stat[j]);
+			break;
+		}
+	}
+
+	return err;
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+	int i;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_destroy(&bdi->bdi_stat[i]);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 17/23] mm: count reclaimable pages per BDI
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (15 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 16/23] mm: scalable bdi statistics counters Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-17 16:23   ` Josef Sipek
  2007-08-16  7:45 ` [PATCH 18/23] mm: count writeback " Peter Zijlstra
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_reclaimable.patch --]
[-- Type: text/plain, Size: 4062 bytes --]

Count per BDI reclaimable pages; nr_reclaimable = nr_dirty + nr_unstable.
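
In other words (illustrative; the sites below are the ones touched by
this patch): every place that increments NR_FILE_DIRTY or
NR_UNSTABLE_NFS now also increments BDI_RECLAIMABLE on the page's
backing device, and every clearing site decrements it, so that per BDI:

	bdi_stat(bdi, BDI_RECLAIMABLE) ~= nr_dirty(bdi) + nr_unstable(bdi)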

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                 |    2 ++
 fs/nfs/write.c              |    7 +++++++
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |    4 ++++
 mm/truncate.c               |    2 ++
 5 files changed, 16 insertions(+)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -697,6 +697,8 @@ static int __set_page_dirty(struct page 
 
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -827,6 +827,8 @@ int __set_page_dirty_nobuffers(struct pa
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
@@ -961,6 +963,8 @@ int clear_page_dirty_for_io(struct page 
 		 */
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			return 1;
 		}
 		return 0;
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -72,6 +72,8 @@ void cancel_dirty_page(struct page *page
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -464,6 +464,7 @@ nfs_mark_request_commit(struct nfs_page 
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -550,6 +551,8 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		nfs_inode_remove_request(req);
@@ -1210,6 +1213,8 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_clear_page_tag_locked(req);
 	}
 	return -ENOMEM;
@@ -1235,6 +1240,8 @@ static void nfs_commit_done(struct rpc_t
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 
 		dprintk("NFS: commit (%s/%Ld %d@%Ld)",
 			req->wb_context->path.dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -27,6 +27,7 @@ enum bdi_state {
 typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
+	BDI_RECLAIMABLE,
 	NR_BDI_STAT_ITEMS
 };
 

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 18/23] mm: count writeback pages per BDI
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (16 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 17/23] mm: count reclaimable pages per BDI Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 19/23] mm: expose BDI statistics in sysfs Peter Zijlstra
                   ` (5 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_writeback.patch --]
[-- Type: text/plain, Size: 1930 bytes --]

Count per BDI writeback pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    1 +
 mm/page-writeback.c         |   12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -981,14 +981,18 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -1004,14 +1008,18 @@ int test_set_page_writeback(struct page 
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -28,6 +28,7 @@ typedef int (congested_fn)(void *, int);
 
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
+	BDI_WRITEBACK,
 	NR_BDI_STAT_ITEMS
 };
 

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 19/23] mm: expose BDI statistics in sysfs.
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (17 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 18/23] mm: count writeback " Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 20/23] lib: floating proportions Peter Zijlstra
                   ` (4 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_sysfs.patch --]
[-- Type: text/plain, Size: 1963 bytes --]

Expose the per BDI stats in /sys/block/<dev>/queue/*
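
For example (device name hypothetical):

	$ cat /sys/block/sda/queue/reclaimable_kb
	$ cat /sys/block/sda/queue/writeback_kb

Both files are read-only and report the respective page counts
converted to KiB.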

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -3982,6 +3982,23 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_reclaimable =
+		bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE);
+
+	return sprintf(page, "%llu\n",
+			nr_reclaimable >> (PAGE_CACHE_SHIFT - 10));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_writeback =
+		bdi_stat(&q->backing_dev_info, BDI_WRITEBACK);
+
+	return sprintf(page, "%llu\n",
+			nr_writeback >> (PAGE_CACHE_SHIFT - 10));
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4006,6 +4023,16 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+	.attr = {.name = "reclaimable_kb", .mode = S_IRUGO },
+	.show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_kb", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4017,6 +4044,8 @@ static struct attribute *default_attrs[]
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
+	&queue_reclaimable_entry.attr,
+	&queue_writeback_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 20/23] lib: floating proportions
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (18 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 19/23] mm: expose BDI statistics in sysfs Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 21/23] mm: per device dirty threshold Peter Zijlstra
                   ` (3 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: proportions.patch --]
[-- Type: text/plain, Size: 14197 bytes --]

Given a set of objects, floating proportions aims to efficiently give the
proportional 'activity' of a single item as compared to the whole set,
where 'activity' is a measure of a temporal property of the items.

It is efficient in that it need not inspect any other items of the set
in order to provide the answer. It does not even need to know how many
other items there are.

It has one parameter: the period of 'time' over which the 'activity' is
measured.
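
A rough usage sketch (every name below is hypothetical; only the prop_*
calls are from this patch). One global descriptor is shared by all
items, each item carries a local counter, events are accounted with
prop_inc_percpu(), and an item's share is read back as a fraction:

	static struct prop_descriptor my_events;

	struct my_item {
		struct prop_local_percpu prop;
	};

	/* once, at init time; the period is 2^shift events */
	err = prop_descriptor_init(&my_events, shift);

	/* once per item */
	err = prop_local_init_percpu(&item->prop);

	/* on each event: ++x_{j}, ++t */
	prop_inc_percpu(&my_events, &item->prop);

	/* this item's current proportion, as numerator/denominator */
	prop_fraction_percpu(&my_events, &item->prop, &num, &denom);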

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

Changes since -v8:

 - merged _single into this patch
 - major cleanup
  - removed the overloading of the prop_local methods
  - removed the prop_adjust_shift macro [akpm]
  - simplified the _single code
 - limited the shift argument
 - provided prop_inc_{percpu,single}
 - static initializer for prop_local_single

 include/linux/proportions.h |  119 +++++++++++++
 lib/Makefile                |    3 
 lib/proportions.c           |  384 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 505 insertions(+), 1 deletion(-)

Index: linux-2.6/lib/proportions.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/proportions.c
@@ -0,0 +1,384 @@
+/*
+ * Floating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * The floating proportion is a time derivative with an exponentially decaying
+ * history:
+ *
+ *   p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
+ *
+ * Where j is an element from {prop_local}, x_{j} is j's number of events,
+ * and i the time period over which the differential is taken. So d/dt_{-i} is
+ * the differential over the i-th last period.
+ *
+ * The decaying history gives smooth transitions. The time differential carries
+ * the notion of speed.
+ *
+ * The denominator is 2^(1+i) because we want the series to be normalised, ie.
+ *
+ *   \Sum_{i=0} 1/2^(1+i) = 1
+ *
+ * Furthermore, if we measure time (t) in the same events as x, so that:
+ *
+ *   t = \Sum_{j} x_{j}
+ *
+ * we get that:
+ *
+ *   \Sum_{j} p_{j} = 1
+ *
+ * Writing this in an iterative fashion we get (dropping the 'd's):
+ *
+ *   if (++x_{j}, ++t > period)
+ *     t /= 2;
+ *     for_each (j)
+ *       x_{j} /= 2;
+ *
+ * so that:
+ *
+ *   p_{j} = x_{j} / t;
+ *
+ * We optimize away the '/= 2' for the global time delta by noting that:
+ *
+ *   if (++t > period) t /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   period/2 + (++t % period/2)
+ *
+ * [ Furthermore, when we choose period to be 2^n it can be written in terms of
+ *   binary operations and wraparound artefacts disappear. ]
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   c = t / (period/2)
+ *
+ * [ Its monotonically increasing property can be applied to mitigate the
+ *   wrap-around issue. ]
+ *
+ * This allows us to do away with the loop over all prop_locals on each period
+ * expiration. By remembering the period count under which it was last accessed
+ * as c_{j}, we can obtain the number of 'missed' cycles from:
+ *
+ *   c - c_{j}
+ *
+ * We can then lazily catch up to the global period count every time we are
+ * going to use x_{j}, by doing:
+ *
+ *   x_{j} /= 2^(c - c_{j}), c_{j} = c
+ */
+
+#include <linux/proportions.h>
+#include <linux/rcupdate.h>
+
+/*
+ * Limit the time part in order to ensure there are some bits left for the
+ * cycle counter.
+ */
+#define PROP_MAX_SHIFT (3*BITS_PER_LONG/4)
+
+int prop_descriptor_init(struct prop_descriptor *pd, int shift)
+{
+	int err;
+
+	if (shift > PROP_MAX_SHIFT)
+		shift = PROP_MAX_SHIFT;
+
+	pd->index = 0;
+	pd->pg[0].shift = shift;
+	mutex_init(&pd->mutex);
+	err = percpu_counter_init_irq(&pd->pg[0].events, 0);
+	if (err)
+		goto out;
+
+	err = percpu_counter_init_irq(&pd->pg[1].events, 0);
+	if (err)
+		percpu_counter_destroy(&pd->pg[0].events);
+
+out:
+	return err;
+}
+
+/*
+ * We have two copies, and flip between them to make it seem like an atomic
+ * update. The update is not really atomic wrt the events counter, but
+ * it is internally consistent with the bit layout depending on shift.
+ *
+ * We copy the events count, move the bits around and flip the index.
+ */
+void prop_change_shift(struct prop_descriptor *pd, int shift)
+{
+	int index;
+	int offset;
+	u64 events;
+	unsigned long flags;
+
+	if (shift > PROP_MAX_SHIFT)
+		shift = PROP_MAX_SHIFT;
+
+	mutex_lock(&pd->mutex);
+
+	index = pd->index ^ 1;
+	offset = pd->pg[pd->index].shift - shift;
+	if (!offset)
+		goto out;
+
+	pd->pg[index].shift = shift;
+
+	local_irq_save(flags);
+	events = percpu_counter_sum(&pd->pg[pd->index].events);
+	if (offset < 0)
+		events <<= -offset;
+	else
+		events >>= offset;
+	percpu_counter_set(&pd->pg[index].events, events);
+
+	/*
+	 * ensure the new pg is fully written before the switch
+	 */
+	smp_wmb();
+	pd->index = index;
+	local_irq_restore(flags);
+
+	synchronize_rcu();
+
+out:
+	mutex_unlock(&pd->mutex);
+}
+
+/*
+ * wrap the access to the data in an rcu_read_lock() section;
+ * this is used to track the active references.
+ */
+static struct prop_global *prop_get_global(struct prop_descriptor *pd)
+{
+	int index;
+
+	rcu_read_lock();
+	index = pd->index;
+	/*
+	 * match the wmb from prop_change_shift()
+	 */
+	smp_rmb();
+	return &pd->pg[index];
+}
+
+static void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg)
+{
+	rcu_read_unlock();
+}
+
+static void
+prop_adjust_shift(int *pl_shift, unsigned long *pl_period, int new_shift)
+{
+	int offset = *pl_shift - new_shift;
+
+	if (!offset)
+		return;
+
+	if (offset < 0)
+		*pl_period <<= -offset;
+	else
+		*pl_period >>= offset;
+
+	*pl_shift = new_shift;
+}
+
+/*
+ * PERCPU
+ */
+
+int prop_local_init_percpu(struct prop_local_percpu *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	return percpu_counter_init_irq(&pl->events, 0);
+}
+
+void prop_local_destroy_percpu(struct prop_local_percpu *pl)
+{
+	percpu_counter_destroy(&pl->events);
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ *   until (c_{j} == c)
+ *     x_{j} -= x_{j}/2;
+ *     c_{j}++;
+ */
+static
+void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(&pl->shift, &pl->period, pg->shift);
+	/*
+	 * For each missed period, we half the local counter.
+	 * basically:
+	 *   pl->events >> (global_period - pl->period);
+	 *
+	 * but since the distributed nature of percpu counters makes division
+	 * rather hard, use a regular subtraction loop. This is safe, because
+	 * the events will only ever be incremented, hence the subtraction
+	 * can never result in a negative number.
+	 */
+	while (pl->period != global_period) {
+		unsigned long val = percpu_counter_read(&pl->events);
+		unsigned long half = (val + 1) >> 1;
+
+		/*
+		 * Half of zero won't be much less, break out.
+		 * This limits the loop to shift iterations, even
+		 * if we missed a million.
+		 */
+		if (!val)
+			break;
+
+		percpu_counter_add(&pl->events, -half);
+		pl->period += period;
+	}
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ *   ++x_{j}, ++t
+ */
+void __prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl)
+{
+	struct prop_global *pg = prop_get_global(pd);
+
+	prop_norm_percpu(pg, pl);
+	percpu_counter_add(&pl->events, 1);
+	percpu_counter_add(&pg->events, 1);
+	prop_put_global(pd, pg);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction_percpu(struct prop_descriptor *pd,
+		struct prop_local_percpu *pl,
+		long *numerator, long *denominator)
+{
+	struct prop_global *pg = prop_get_global(pd);
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm_percpu(pg, pl);
+	*numerator = percpu_counter_read_positive(&pl->events);
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+
+	prop_put_global(pd, pg);
+}
+
+/*
+ * SINGLE
+ */
+
+int prop_local_init_single(struct prop_local_single *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	pl->events = 0;
+	return 0;
+}
+
+void prop_local_destroy_single(struct prop_local_single *pl)
+{
+}
+
+/*
+ * Catch up with missed period expirations.
+ */
+static
+void prop_norm_single(struct prop_global *pg, struct prop_local_single *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(&pl->shift, &pl->period, pg->shift);
+	/*
+	 * For each missed period, we half the local counter.
+	 */
+	period = (global_period - pl->period) >> (pg->shift - 1);
+	if (likely(period < BITS_PER_LONG))
+		pl->events >>= period;
+	else
+		pl->events = 0;
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ *   ++x_{j}, ++t
+ */
+void __prop_inc_single(struct prop_descriptor *pd, struct prop_local_single *pl)
+{
+	struct prop_global *pg = prop_get_global(pd);
+
+	prop_norm_single(pg, pl);
+	pl->events++;
+	percpu_counter_add(&pg->events, 1);
+	prop_put_global(pd, pg);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction_single(struct prop_descriptor *pd,
+	       	struct prop_local_single *pl,
+		long *numerator, long *denominator)
+{
+	struct prop_global *pg = prop_get_global(pd);
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm_single(pg, pl);
+	*numerator = pl->events;
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+
+	prop_put_global(pd, pg);
+}
Index: linux-2.6/include/linux/proportions.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/proportions.h
@@ -0,0 +1,119 @@
+/*
+ * Floating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_PROPORTIONS_H
+#define _LINUX_PROPORTIONS_H
+
+#include <linux/percpu_counter.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+struct prop_global {
+	/*
+	 * The period over which we differentiate
+	 *
+	 *   period = 2^shift
+	 */
+	int shift;
+	/*
+	 * The total event counter aka 'time'.
+	 *
+	 * Treated as an unsigned long; the lower 'shift - 1' bits are the
+	 * counter bits, the remaining upper bits the period counter.
+	 */
+	struct percpu_counter events;
+};
+
+/*
+ * global proportion descriptor
+ *
+ * this is needed to consistently flip prop_global structures.
+ */
+struct prop_descriptor {
+	int index;
+	struct prop_global pg[2];
+	struct mutex mutex;		/* serialize the prop_global switch */
+};
+
+int prop_descriptor_init(struct prop_descriptor *pd, int shift);
+void prop_change_shift(struct prop_descriptor *pd, int new_shift);
+
+/*
+ * ----- PERCPU ------
+ */
+
+struct prop_local_percpu {
+	/*
+	 * the local events counter
+	 */
+	struct percpu_counter events;
+
+	/*
+	 * snapshot of the last seen global state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+int prop_local_init_percpu(struct prop_local_percpu *pl);
+void prop_local_destroy_percpu(struct prop_local_percpu *pl);
+void __prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl);
+void prop_fraction_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl,
+		long *numerator, long *denominator);
+
+static inline
+void prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__prop_inc_percpu(pd, pl);
+	local_irq_restore(flags);
+}
+
+/*
+ * ----- SINGLE ------
+ */
+
+struct prop_local_single {
+	/*
+	 * the local events counter
+	 */
+	unsigned long events;
+
+	/*
+	 * snapshot of the last seen global state
+	 * and a lock protecting this state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+#define INIT_PROP_LOCAL_SINGLE(name)			\
+{	.lock = __SPIN_LOCK_UNLOCKED(name.lock),	\
+}
+
+int prop_local_init_single(struct prop_local_single *pl);
+void prop_local_destroy_single(struct prop_local_single *pl);
+void __prop_inc_single(struct prop_descriptor *pd, struct prop_local_single *pl);
+void prop_fraction_single(struct prop_descriptor *pd, struct prop_local_single *pl,
+		long *numerator, long *denominator);
+
+static inline
+void prop_inc_single(struct prop_descriptor *pd, struct prop_local_single *pl)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__prop_inc_single(pd, pl);
+	local_irq_restore(flags);
+}
+
+#endif /* _LINUX_PROPORTIONS_H */
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -5,7 +5,8 @@
 lib-y := ctype.o string.o vsprintf.o kasprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o \
 	 idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o reciprocal_div.o argv_split.o
+	 sha1.o irq_regs.o reciprocal_div.o argv_split.o \
+	 proportions.o
 
 lib-$(CONFIG_MMU) += ioremap.o pagewalk.o
 lib-$(CONFIG_SMP) += cpumask.o

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 21/23] mm: per device dirty threshold
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (19 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 20/23] lib: floating proportions Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 22/23] mm: dirty balancing for tasks Peter Zijlstra
                   ` (2 subsequent siblings)
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: writeback-balance-per-backing_dev.patch --]
[-- Type: text/plain, Size: 14027 bytes --]

Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds, a number of problems we currently
have will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A BDI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

This is done by keeping a floating proportion between the BDIs, based on
writeback completions. This way, faster/more active devices get a larger
share than slower/idle devices.
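
As a worked example (numbers purely illustrative): with a total dirty
limit of 1000 pages, a BDI whose recent share of writeback completions
is 3/5 gets a per-BDI threshold of 1000 * 3/5 = 600 pages. The clipping
step in the patch then caps that at the number of dirty pages actually
available, so a rapidly fluctuating proportion cannot push the sum of
the per-BDI thresholds over the global limit.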

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

Changes since -v8:

 - explicit usage of prop_local_percpu
 - moved dirty_ratio_handler declaration into a suitable header file

 include/linux/backing-dev.h |    4 
 include/linux/writeback.h   |    4 
 kernel/sysctl.c             |    2 
 mm/backing-dev.c            |   19 +++-
 mm/page-writeback.c         |  202 +++++++++++++++++++++++++++++++++++++-------
 5 files changed, 196 insertions(+), 35 deletions(-)

Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -10,6 +10,7 @@
 
 #include <linux/percpu_counter.h>
 #include <linux/log2.h>
+#include <linux/proportions.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -44,6 +45,9 @@ struct backing_dev_info {
 	void *unplug_io_data;
 
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	struct prop_local_percpu completions;
+	int dirty_exceeded;
 };
 
 int bdi_init(struct backing_dev_info *bdi);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -2,6 +2,7 @@
  * mm/page-writeback.c
  *
  * Copyright (C) 2002, Linus Torvalds.
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
  *
  * Contains functions related to writing back dirty pages at the
  * address_space level.
@@ -49,8 +50,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +102,103 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by keeping a floating proportion between BDIs, based on page
+ * writeback completions [end_page_writeback()]. Those devices that write out
+ * pages fastest will get the larger share, while the slower will get a smaller
+ * share.
+ *
+ * We use page writeout completions because we are interested in getting rid of
+ * dirty pages. Having them written out is the primary goal.
+ *
+ * We introduce a concept of time, a period over which we measure these events,
+ * because demand can/will vary over time. The length of this period itself is
+ * measured in page writeback completions.
+ *
+ */
+static struct prop_descriptor vm_completions;
+
+static unsigned long determine_dirtyable_memory(void);
+
+/*
+ * couple the period to the dirty_ratio:
+ *
+ *   period/2 ~ roundup_pow_of_two(dirty limit)
+ */
+static int calc_period_shift(void)
+{
+	unsigned long dirty_total;
+
+	dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+	return 2 + ilog2(dirty_total - 1);
+}
+
+/*
+ * update the period when the dirty ratio changes.
+ */
+int dirty_ratio_handler(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_ratio = vm_dirty_ratio;
+	int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
+		int shift = calc_period_shift();
+		prop_change_shift(&vm_completions, shift);
+	}
+	return ret;
+}
+
+/*
+ * Increment the BDI's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	__prop_inc_percpu(&vm_completions, &bdi->completions);
+}
+
+/*
+ * Obtain an accurate fraction of the BDI's portion.
+ */
+static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator)
+{
+	if (bdi_cap_writeback_dirty(bdi)) {
+		prop_fraction_percpu(&vm_completions, &bdi->completions,
+				numerator, denominator);
+	} else {
+		*numerator = 0;
+		*denominator = 1;
+	}
+}
+
+/*
+ * Clip the earned share of dirty pages to that which is actually available.
+ * This avoids exceeding the total dirty_limit when the floating averages
+ * fluctuate too quickly.
+ */
+static void
+clip_bdi_dirty_limit(struct backing_dev_info *bdi, long dirty, long *pbdi_dirty)
+{
+	long avail_dirty;
+
+	avail_dirty = dirty -
+		(global_page_state(NR_FILE_DIRTY) +
+		 global_page_state(NR_WRITEBACK) +
+		 global_page_state(NR_UNSTABLE_NFS));
+
+	if (avail_dirty < 0)
+		avail_dirty = 0;
+
+	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
+		bdi_stat(bdi, BDI_WRITEBACK);
+
+	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
+}
+
+/*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
  *
@@ -158,8 +254,8 @@ static unsigned long determine_dirtyable
 }
 
 static void
-get_dirty_limits(long *pbackground, long *pdirty,
-					struct address_space *mapping)
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		 struct backing_dev_info *bdi)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
@@ -193,6 +289,22 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (bdi) {
+		long long bdi_dirty = dirty;
+		long numerator, denominator;
+
+		/*
+		 * Calculate this BDI's share of the dirty ratio.
+		 */
+		bdi_writeout_fraction(bdi, &numerator, &denominator);
+
+		bdi_dirty *= numerator;
+		do_div(bdi_dirty, denominator);
+
+		*pbdi_dirty = bdi_dirty;
+		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+	}
 }
 
 /*
@@ -204,9 +316,11 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long nr_reclaimable;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -221,15 +335,15 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, bdi);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 				break;
 
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
+		if (!bdi->dirty_exceeded)
+			bdi->dirty_exceeded = 1;
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
@@ -237,16 +351,37 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, bdi);
+
+			/*
+			 * In order to avoid the stacked BDI deadlock we need
+			 * to ensure we accurately count the 'dirty' pages when
+			 * the threshold is low.
+			 *
+			 * Otherwise it would be possible to get thresh+n pages
+			 * reported dirty, even though there are thresh-m pages
+			 * actually dirty; with m+n sitting in the percpu
+			 * deltas.
+			 */
+			if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+				bdi_nr_reclaimable =
+					bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat_sum(bdi, BDI_WRITEBACK);
+			} else {
+				bdi_nr_reclaimable =
+					bdi_stat(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat(bdi, BDI_WRITEBACK);
+			}
+
+			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -254,9 +389,9 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
-			dirty_exceeded = 0;
+	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
+			bdi->dirty_exceeded)
+		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
 		return;		/* pdflush is already working this queue */
@@ -270,7 +405,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS)
+					  > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -306,7 +443,7 @@ void balance_dirty_pages_ratelimited_nr(
 	unsigned long *p;
 
 	ratelimit = ratelimit_pages;
-	if (dirty_exceeded)
+	if (mapping->backing_dev_info->dirty_exceeded)
 		ratelimit = 8;
 
 	/*
@@ -342,7 +479,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -377,7 +514,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -580,9 +717,14 @@ static struct notifier_block __cpuinitda
  */
 void __init page_writeback_init(void)
 {
+	int shift;
+
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+
+	shift = calc_period_shift();
+	prop_descriptor_init(&vm_completions, shift);
 }
 
 /**
@@ -988,8 +1130,10 @@ int test_clear_page_writeback(struct pag
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-			if (bdi_cap_writeback_dirty(bdi))
+			if (bdi_cap_writeback_dirty(bdi)) {
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+				__bdi_writeout_inc(bdi);
+			}
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -12,11 +12,17 @@ int bdi_init(struct backing_dev_info *bd
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
-		if (err) {
-			for (j = 0; j < i; j++)
-				percpu_counter_destroy(&bdi->bdi_stat[j]);
-			break;
-		}
+		if (err)
+			goto err;
+	}
+
+	bdi->dirty_exceeded = 0;
+	err = prop_local_init_percpu(&bdi->completions);
+
+	if (err) {
+err:
+		for (j = 0; j < i; j++)
+			percpu_counter_destroy(&bdi->bdi_stat[j]);
 	}
 
 	return err;
@@ -29,6 +35,8 @@ void bdi_destroy(struct backing_dev_info
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+	prop_local_destroy_percpu(&bdi->completions);
 }
 EXPORT_SYMBOL(bdi_destroy);
 
@@ -81,3 +89,4 @@ long congestion_wait(int rw, long timeou
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
+
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -827,7 +827,7 @@ static struct ctl_table vm_table[] = {
 		.data		= &vm_dirty_ratio,
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
+		.proc_handler	= &dirty_ratio_handler,
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h
+++ linux-2.6/include/linux/writeback.h
@@ -104,6 +104,10 @@ extern int dirty_expire_interval;
 extern int block_dump;
 extern int laptop_mode;
 
+extern int dirty_ratio_handler(struct ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 struct ctl_table;
 struct file;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int, struct file *,

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 22/23] mm: dirty balancing for tasks
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (20 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 21/23] mm: per device dirty threshold Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16  7:45 ` [PATCH 23/23] debug: sysfs files for the current ratio/size/total Peter Zijlstra
  2007-08-16 21:29 ` [PATCH 00/23] per device dirty throttling -v9 Christoph Lameter
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: dirty_pages2.patch --]
[-- Type: text/plain, Size: 5673 bytes --]

Based on ideas of Andrew:
  http://marc.info/?l=linux-kernel&m=102912915020543&w=2

Scale the bdi dirty limit inversely with the task's dirty rate.
This gives heavy writers a lower dirty limit than the occasional writer.

Andrea proposed something similar:
  http://lwn.net/Articles/152277/

The main disadvantage of his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload-dependent tunable. Other than
that, the two approaches appear quite similar.
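
For a feel of the numbers (illustrative only, using the 1/8 ratio and the
clamp from the code below): with a bdi dirty limit of 800 pages, a task
responsible for half of the recently dirtied pages ends up with
800 - (800/8)*1/2 = 750 pages, a task responsible for all of them with
800 - 800/8 = 700, and the result is never clamped below 800/2 = 400.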

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---

Changes since -v8:

 - initialized init_task
 - moved the prop_local init after the task_struct copy
 - changed the per task ratio to 1/8 (from 1/2).
 - explicit usage of prop_local_single

 include/linux/init_task.h |    1 
 include/linux/sched.h     |    2 +
 kernel/fork.c             |   10 +++++++++
 mm/page-writeback.c       |   50 +++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 62 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -84,6 +84,7 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/proportions.h>
 
 #include <asm/processor.h>
 
@@ -1125,6 +1126,7 @@ struct task_struct {
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
+	struct prop_local_single dirties;
 };
 
 /*
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -107,6 +107,7 @@ static struct kmem_cache *mm_cachep;
 
 void free_task(struct task_struct *tsk)
 {
+	prop_local_destroy_single(&tsk->dirties);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	free_task_struct(tsk);
@@ -163,6 +164,7 @@ static struct task_struct *dup_task_stru
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	int err;
 
 	prepare_to_copy(orig);
 
@@ -178,6 +180,14 @@ static struct task_struct *dup_task_stru
 
 	*tsk = *orig;
 	tsk->stack = ti;
+
+	err = prop_local_init_single(&tsk->dirties);
+	if (err) {
+		free_thread_info(ti);
+		free_task_struct(tsk);
+		return NULL;
+	}
+
 	setup_thread_stack(tsk, orig);
 
 #ifdef CONFIG_CC_STACKPROTECTOR
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -118,6 +118,7 @@ static void background_writeout(unsigned
  *
  */
 static struct prop_descriptor vm_completions;
+static struct prop_descriptor vm_dirties;
 
 static unsigned long determine_dirtyable_memory(void);
 
@@ -146,6 +147,7 @@ int dirty_ratio_handler(ctl_table *table
 	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
 		int shift = calc_period_shift();
 		prop_change_shift(&vm_completions, shift);
+		prop_change_shift(&vm_dirties, shift);
 	}
 	return ret;
 }
@@ -159,6 +161,11 @@ static inline void __bdi_writeout_inc(st
 	__prop_inc_percpu(&vm_completions, &bdi->completions);
 }
 
+static inline void task_dirty_inc(struct task_struct *tsk)
+{
+	prop_inc_single(&vm_dirties, &tsk->dirties);
+}
+
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */
@@ -198,6 +205,37 @@ clip_bdi_dirty_limit(struct backing_dev_
 	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
 }
 
+static inline void task_dirties_fraction(struct task_struct *tsk,
+		long *numerator, long *denominator)
+{
+	prop_fraction_single(&vm_dirties, &tsk->dirties,
+				numerator, denominator);
+}
+
+/*
+ * scale the dirty limit
+ *
+ * task specific dirty limit:
+ *
+ *   dirty -= (dirty/8) * p_{t}
+ */
+void task_dirty_limit(struct task_struct *tsk, long *pdirty)
+{
+	long numerator, denominator;
+	long dirty = *pdirty;
+	long long inv = dirty >> 3;
+
+	task_dirties_fraction(tsk, &numerator, &denominator);
+	inv *= numerator;
+	do_div(inv, denominator);
+
+	dirty -= inv;
+	if (dirty < *pdirty/2)
+		dirty = *pdirty/2;
+
+	*pdirty = dirty;
+}
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -304,6 +342,7 @@ get_dirty_limits(long *pbackground, long
 
 		*pbdi_dirty = bdi_dirty;
 		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+		task_dirty_limit(current, pbdi_dirty);
 	}
 }
 
@@ -725,6 +764,7 @@ void __init page_writeback_init(void)
 
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
+	prop_descriptor_init(&vm_dirties, shift);
 }
 
 /**
@@ -1003,7 +1043,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
  */
-int fastcall set_page_dirty(struct page *page)
+static int __set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
@@ -1021,6 +1061,14 @@ int fastcall set_page_dirty(struct page 
 	}
 	return 0;
 }
+
+int fastcall set_page_dirty(struct page *page)
+{
+	int ret = __set_page_dirty(page);
+	if (ret)
+		task_dirty_inc(current);
+	return ret;
+}
 EXPORT_SYMBOL(set_page_dirty);
 
 /*
Index: linux-2.6/include/linux/init_task.h
===================================================================
--- linux-2.6.orig/include/linux/init_task.h
+++ linux-2.6/include/linux/init_task.h
@@ -169,6 +169,7 @@ extern struct group_info init_groups;
 		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),		\
 		[PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),		\
 	},								\
+	.dirties = INIT_PROP_LOCAL_SINGLE(dirties),			\
 	INIT_TRACE_IRQFLAGS						\
 	INIT_LOCKDEP							\
 }

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 23/23] debug: sysfs files for the current ratio/size/total
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (21 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 22/23] mm: dirty balancing for tasks Peter Zijlstra
@ 2007-08-16  7:45 ` Peter Zijlstra
  2007-08-16 21:29 ` [PATCH 00/23] per device dirty throttling -v9 Christoph Lameter
  23 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16  7:45 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: miklos, akpm, neilb, dgc, tomoki.sekiyama.qu, a.p.zijlstra,
	nikita, trond.myklebust, yingchao.zhou, richard, torvalds

[-- Attachment #1: bdi_stat_debug.patch --]
[-- Type: text/plain, Size: 4216 bytes --]

Expose the per bdi dirty limits in sysfs
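
With this applied the files show up per request queue; paths assumed from
the usual queue sysfs location plus the attribute names below:

  /sys/block/<dev>/queue/cache_ratio  - bdi share of the dirty limit, scaled to 0..1024
  /sys/block/<dev>/queue/cache_num    - raw numerator of that fraction
  /sys/block/<dev>/queue/cache_denom  - raw denominator of that fraction
  /sys/block/<dev>/queue/cache_size   - current bdi dirty limit, in pages
  /sys/block/<dev>/queue/cache_total  - current total dirty limit, in pages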

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c   |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/page-writeback.c |    4 +-
 2 files changed, 82 insertions(+), 2 deletions(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -4000,6 +4000,56 @@ static ssize_t queue_nr_writeback_show(s
 			nr_writeback >> (PAGE_CACHE_SHIFT - 10));
 }
 
+extern void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator);
+
+static ssize_t queue_nr_cache_ratio_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+	scale *= 1024;
+	scale /= div;
+
+	return sprintf(page, "%ld\n", scale);
+}
+
+static ssize_t queue_nr_cache_num_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+
+	return sprintf(page, "%ld\n", scale);
+}
+
+static ssize_t queue_nr_cache_denom_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+
+	return sprintf(page, "%ld\n", div);
+}
+
+extern void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		struct backing_dev_info *bdi);
+
+static ssize_t queue_nr_cache_size_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", bdi_dirty);
+}
+
+static ssize_t queue_nr_cache_total_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", dirty);
+}
+
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
 	.show = queue_requests_show,
@@ -4033,6 +4083,31 @@ static struct queue_sysfs_entry queue_wr
 	.show = queue_nr_writeback_show,
 };
 
+static struct queue_sysfs_entry queue_cache_ratio_entry = {
+	.attr = {.name = "cache_ratio", .mode = S_IRUGO },
+	.show = queue_nr_cache_ratio_show,
+};
+
+static struct queue_sysfs_entry queue_cache_num_entry = {
+	.attr = {.name = "cache_num", .mode = S_IRUGO },
+	.show = queue_nr_cache_num_show,
+};
+
+static struct queue_sysfs_entry queue_cache_denom_entry = {
+	.attr = {.name = "cache_denom", .mode = S_IRUGO },
+	.show = queue_nr_cache_denom_show,
+};
+
+static struct queue_sysfs_entry queue_cache_size_entry = {
+	.attr = {.name = "cache_size", .mode = S_IRUGO },
+	.show = queue_nr_cache_size_show,
+};
+
+static struct queue_sysfs_entry queue_cache_total_entry = {
+	.attr = {.name = "cache_total", .mode = S_IRUGO },
+	.show = queue_nr_cache_total_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4046,6 +4121,11 @@ static struct attribute *default_attrs[]
 	&queue_max_sectors_entry.attr,
 	&queue_reclaimable_entry.attr,
 	&queue_writeback_entry.attr,
+	&queue_cache_ratio_entry.attr,
+	&queue_cache_num_entry.attr,
+	&queue_cache_denom_entry.attr,
+	&queue_cache_size_entry.attr,
+	&queue_cache_total_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -169,7 +169,7 @@ static inline void task_dirty_inc(struct
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+void bdi_writeout_fraction(struct backing_dev_info *bdi,
 		long *numerator, long *denominator)
 {
 	if (bdi_cap_writeback_dirty(bdi)) {
@@ -291,7 +291,7 @@ static unsigned long determine_dirtyable
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-static void
+void
 get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
 		 struct backing_dev_info *bdi)
 {

--


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v9
  2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
                   ` (22 preceding siblings ...)
  2007-08-16  7:45 ` [PATCH 23/23] debug: sysfs files for the current ratio/size/total Peter Zijlstra
@ 2007-08-16 21:29 ` Christoph Lameter
  2007-08-17  7:19   ` Peter Zijlstra
  23 siblings, 1 reply; 43+ messages in thread
From: Christoph Lameter @ 2007-08-16 21:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds, pj

Is there any way to make the global limits on which the dirty rate 
calculations are based cpuset specific?

A process is part of a cpuset and that cpuset has only a fraction of 
memory of the whole system. 

And only a fraction of that fraction can be dirtied. We do not currently 
enforce such limits which can cause the amount of dirty pages in 
cpusets to become excessively high. I have posted several patchsets that 
deal with that issue. See http://lkml.org/lkml/2007/1/16/5

It seems that limiting dirty pages in cpusets may be much easier to 
realize in the context of this patchset. The tracking of the dirty pages 
per node is not necessary if one would calculate the maximum amount of 
dirtyable pages in a cpuset and use that as a base, right?





^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v9
  2007-08-16 21:29 ` [PATCH 00/23] per device dirty throttling -v9 Christoph Lameter
@ 2007-08-17  7:19   ` Peter Zijlstra
  2007-08-17 20:37     ` Christoph Lameter
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-17  7:19 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds, pj

[-- Attachment #1: Type: text/plain, Size: 1694 bytes --]

On Thu, 2007-08-16 at 14:29 -0700, Christoph Lameter wrote:
> Is there any way to make the global limits on which the dirty rate 
> calculations are based cpuset specific?
> 
> A process is part of a cpuset and that cpuset has only a fraction of 
> memory of the whole system. 
> 
> And only a fraction of that fraction can be dirtied. We do not currently 
> enforce such limits which can cause the amount of dirty pages in 
> cpusets to become excessively high. I have posted several patchsets that 
> deal with that issue. See http://lkml.org/lkml/2007/1/16/5
> 
> It seems that limiting dirty pages in cpusets may be much easier to 
> realize in the context of this patchset. The tracking of the dirty pages 
> per node is not necessary if one would calculate the maximum amount of 
> dirtyable pages in a cpuset and use that as a base, right?


Currently we do: 
  dirty = total_dirty * bdi_completions_p * task_dirty_p

As dgc pointed out before, there is the issue of bdi/task correlation,
that is, we do not track task dirty rates per bdi, so now a task that
heavily dirties on one bdi will also get penalised on the others (and
similar issues).

If we were to change it so:
  dirty = cpuset_dirty * bdi_completions_p * task_dirty_p

We get additional correlation issues: cpuset/bdi, cpuset/task.
Which could yield surprising results if some bdis are strictly per
cpuset.

The cpuset/task correlation has a strict mapping and could be solved by
keeping the vm_dirties counter per cpuset. However, this would seriously
complicate the code and I'm not sure if it would gain us much.

Anyway, things to ponder. But overall it should be quite doable.
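
To make that factorisation concrete, a minimal sketch (not part of the
patches; cpuset_dirty_limit() is a hypothetical helper standing in for a
per-cpuset base, the other two helpers are from the series, and overflow
handling is omitted):

	static long scaled_dirty_limit(struct task_struct *tsk,
				       struct backing_dev_info *bdi)
	{
		long numerator, denominator;
		long dirty = cpuset_dirty_limit(tsk);	/* instead of total_dirty */

		/* dirty *= bdi_completions_p */
		bdi_writeout_fraction(bdi, &numerator, &denominator);
		dirty = dirty * numerator / denominator;

		/* dirty *= task_dirty_p (with the 1/8 scaling and clamp) */
		task_dirty_limit(tsk, &dirty);

		return dirty;
	}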


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 02/23] lib: percpu_counter_add
  2007-08-16  7:45 ` [PATCH 02/23] lib: percpu_counter_add Peter Zijlstra
@ 2007-08-17 15:48   ` Josef Sipek
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Sipek @ 2007-08-17 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, Aug 16, 2007 at 09:45:27AM +0200, Peter Zijlstra wrote:
...
> Index: linux-2.6/fs/ext2/balloc.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/balloc.c
> +++ linux-2.6/fs/ext2/balloc.c
> @@ -163,7 +163,7 @@ static int reserve_blocks(struct super_b
>  			return 0;
>  	}
>  
> -	percpu_counter_mod(&sbi->s_freeblocks_counter, -count);
> +	percpu_counter_add(&sbi->s_freeblocks_counter, -count);

Out of curiosity, I noticed similar thing being done in the vm code, what is
preferred:

	foobar_add(&counter, -num);

or

	foobar_sub(&counter, num);

?
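
(For what it's worth, the two should be interchangeable; assuming
percpu_counter_sub() is the trivial wrapper it appears to be in this
series, roughly:

	static inline void percpu_counter_sub(struct percpu_counter *fbc, s64 amount)
	{
		percpu_counter_add(fbc, -amount);
	}

so it mostly comes down to which reads better at the call site.)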

Josef 'Jeff' Sipek.

-- 
Research, n.:
  Consider Columbus:
    He didn't know where he was going.
    When he got there he didn't know where he was.
    When he got back he didn't know where he had been.
    And he did it all on someone else's money.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/23] lib: percpu_counter_init error handling
  2007-08-16  7:45 ` [PATCH 09/23] lib: percpu_counter_init error handling Peter Zijlstra
@ 2007-08-17 15:56   ` Josef Sipek
  2007-08-17 16:03     ` Peter Zijlstra
  2007-08-18  8:09     ` Peter Zijlstra
  0 siblings, 2 replies; 43+ messages in thread
From: Josef Sipek @ 2007-08-17 15:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, Aug 16, 2007 at 09:45:34AM +0200, Peter Zijlstra wrote:
> alloc_percpu can fail, propagate that error.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  fs/ext2/super.c                |   11 ++++++++---
>  fs/ext3/super.c                |   11 ++++++++---
>  fs/ext4/super.c                |   11 ++++++++---
>  include/linux/percpu_counter.h |    5 +++--
>  lib/percpu_counter.c           |    8 +++++++-
>  5 files changed, 34 insertions(+), 12 deletions(-)
> 
> Index: linux-2.6/fs/ext2/super.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/super.c
> +++ linux-2.6/fs/ext2/super.c
> @@ -725,6 +725,7 @@ static int ext2_fill_super(struct super_
>  	int db_count;
>  	int i, j;
>  	__le32 features;
> +	int err;
>  
>  	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
>  	if (!sbi)
> @@ -996,12 +997,16 @@ static int ext2_fill_super(struct super_
>  	sbi->s_rsv_window_head.rsv_goal_size = 0;
>  	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
>  
> -	percpu_counter_init(&sbi->s_freeblocks_counter,
> +	err = percpu_counter_init(&sbi->s_freeblocks_counter,
>  				ext2_count_free_blocks(sb));
> -	percpu_counter_init(&sbi->s_freeinodes_counter,
> +	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
>  				ext2_count_free_inodes(sb));
> -	percpu_counter_init(&sbi->s_dirs_counter,
> +	err |= percpu_counter_init(&sbi->s_dirs_counter,
>  				ext2_count_dirs(sb));
> +	if (err) {
> +		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
> +		goto failed_mount3;
> +	}

Can percpu_counter_init fail with only one error code? If not, the error
code potentially used in future at failed_mount3 could be nonsensical
because of the bitwise or-ing.
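
(Concretely, if a second error code ever became possible, the or-ing would
mangle it; assuming two plain errno values:

	int err = -ENOMEM;	/* 0xfffffff4 */
	err |= -EIO;		/* 0xfffffffb; err becomes 0xffffffff == -EPERM */

so the combined value is only meaningful as a zero/non-zero test.)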

> Index: linux-2.6/lib/percpu_counter.c
> ===================================================================
> --- linux-2.6.orig/lib/percpu_counter.c
> +++ linux-2.6/lib/percpu_counter.c
> @@ -68,21 +68,27 @@ s64 __percpu_counter_sum(struct percpu_c
>  }
>  EXPORT_SYMBOL(__percpu_counter_sum);
>  
> -void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
> +int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
>  {
>  	spin_lock_init(&fbc->lock);
>  	fbc->count = amount;
>  	fbc->counters = alloc_percpu(s32);
> +	if (!fbc->counters)
> +		return -ENOMEM;
>  #ifdef CONFIG_HOTPLUG_CPU
>  	mutex_lock(&percpu_counters_lock);
>  	list_add(&fbc->list, &percpu_counters);
>  	mutex_unlock(&percpu_counters_lock);
>  #endif
> +	return 0;
>  }

I guess this answers my question. But I'd still be wary because a trivial
change here could produce very strange error codes in ext2/3/4.

Josef 'Jeff' Sipek.

-- 
Once you have their hardware. Never give it back.
(The First Rule of Hardware Acquisition)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/23] lib: percpu_counter_init error handling
  2007-08-17 15:56   ` Josef Sipek
@ 2007-08-17 16:03     ` Peter Zijlstra
  2007-08-18  8:09     ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-17 16:03 UTC (permalink / raw)
  To: Josef Sipek
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

[-- Attachment #1: Type: text/plain, Size: 1170 bytes --]

On Fri, 2007-08-17 at 11:56 -0400, Josef Sipek wrote:
> On Thu, Aug 16, 2007 at 09:45:34AM +0200, Peter Zijlstra wrote:
> )
> > @@ -996,12 +997,16 @@ static int ext2_fill_super(struct super_
> >  	sbi->s_rsv_window_head.rsv_goal_size = 0;
> >  	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
> >  
> > -	percpu_counter_init(&sbi->s_freeblocks_counter,
> > +	err = percpu_counter_init(&sbi->s_freeblocks_counter,
> >  				ext2_count_free_blocks(sb));
> > -	percpu_counter_init(&sbi->s_freeinodes_counter,
> > +	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
> >  				ext2_count_free_inodes(sb));
> > -	percpu_counter_init(&sbi->s_dirs_counter,
> > +	err |= percpu_counter_init(&sbi->s_dirs_counter,
> >  				ext2_count_dirs(sb));
> > +	if (err) {
> > +		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
> > +		goto failed_mount3;
> > +	}
> 
> Can percpu_counter_init fail with only one error code? If not, the error
> code potentially used in future at failed_mount3 could be nonsensical
> because of the bitwise or-ing.

I guess I could have written saner code :-/ will try to come up with
something that is both clear and short.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/23] mm: bdi init hooks
  2007-08-16  7:45 ` [PATCH 11/23] mm: bdi init hooks Peter Zijlstra
@ 2007-08-17 16:10   ` Josef Sipek
  2007-08-17 16:15     ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Sipek @ 2007-08-17 16:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, Aug 16, 2007 at 09:45:36AM +0200, Peter Zijlstra wrote:
> provide BDI constructor/destructor hooks
...
> Index: linux-2.6/drivers/block/rd.c
> ===================================================================
> --- linux-2.6.orig/drivers/block/rd.c
> +++ linux-2.6/drivers/block/rd.c
...
> @@ -419,7 +422,19 @@ static void __exit rd_cleanup(void)
>  static int __init rd_init(void)
>  {
>  	int i;
> -	int err = -ENOMEM;
> +	int err;
> +
> +	err = bdi_init(&rd_backing_dev_info);
> +	if (err)
> +		goto out2;
> +
> +	err = bdi_init(&rd_file_backing_dev_info);
> +	if (err) {
> +		bdi_destroy(&rd_backing_dev_info);
> +		goto out2;

How about this...

if (err)
	goto out3;

> +	}
> +
> +	err = -ENOMEM;
>  
>  	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
>  			(rd_blocksize & (rd_blocksize-1))) {
> @@ -473,6 +488,9 @@ out:
>  		put_disk(rd_disks[i]);
>  		blk_cleanup_queue(rd_queue[i]);
>  	}
> +	bdi_destroy(&rd_backing_dev_info);
> +	bdi_destroy(&rd_file_backing_dev_info);

	bdi_destroy(&rd_file_backing_dev_info);
out3:
	bdi_destroy(&rd_backing_dev_info);

Sure, you might want to switch from numbered labels to something a bit more
descriptive.

> +out2:
>  	return err;
>  }
>  

Josef 'Jeff' Sipek.

-- 
The box said "Windows XP or better required". So I installed Linux.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 11/23] mm: bdi init hooks
  2007-08-17 16:10   ` Josef Sipek
@ 2007-08-17 16:15     ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-17 16:15 UTC (permalink / raw)
  To: Josef Sipek
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

[-- Attachment #1: Type: text/plain, Size: 1522 bytes --]

On Fri, 2007-08-17 at 12:10 -0400, Josef Sipek wrote:
> On Thu, Aug 16, 2007 at 09:45:36AM +0200, Peter Zijlstra wrote:
> > provide BDI constructor/destructor hooks
> ....
> > Index: linux-2.6/drivers/block/rd.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/block/rd.c
> > +++ linux-2.6/drivers/block/rd.c
> ....
> > @@ -419,7 +422,19 @@ static void __exit rd_cleanup(void)
> >  static int __init rd_init(void)
> >  {
> >  	int i;
> > -	int err = -ENOMEM;
> > +	int err;
> > +
> > +	err = bdi_init(&rd_backing_dev_info);
> > +	if (err)
> > +		goto out2;
> > +
> > +	err = bdi_init(&rd_file_backing_dev_info);
> > +	if (err) {
> > +		bdi_destroy(&rd_backing_dev_info);
> > +		goto out2;
> 
> How about this...

seems like a sane idea.

> if (err)
> 	goto out3;
> 
> > +	}
> > +
> > +	err = -ENOMEM;
> >  
> >  	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
> >  			(rd_blocksize & (rd_blocksize-1))) {
> > @@ -473,6 +488,9 @@ out:
> >  		put_disk(rd_disks[i]);
> >  		blk_cleanup_queue(rd_queue[i]);
> >  	}
> > +	bdi_destroy(&rd_backing_dev_info);
> > +	bdi_destroy(&rd_file_backing_dev_info);
> 
> 	bdi_destroy(&rd_file_backing_dev_info);
> out3:
> 	bdi_destroy(&rd_backing_dev_info);
> 
> Sure you might want to switch from numbered labels to something a bit more
> descriptive.

I was just keeping in style here.

Thanks for looking this over; all these error paths did make my head
spin a little.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 16/23] mm: scalable bdi statistics counters.
  2007-08-16  7:45 ` [PATCH 16/23] mm: scalable bdi statistics counters Peter Zijlstra
@ 2007-08-17 16:20   ` Josef Sipek
  2007-08-17 16:23     ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Josef Sipek @ 2007-08-17 16:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, Aug 16, 2007 at 09:45:41AM +0200, Peter Zijlstra wrote:
...
> Index: linux-2.6/include/linux/backing-dev.h
> ===================================================================
> --- linux-2.6.orig/include/linux/backing-dev.h
> +++ linux-2.6/include/linux/backing-dev.h
...
> @@ -24,6 +26,12 @@ enum bdi_state {
>  
>  typedef int (congested_fn)(void *, int);
>  
> +enum bdi_stat_item {
> +	NR_BDI_STAT_ITEMS
> +};

enum numbering starts at 0, so NR_BDI_STAT_ITEMS == 0

> +
> +#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
> +
>  struct backing_dev_info {
>  	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
>  	unsigned long state;	/* Always use atomic bitops on this */
> @@ -32,15 +40,86 @@ struct backing_dev_info {
>  	void *congested_data;	/* Pointer to aux data for congested func */
>  	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
>  	void *unplug_io_data;
> +
> +	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];

So, this is a 0-element array.

>  };
>  
> -static inline int bdi_init(struct backing_dev_info *bdi)
> +int bdi_init(struct backing_dev_info *bdi);
> +void bdi_destroy(struct backing_dev_info *bdi);
> +
> +static inline void __add_bdi_stat(struct backing_dev_info *bdi,
> +		enum bdi_stat_item item, s64 amount)
>  {
> -	return 0;
> +	__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);

Boom!

>  }

Josef 'Jeff' Sipek.

-- 
You measure democracy by the freedom it gives its dissidents, not the
freedom it gives its assimilated conformists.
		- Abbie Hoffman

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 16/23] mm: scalable bdi statistics counters.
  2007-08-17 16:20   ` Josef Sipek
@ 2007-08-17 16:23     ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-17 16:23 UTC (permalink / raw)
  To: Josef Sipek
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

[-- Attachment #1: Type: text/plain, Size: 1809 bytes --]

On Fri, 2007-08-17 at 12:20 -0400, Josef Sipek wrote:
> On Thu, Aug 16, 2007 at 09:45:41AM +0200, Peter Zijlstra wrote:
> ....
> > Index: linux-2.6/include/linux/backing-dev.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/backing-dev.h
> > +++ linux-2.6/include/linux/backing-dev.h
> ....
> > @@ -24,6 +26,12 @@ enum bdi_state {
> >  
> >  typedef int (congested_fn)(void *, int);
> >  
> > +enum bdi_stat_item {
> > +	NR_BDI_STAT_ITEMS
> > +};
> 
> enum numbering starts at 0, so NR_BDI_STAT_ITEMS == 0
> 
> > +
> > +#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
> > +
> >  struct backing_dev_info {
> >  	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
> >  	unsigned long state;	/* Always use atomic bitops on this */
> > @@ -32,15 +40,86 @@ struct backing_dev_info {
> >  	void *congested_data;	/* Pointer to aux data for congested func */
> >  	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
> >  	void *unplug_io_data;
> > +
> > +	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
> 
> So, this is a 0-element array.
> 
> >  };
> >  
> > -static inline int bdi_init(struct backing_dev_info *bdi)
> > +int bdi_init(struct backing_dev_info *bdi);
> > +void bdi_destroy(struct backing_dev_info *bdi);
> > +
> > +static inline void __add_bdi_stat(struct backing_dev_info *bdi,
> > +		enum bdi_stat_item item, s64 amount)
> >  {
> > -	return 0;
> > +	__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
> 
> Boom!
> 
> >  }

Quite so, but since there are no callers _yet_ it will not go boom :-)

This patch introduces the framework; patches 17 and 18 will introduce both
stat items and callers.

So it should all work out just fine.
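
(After patch 17 the enum from this patch gains its first member:

	enum bdi_stat_item {
		BDI_RECLAIMABLE,	/* == 0 */
		NR_BDI_STAT_ITEMS	/* == 1, the array is no longer empty */
	};

and patch 18 adds BDI_WRITEBACK the same way.)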

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 17/23] mm: count reclaimable pages per BDI
  2007-08-16  7:45 ` [PATCH 17/23] mm: count reclaimable pages per BDI Peter Zijlstra
@ 2007-08-17 16:23   ` Josef Sipek
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Sipek @ 2007-08-17 16:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Thu, Aug 16, 2007 at 09:45:42AM +0200, Peter Zijlstra wrote:
...
> Index: linux-2.6/include/linux/backing-dev.h
> ===================================================================
> --- linux-2.6.orig/include/linux/backing-dev.h
> +++ linux-2.6/include/linux/backing-dev.h
> @@ -27,6 +27,7 @@ enum bdi_state {
>  typedef int (congested_fn)(void *, int);
>  
>  enum bdi_stat_item {
> +	BDI_RECLAIMABLE,
>  	NR_BDI_STAT_ITEMS
>  };

Ok, I see. Ignore my comment on 16/xx :)

Jeff.

-- 
Keyboard not found!
Press F1 to enter Setup

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 00/23] per device dirty throttling -v9
  2007-08-17  7:19   ` Peter Zijlstra
@ 2007-08-17 20:37     ` Christoph Lameter
  0 siblings, 0 replies; 43+ messages in thread
From: Christoph Lameter @ 2007-08-17 20:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds, pj

On Fri, 17 Aug 2007, Peter Zijlstra wrote:

> Currently we do: 
>   dirty = total_dirty * bdi_completions_p * task_dirty_p
> 
> As dgc pointed out before, there is the issue of bdi/task correlation,
> that is, we do not track task dirty rates per bdi, so now a task that
> heavily dirties on one bdi will also get penalised on the others (and
> similar issues).

I think that is tolerable.
> 
> If we were to change it so:
>   dirty = cpuset_dirty * bdi_completions_p * task_dirty_p
> 
> We get additional correlation issues: cpuset/bdi, cpuset/task.
> Which could yield surprising results if some bdis are strictly per
> cpuset.

If we do not do the above, then the dirty page calculation for a small
cpuset (e.g. 1 node of a 128-node system) could allow an amount of dirty
pages that fills up the whole node.

> The cpuset/task correlation has a strict mapping and could be solved by
> keeping the vm_dirties counter per cpuset. However, this would seriously
> complicate the code and I'm not sure if it would gain us much.

The patchset that I referred to has code to calculate the dirty count and 
ratio per cpuset by looping over the nodes. Currently we are having 
trouble with small cpusets not performing writeout correctly. This may
sometimes result in OOM conditions because the whole node is full of
dirty pages. If the cpuset boundaries are enforced strictly, then the
application may fail with an OOM.

We can compensate by recalculating the dirty_ratio based on the smallest
cpuset, but then larger cpusets are penalized. Also, one cannot set the
dirty_ratio below a certain minimum.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 09/23] lib: percpu_counter_init error handling
  2007-08-17 15:56   ` Josef Sipek
  2007-08-17 16:03     ` Peter Zijlstra
@ 2007-08-18  8:09     ` Peter Zijlstra
  2007-08-23 18:24       ` Josef Sipek
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-18  8:09 UTC (permalink / raw)
  To: Josef Sipek
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

[-- Attachment #1: Type: text/plain, Size: 7713 bytes --]

On Fri, 2007-08-17 at 11:56 -0400, Josef Sipek wrote:
> On Thu, Aug 16, 2007 at 09:45:34AM +0200, Peter Zijlstra wrote:

> > Index: linux-2.6/fs/ext2/super.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext2/super.c
> > +++ linux-2.6/fs/ext2/super.c
> > @@ -725,6 +725,7 @@ static int ext2_fill_super(struct super_
> >  	int db_count;
> >  	int i, j;
> >  	__le32 features;
> > +	int err;
> >  
> >  	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
> >  	if (!sbi)
> > @@ -996,12 +997,16 @@ static int ext2_fill_super(struct super_
> >  	sbi->s_rsv_window_head.rsv_goal_size = 0;
> >  	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
> >  
> > -	percpu_counter_init(&sbi->s_freeblocks_counter,
> > +	err = percpu_counter_init(&sbi->s_freeblocks_counter,
> >  				ext2_count_free_blocks(sb));
> > -	percpu_counter_init(&sbi->s_freeinodes_counter,
> > +	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
> >  				ext2_count_free_inodes(sb));
> > -	percpu_counter_init(&sbi->s_dirs_counter,
> > +	err |= percpu_counter_init(&sbi->s_dirs_counter,
> >  				ext2_count_dirs(sb));
> > +	if (err) {
> > +		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
> > +		goto failed_mount3;
> > +	}
> 
> Can percpu_counter_init fail with only one error code? If not, the error
> code potentially used in future at failed_mount3 could be nonsensical
> because of the bitwise or-ing.

The actual value of err is irrelevant; it is not used after this non-zero
check.

But how about this:
---
Subject: lib: percpu_counter_init error handling

alloc_percpu can fail, propagate that error.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/ext2/super.c                |   15 ++++++++++++---
 fs/ext3/super.c                |   21 +++++++++++++++------
 fs/ext4/super.c                |   21 +++++++++++++++------
 include/linux/percpu_counter.h |    5 +++--
 lib/percpu_counter.c           |    8 +++++++-
 5 files changed, 52 insertions(+), 18 deletions(-)

Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -725,6 +725,7 @@ static int ext2_fill_super(struct super_
 	int db_count;
 	int i, j;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -996,12 +997,20 @@ static int ext2_fill_super(struct super_
 	sbi->s_rsv_window_head.rsv_goal_size = 0;
 	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 				ext2_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_freeinodes_counter,
 				ext2_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	}
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_dirs_counter,
 				ext2_count_dirs(sb));
+	}
+	if (err) {
+		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 	/*
 	 * set up enough so that it can read an inode
 	 */
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -1485,6 +1485,7 @@ static int ext3_fill_super (struct super
 	int i;
 	int needs_recovery;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1745,12 +1746,20 @@ static int ext3_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
-		ext3_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
-		ext3_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
-		ext3_count_dirs(sb));
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
+			ext3_count_free_blocks(sb));
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_freeinodes_counter,
+				ext3_count_free_inodes(sb));
+	}
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_dirs_counter,
+				ext3_count_dirs(sb));
+	}
+	if (err) {
+		printk(KERN_ERR "EXT3-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -1576,6 +1576,7 @@ static int ext4_fill_super (struct super
 	int needs_recovery;
 	__le32 features;
 	__u64 blocks_count;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1857,12 +1858,20 @@ static int ext4_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
-		ext4_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
-		ext4_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
-		ext4_count_dirs(sb));
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
+			ext4_count_free_blocks(sb));
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_freeinodes_counter,
+				ext4_count_free_inodes(sb));
+	}
+	if (!err) {
+		err = percpu_counter_init(&sbi->s_dirs_counter,
+				ext4_count_dirs(sb));
+	}
+	if (err) {
+		printk(KERN_ERR "EXT4-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -30,7 +30,7 @@ struct percpu_counter {
 #define FBC_BATCH	(NR_CPUS*4)
 #endif
 
-void percpu_counter_init(struct percpu_counter *fbc, s64 amount);
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount);
 void percpu_counter_destroy(struct percpu_counter *fbc);
 void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
@@ -78,9 +78,10 @@ struct percpu_counter {
 	s64 count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+static inline int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	fbc->count = amount;
+	return 0;
 }
 
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -68,21 +68,27 @@ s64 __percpu_counter_sum(struct percpu_c
 }
 EXPORT_SYMBOL(__percpu_counter_sum);
 
-void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
 	fbc->count = amount;
 	fbc->counters = alloc_percpu(s32);
+	if (!fbc->counters)
+		return -ENOMEM;
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);
 	list_add(&fbc->list, &percpu_counters);
 	mutex_unlock(&percpu_counters_lock);
 #endif
+	return 0;
 }
 EXPORT_SYMBOL(percpu_counter_init);
 
 void percpu_counter_destroy(struct percpu_counter *fbc)
 {
+	if (!fbc->counters)
+		return;
+
 	free_percpu(fbc->counters);
 #ifdef CONFIG_HOTPLUG_CPU
 	mutex_lock(&percpu_counters_lock);


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 09/23] lib: percpu_counter_init error handling
  2007-08-18  8:09     ` Peter Zijlstra
@ 2007-08-23 18:24       ` Josef Sipek
  0 siblings, 0 replies; 43+ messages in thread
From: Josef Sipek @ 2007-08-23 18:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mm, linux-kernel, miklos, akpm, neilb, dgc,
	tomoki.sekiyama.qu, nikita, trond.myklebust, yingchao.zhou,
	richard, torvalds

On Sat, Aug 18, 2007 at 10:09:34AM +0200, Peter Zijlstra wrote:
> On Fri, 2007-08-17 at 11:56 -0400, Josef Sipek wrote:
> > On Thu, Aug 16, 2007 at 09:45:34AM +0200, Peter Zijlstra wrote:
 
Sorry...this mail got lost in the flood of email after a procmail rule
stopped working...

> The actual value of err is irrelevant, it is not used after this not
> zero check.
> 
> But how about this:
> ---
> Subject: lib: percpu_counter_init error handling
> 
> alloc_percpu can fail, propagate that error.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  fs/ext2/super.c                |   15 ++++++++++++---
>  fs/ext3/super.c                |   21 +++++++++++++++------
>  fs/ext4/super.c                |   21 +++++++++++++++------
>  include/linux/percpu_counter.h |    5 +++--
>  lib/percpu_counter.c           |    8 +++++++-
>  5 files changed, 52 insertions(+), 18 deletions(-)
> 
> Index: linux-2.6/fs/ext2/super.c
> ===================================================================
> --- linux-2.6.orig/fs/ext2/super.c
> +++ linux-2.6/fs/ext2/super.c
> @@ -725,6 +725,7 @@ static int ext2_fill_super(struct super_
>  	int db_count;
>  	int i, j;
>  	__le32 features;
> +	int err;
>  
>  	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
>  	if (!sbi)
> @@ -996,12 +997,20 @@ static int ext2_fill_super(struct super_
>  	sbi->s_rsv_window_head.rsv_goal_size = 0;
>  	ext2_rsv_window_add(sb, &sbi->s_rsv_window_head);
>  
> -	percpu_counter_init(&sbi->s_freeblocks_counter,
> +	err = percpu_counter_init(&sbi->s_freeblocks_counter,
>  				ext2_count_free_blocks(sb));
> -	percpu_counter_init(&sbi->s_freeinodes_counter,
> +	if (!err) {
> +		err = percpu_counter_init(&sbi->s_freeinodes_counter,
>  				ext2_count_free_inodes(sb));
> -	percpu_counter_init(&sbi->s_dirs_counter,
> +	}
> +	if (!err) {
> +		err = percpu_counter_init(&sbi->s_dirs_counter,
>  				ext2_count_dirs(sb));
> +	}
> +	if (err) {
> +		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
> +		goto failed_mount3;
> +	}
>  	/*
>  	 * set up enough so that it can read an inode
>  	 */

I find this more readable as I don't have to try to figure out what the
bitops are doing :)

Jeff.

-- 
Mankind invented the atomic bomb, but no mouse would ever construct a
mousetrap.
		- Albert Einstein

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 00/23] per device dirty throttling -v9
  2007-08-23 17:41     ` Peter Zijlstra
@ 2007-08-24 10:47       ` Martin Knoblauch
  0 siblings, 0 replies; 43+ messages in thread
From: Martin Knoblauch @ 2007-08-24 10:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel


--- Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> > --- Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > > 
> > > > Peter,
> > > > 
> > > >  any chance to get a rollup against 2.6.22-stable?
> > > > 
> > > >  The 2.6.23 series may not be usable for me due to the
> > > > nosharedcache changes for NFS (the new default will massively
> > > > disturb the user-space automounter).
> > > 
> > > I'll see what I can do, bit busy with other stuff atm, hopefully
> > > after
> > > the weekend.
> > > 
> > Hi Peter,
> > 
> >  any progress on a version against 2.6.22.5? I have seen the very
> > positive report from Jeffrey W. Baker and would really love to test
> > your patch. But as I said, anything newer than 2.6.22.x might not
> be an
> > option due to the NFS changes.
> 
> mindless port, seems to compile and boot on my test box ymmv.
> 
> I think .5 should not present anything other than trivial rejects if
> anything. But I'm not keeping -stable in my git remotes so I can't
> say
> for sure.

Hi Peter,

 thanks a lot. It applies to 2.6.22.5 almost cleanly, with just one
8-line offset in readahead.c.

 I will report testing-results separately.

Thanks
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 00/23] per device dirty throttling -v9
  2007-08-23 15:59   ` Martin Knoblauch
@ 2007-08-23 17:41     ` Peter Zijlstra
  2007-08-24 10:47       ` Martin Knoblauch
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-23 17:41 UTC (permalink / raw)
  To: spamtrap; +Cc: linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 1048 bytes --]

On Thu, 2007-08-23 at 08:59 -0700, Martin Knoblauch wrote:
> --- Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> > 
> > > Peter,
> > > 
> > >  any chance to get a rollup against 2.6.22-stable?
> > > 
> > >  The 2.6.23 series may not be usable for me due to the
> > > nosharedcache changes for NFS (the new default will massively
> > > disturb the user-space automounter).
> > 
> > I'll see what I can do, bit busy with other stuff atm, hopefully
> > after
> > the weekend.
> > 
> Hi Peter,
> 
>  any progress on a version against 2.6.22.5? I have seen the very
> positive report from Jeffrey W. Baker and would really love to test
> your patch. But as I said, anything newer than 2.6.22.x might not be an
> option due to the NFS changes.

mindless port, seems to compile and boot on my test box ymmv.

I think .5 should not present anything other than trivial rejects if
anything. But I'm not keeping -stable in my git remotes so I can't say
for sure.

[-- Attachment #1.2: bdi-rollup-v9-v2.6.22.patch --]
[-- Type: text/x-patch, Size: 69879 bytes --]

Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c
+++ linux-2.6/fs/nfs/write.c
@@ -237,10 +237,8 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH) {
+	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
 		clear_bdi_congested(&nfss->backing_dev_info, WRITE);
-		congestion_end(WRITE);
-	}
 }
 
 /*
@@ -466,6 +464,7 @@ nfs_mark_request_commit(struct nfs_page 
 	set_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 	spin_unlock(&nfsi->req_lock);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 }
 
@@ -552,6 +551,8 @@ static void nfs_cancel_commit_list(struc
 	while(!list_empty(head)) {
 		req = nfs_list_entry(head->next);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		nfs_inode_remove_request(req);
@@ -1207,6 +1208,8 @@ nfs_commit_list(struct inode *inode, str
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 		nfs_clear_page_writeback(req);
 	}
 	return -ENOMEM;
@@ -1232,6 +1235,8 @@ static void nfs_commit_done(struct rpc_t
 		nfs_list_remove_request(req);
 		clear_bit(PG_NEED_COMMIT, &(req)->wb_flags);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
+		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
+				BDI_RECLAIMABLE);
 
 		dprintk("NFS: commit (%s/%Ld %d@%Ld)",
 			req->wb_context->dentry->d_inode->i_sb->s_id,
Index: linux-2.6/include/linux/backing-dev.h
===================================================================
--- linux-2.6.orig/include/linux/backing-dev.h
+++ linux-2.6/include/linux/backing-dev.h
@@ -8,6 +8,9 @@
 #ifndef _LINUX_BACKING_DEV_H
 #define _LINUX_BACKING_DEV_H
 
+#include <linux/percpu_counter.h>
+#include <linux/log2.h>
+#include <linux/proportions.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +27,14 @@ enum bdi_state {
 
 typedef int (congested_fn)(void *, int);
 
+enum bdi_stat_item {
+	BDI_RECLAIMABLE,
+	BDI_WRITEBACK,
+	NR_BDI_STAT_ITEMS
+};
+
+#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
@@ -32,8 +43,90 @@ struct backing_dev_info {
 	void *congested_data;	/* Pointer to aux data for congested func */
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
+
+	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+
+	struct prop_local_percpu completions;
+	int dirty_exceeded;
 };
 
+int bdi_init(struct backing_dev_info *bdi);
+void bdi_destroy(struct backing_dev_info *bdi);
+
+static inline void __mod_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item, s64 amount)
+{
+	__percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+}
+
+static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__mod_bdi_stat(bdi, item, 1);
+}
+
+static inline void inc_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__inc_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	__mod_bdi_stat(bdi, item, -1);
+}
+
+static inline void dec_bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__dec_bdi_stat(bdi, item);
+	local_irq_restore(flags);
+}
+
+static inline s64 bdi_stat(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
+}
+
+static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
+		enum bdi_stat_item item)
+{
+	s64 sum;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	sum = __bdi_stat_sum(bdi, item);
+	local_irq_restore(flags);
+
+	return sum;
+}
+
+/*
+ * maximal error of a stat counter.
+ */
+static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+{
+#ifdef CONFIG_SMP
+	return nr_cpu_ids * BDI_STAT_BATCH;
+#else
+	return 1;
+#endif
+}
 
 /*
  * Flags in backing_dev_info::capability
@@ -94,7 +187,6 @@ void clear_bdi_congested(struct backing_
 void set_bdi_congested(struct backing_dev_info *bdi, int rw);
 long congestion_wait(int rw, long timeout);
 long congestion_wait_interruptible(int rw, long timeout);
-void congestion_end(int rw);
 
 #define bdi_cap_writeback_dirty(bdi) \
 	(!((bdi)->capabilities & BDI_CAP_NO_WRITEBACK))
Index: linux-2.6/mm/backing-dev.c
===================================================================
--- linux-2.6.orig/mm/backing-dev.c
+++ linux-2.6/mm/backing-dev.c
@@ -5,6 +5,41 @@
 #include <linux/sched.h>
 #include <linux/module.h>
 
+int bdi_init(struct backing_dev_info *bdi)
+{
+	int i, j;
+	int err;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
+		err = percpu_counter_init_irq(&bdi->bdi_stat[i], 0);
+		if (err)
+			goto err;
+	}
+
+	bdi->dirty_exceeded = 0;
+	err = prop_local_init_percpu(&bdi->completions);
+
+	if (err) {
+err:
+		for (j = 0; j < i; j++)
+			percpu_counter_destroy(&bdi->bdi_stat[j]);
+	}
+
+	return err;
+}
+EXPORT_SYMBOL(bdi_init);
+
+void bdi_destroy(struct backing_dev_info *bdi)
+{
+	int i;
+
+	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
+		percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+	prop_local_destroy_percpu(&bdi->completions);
+}
+EXPORT_SYMBOL(bdi_destroy);
+
 static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
@@ -70,16 +105,3 @@ long congestion_wait_interruptible(int r
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait_interruptible);
-
-/**
- * congestion_end - wake up sleepers on a congested backing_dev_info
- * @rw: READ or WRITE
- */
-void congestion_end(int rw)
-{
-	wait_queue_head_t *wqh = &congestion_wqh[rw];
-
-	if (waitqueue_active(wqh))
-		wake_up(wqh);
-}
-EXPORT_SYMBOL(congestion_end);
Index: linux-2.6/fs/ext2/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/balloc.c
+++ linux-2.6/fs/ext2/balloc.c
@@ -124,7 +124,7 @@ static int reserve_blocks(struct super_b
 			return 0;
 	}
 
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -count);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, count);
 	sb->s_dirt = 1;
 	return count;
 }
@@ -134,7 +134,7 @@ static void release_blocks(struct super_
 	if (count) {
 		struct ext2_sb_info *sbi = EXT2_SB(sb);
 
-		percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+		percpu_counter_add(&sbi->s_freeblocks_counter, count);
 		sb->s_dirt = 1;
 	}
 }
Index: linux-2.6/fs/ext2/ialloc.c
===================================================================
--- linux-2.6.orig/fs/ext2/ialloc.c
+++ linux-2.6/fs/ext2/ialloc.c
@@ -542,7 +542,7 @@ got:
 		goto fail;
 	}
 
-	percpu_counter_mod(&sbi->s_freeinodes_counter, -1);
+	percpu_counter_add(&sbi->s_freeinodes_counter, -1);
 	if (S_ISDIR(mode))
 		percpu_counter_inc(&sbi->s_dirs_counter);
 
Index: linux-2.6/fs/ext3/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext3/balloc.c
+++ linux-2.6/fs/ext3/balloc.c
@@ -570,7 +570,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1633,7 +1633,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext3_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext3/resize.c
===================================================================
--- linux-2.6.orig/fs/ext3/resize.c
+++ linux-2.6/fs/ext3/resize.c
@@ -884,9 +884,9 @@ int ext3_group_add(struct super_block *s
 		input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT3_INODES_PER_GROUP(sb));
 
 	ext3_journal_dirty_metadata(handle, sbi->s_sbh);
Index: linux-2.6/fs/ext4/balloc.c
===================================================================
--- linux-2.6.orig/fs/ext4/balloc.c
+++ linux-2.6/fs/ext4/balloc.c
@@ -587,7 +587,7 @@ do_more:
 		cpu_to_le16(le16_to_cpu(desc->bg_free_blocks_count) +
 			group_freed);
 	spin_unlock(sb_bgl_lock(sbi, block_group));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, count);
+	percpu_counter_add(&sbi->s_freeblocks_counter, count);
 
 	/* We dirtied the bitmap block */
 	BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1647,7 +1647,7 @@ allocated:
 	gdp->bg_free_blocks_count =
 			cpu_to_le16(le16_to_cpu(gdp->bg_free_blocks_count)-num);
 	spin_unlock(sb_bgl_lock(sbi, group_no));
-	percpu_counter_mod(&sbi->s_freeblocks_counter, -num);
+	percpu_counter_sub(&sbi->s_freeblocks_counter, num);
 
 	BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
 	err = ext4_journal_dirty_metadata(handle, gdp_bh);
Index: linux-2.6/fs/ext4/resize.c
===================================================================
--- linux-2.6.orig/fs/ext4/resize.c
+++ linux-2.6/fs/ext4/resize.c
@@ -893,9 +893,9 @@ int ext4_group_add(struct super_block *s
 		input->reserved_blocks);
 
 	/* Update the free space counts */
-	percpu_counter_mod(&sbi->s_freeblocks_counter,
+	percpu_counter_add(&sbi->s_freeblocks_counter,
 			   input->free_blocks_count);
-	percpu_counter_mod(&sbi->s_freeinodes_counter,
+	percpu_counter_add(&sbi->s_freeinodes_counter,
 			   EXT4_INODES_PER_GROUP(sb));
 
 	ext4_journal_dirty_metadata(handle, sbi->s_sbh);
Index: linux-2.6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.orig/include/linux/percpu_counter.h
+++ linux-2.6/include/linux/percpu_counter.h
@@ -26,20 +26,43 @@ struct percpu_counter {
 #define FBC_BATCH	(NR_CPUS*4)
 #endif
 
-static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+static inline
+int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	spin_lock_init(&fbc->lock);
 	fbc->count = amount;
 	fbc->counters = alloc_percpu(s32);
+	if (!fbc->counters)
+		return -ENOMEM;
+	return 0;
 }
 
+int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount);
+
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 	free_percpu(fbc->counters);
 }
 
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount);
-s64 percpu_counter_sum(struct percpu_counter *fbc);
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch);
+s64 __percpu_counter_sum(struct percpu_counter *fbc);
+
+static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
+{
+	__percpu_counter_add(fbc, amount, FBC_BATCH);
+}
+
+static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
+{
+	s64 ret = __percpu_counter_sum(fbc);
+	return ret < 0 ? 0 : ret;
+}
+
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return __percpu_counter_sum(fbc);
+}
 
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
@@ -67,17 +90,28 @@ struct percpu_counter {
 	s64 count;
 };
 
-static inline void percpu_counter_init(struct percpu_counter *fbc, s64 amount)
+static inline int percpu_counter_init(struct percpu_counter *fbc, s64 amount)
 {
 	fbc->count = amount;
+	return 0;
 }
 
+#define percpu_counter_init_irq percpu_counter_init
+
 static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 {
 }
 
+static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
+{
+	fbc->count = amount;
+}
+
+#define __percpu_counter_add(fbc, amount, batch) \
+	percpu_counter_add(fbc, amount)
+
 static inline void
-percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
 	preempt_disable();
 	fbc->count += amount;
@@ -94,21 +128,31 @@ static inline s64 percpu_counter_read_po
 	return fbc->count;
 }
 
-static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+static inline s64 percpu_counter_sum_positive(struct percpu_counter *fbc)
 {
 	return percpu_counter_read_positive(fbc);
 }
 
+static inline s64 percpu_counter_sum(struct percpu_counter *fbc)
+{
+	return percpu_counter_read(fbc);
+}
+
 #endif	/* CONFIG_SMP */
 
 static inline void percpu_counter_inc(struct percpu_counter *fbc)
 {
-	percpu_counter_mod(fbc, 1);
+	percpu_counter_add(fbc, 1);
 }
 
 static inline void percpu_counter_dec(struct percpu_counter *fbc)
 {
-	percpu_counter_mod(fbc, -1);
+	percpu_counter_add(fbc, -1);
+}
+
+static inline void percpu_counter_sub(struct percpu_counter *fbc, s64 amount)
+{
+	percpu_counter_add(fbc, -amount);
 }
 
 #endif /* _LINUX_PERCPU_COUNTER_H */
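
A minimal usage sketch of the reworked API (not part of the patch; the
demo_counter name and values are invented): callers now have to check the
error return from percpu_counter_init(), and _add/_sub replace the old
percpu_counter_mod():

	static struct percpu_counter demo_counter;	/* invented example */

	static int __init demo_init(void)
	{
		/* returns -ENOMEM on SMP if alloc_percpu() fails */
		int err = percpu_counter_init(&demo_counter, 0);
		if (err)
			return err;

		percpu_counter_add(&demo_counter, 16);
		percpu_counter_sub(&demo_counter, 4);	/* was _mod(&c, -4) */

		/* cheap approximate read vs. exact (lock-taking) sum */
		printk(KERN_INFO "approx %lld exact %lld\n",
			(long long)percpu_counter_read(&demo_counter),
			(long long)percpu_counter_sum(&demo_counter));
		return 0;
	}
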
Index: linux-2.6/lib/percpu_counter.c
===================================================================
--- linux-2.6.orig/lib/percpu_counter.c
+++ linux-2.6/lib/percpu_counter.c
@@ -5,15 +5,41 @@
 #include <linux/percpu_counter.h>
 #include <linux/module.h>
 
-void percpu_counter_mod(struct percpu_counter *fbc, s32 amount)
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 {
-	long count;
+	int cpu;
+
+	spin_lock(&fbc->lock);
+	for_each_possible_cpu(cpu) {
+		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
+		*pcount = 0;
+	}
+	fbc->count = amount;
+	spin_unlock(&fbc->lock);
+}
+EXPORT_SYMBOL(percpu_counter_set);
+
+static struct lock_class_key percpu_counter_irqsafe;
+
+int percpu_counter_init_irq(struct percpu_counter *fbc, s64 amount)
+{
+	int err;
+
+	err = percpu_counter_init(fbc, amount);
+	if (!err)
+		lockdep_set_class(&fbc->lock, &percpu_counter_irqsafe);
+	return err;
+}
+
+void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+	s64 count;
 	s32 *pcount;
 	int cpu = get_cpu();
 
 	pcount = per_cpu_ptr(fbc->counters, cpu);
 	count = *pcount + amount;
-	if (count >= FBC_BATCH || count <= -FBC_BATCH) {
+	if (count >= batch || count <= -batch) {
 		spin_lock(&fbc->lock);
 		fbc->count += count;
 		*pcount = 0;
@@ -23,13 +49,13 @@ void percpu_counter_mod(struct percpu_co
 	}
 	put_cpu();
 }
-EXPORT_SYMBOL(percpu_counter_mod);
+EXPORT_SYMBOL(__percpu_counter_add);
 
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
  */
-s64 percpu_counter_sum(struct percpu_counter *fbc)
+s64 __percpu_counter_sum(struct percpu_counter *fbc)
 {
 	s64 ret;
 	int cpu;
@@ -41,6 +67,6 @@ s64 percpu_counter_sum(struct percpu_cou
 		ret += *pcount;
 	}
 	spin_unlock(&fbc->lock);
-	return ret < 0 ? 0 : ret;
+	return ret;
 }
-EXPORT_SYMBOL(percpu_counter_sum);
+EXPORT_SYMBOL(__percpu_counter_sum);
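
The batch argument is what bounds the error of the cheap read path: each CPU
accumulates up to batch-1 events locally before folding them into fbc->count
under the lock. A hedged sketch of the tradeoff, reusing the invented
demo_counter from the sketch above:

	/* fold per-cpu deltas into fbc->count once they reach +/-8 */
	__percpu_counter_add(&demo_counter, 1, 8);

	/*
	 * With batch b and n CPUs, percpu_counter_read() can be off by
	 * nearly b*n; __percpu_counter_sum() adds all per-cpu deltas under
	 * the lock for an exact, possibly negative, value, which is why
	 * percpu_counter_sum_positive() exists for the statfs-style users.
	 */
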
Index: linux-2.6/fs/ext3/super.c
===================================================================
--- linux-2.6.orig/fs/ext3/super.c
+++ linux-2.6/fs/ext3/super.c
@@ -1406,6 +1406,7 @@ static int ext3_fill_super (struct super
 	int i;
 	int needs_recovery;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1665,12 +1666,16 @@ static int ext3_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 		ext3_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 		ext3_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 		ext3_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT3-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
@@ -2448,12 +2453,12 @@ static int ext3_statfs (struct dentry * 
 	buf->f_type = EXT3_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = le32_to_cpu(es->s_blocks_count) - overhead;
-	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
 	buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
 	if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
-	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
 	buf->f_namelen = EXT3_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
 	       le64_to_cpup((void *)es->s_uuid + sizeof(u64));
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c
+++ linux-2.6/fs/ext4/super.c
@@ -1465,6 +1465,7 @@ static int ext4_fill_super (struct super
 	int needs_recovery;
 	__le32 features;
 	__u64 blocks_count;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -1737,12 +1738,16 @@ static int ext4_fill_super (struct super
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 		ext4_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 		ext4_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 		ext4_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT4-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 
 	/* per fileystem reservation list head & lock */
 	spin_lock_init(&sbi->s_rsv_window_lock);
@@ -2523,12 +2528,12 @@ static int ext4_statfs (struct dentry * 
 	buf->f_type = EXT4_SUPER_MAGIC;
 	buf->f_bsize = sb->s_blocksize;
 	buf->f_blocks = ext4_blocks_count(es) - overhead;
-	buf->f_bfree = percpu_counter_sum(&sbi->s_freeblocks_counter);
+	buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
 	buf->f_bavail = buf->f_bfree - ext4_r_blocks_count(es);
 	if (buf->f_bfree < ext4_r_blocks_count(es))
 		buf->f_bavail = 0;
 	buf->f_files = le32_to_cpu(es->s_inodes_count);
-	buf->f_ffree = percpu_counter_sum(&sbi->s_freeinodes_counter);
+	buf->f_ffree = percpu_counter_sum_positive(&sbi->s_freeinodes_counter);
 	buf->f_namelen = EXT4_NAME_LEN;
 	fsid = le64_to_cpup((void *)es->s_uuid) ^
 	       le64_to_cpup((void *)es->s_uuid + sizeof(u64));
Index: linux-2.6/fs/file_table.c
===================================================================
--- linux-2.6.orig/fs/file_table.c
+++ linux-2.6/fs/file_table.c
@@ -98,7 +98,7 @@ struct file *get_empty_filp(void)
 		 * percpu_counters are inaccurate.  Do an expensive check before
 		 * we go and fail.
 		 */
-		if (percpu_counter_sum(&nr_files) >= files_stat.max_files)
+		if (percpu_counter_sum_positive(&nr_files) >= files_stat.max_files)
 			goto over;
 	}
 
Index: linux-2.6/fs/ext2/super.c
===================================================================
--- linux-2.6.orig/fs/ext2/super.c
+++ linux-2.6/fs/ext2/super.c
@@ -652,6 +652,7 @@ static int ext2_fill_super(struct super_
 	int db_count;
 	int i, j;
 	__le32 features;
+	int err;
 
 	sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);
 	if (!sbi)
@@ -907,12 +908,16 @@ static int ext2_fill_super(struct super_
 	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
 	spin_lock_init(&sbi->s_next_gen_lock);
 
-	percpu_counter_init(&sbi->s_freeblocks_counter,
+	err = percpu_counter_init(&sbi->s_freeblocks_counter,
 				ext2_count_free_blocks(sb));
-	percpu_counter_init(&sbi->s_freeinodes_counter,
+	err |= percpu_counter_init(&sbi->s_freeinodes_counter,
 				ext2_count_free_inodes(sb));
-	percpu_counter_init(&sbi->s_dirs_counter,
+	err |= percpu_counter_init(&sbi->s_dirs_counter,
 				ext2_count_dirs(sb));
+	if (err) {
+		printk(KERN_ERR "EXT2-fs: insufficient memory\n");
+		goto failed_mount3;
+	}
 	/*
 	 * set up enough so that it can read an inode
 	 */
Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -1783,6 +1783,7 @@ static void blk_release_queue(struct kob
 
 	blk_trace_shutdown(q);
 
+	bdi_destroy(&q->backing_dev_info);
 	kmem_cache_free(requestq_cachep, q);
 }
 
@@ -1835,6 +1836,7 @@ static struct kobj_type queue_ktype;
 
 request_queue_t *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
+	int err;
 	request_queue_t *q;
 
 	q = kmem_cache_alloc_node(requestq_cachep, gfp_mask, node_id);
@@ -1842,15 +1844,20 @@ request_queue_t *blk_alloc_queue_node(gf
 		return NULL;
 
 	memset(q, 0, sizeof(*q));
+	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
+	q->backing_dev_info.unplug_io_data = q;
+	err = bdi_init(&q->backing_dev_info);
+	if (err) {
+		kmem_cache_free(requestq_cachep, q);
+		return NULL;
+	}
+
 	init_timer(&q->unplug_timer);
 
 	snprintf(q->kobj.name, KOBJ_NAME_LEN, "%s", "queue");
 	q->kobj.ktype = &queue_ktype;
 	kobject_init(&q->kobj);
 
-	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
-	q->backing_dev_info.unplug_io_data = q;
-
 	mutex_init(&q->sysfs_lock);
 
 	return q;
@@ -3984,6 +3991,73 @@ static ssize_t queue_max_hw_sectors_show
 	return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_nr_reclaimable_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_reclaimable =
+		bdi_stat(&q->backing_dev_info, BDI_RECLAIMABLE);
+
+	return sprintf(page, "%llu\n",
+			nr_reclaimable >> (PAGE_CACHE_SHIFT - 10));
+}
+
+static ssize_t queue_nr_writeback_show(struct request_queue *q, char *page)
+{
+	unsigned long long nr_writeback =
+		bdi_stat(&q->backing_dev_info, BDI_WRITEBACK);
+
+	return sprintf(page, "%llu\n",
+			nr_writeback >> (PAGE_CACHE_SHIFT - 10));
+}
+
+extern void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator);
+
+static ssize_t queue_nr_cache_ratio_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+	scale *= 1024;
+	scale /= div;
+
+	return sprintf(page, "%ld\n", scale);
+}
+
+static ssize_t queue_nr_cache_num_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+
+	return sprintf(page, "%ld\n", scale);
+}
+
+static ssize_t queue_nr_cache_denom_show(struct request_queue *q, char *page)
+{
+	long scale, div;
+
+	bdi_writeout_fraction(&q->backing_dev_info, &scale, &div);
+
+	return sprintf(page, "%ld\n", div);
+}
+
+extern void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		struct backing_dev_info *bdi);
+
+static ssize_t queue_nr_cache_size_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", bdi_dirty);
+}
+
+static ssize_t queue_nr_cache_total_show(struct request_queue *q, char *page)
+{
+	long background, dirty, bdi_dirty;
+	get_dirty_limits(&background, &dirty, &bdi_dirty, &q->backing_dev_info);
+	return sprintf(page, "%ld\n", dirty);
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
 	.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4008,6 +4082,41 @@ static struct queue_sysfs_entry queue_ma
 	.show = queue_max_hw_sectors_show,
 };
 
+static struct queue_sysfs_entry queue_reclaimable_entry = {
+	.attr = {.name = "reclaimable_kb", .mode = S_IRUGO },
+	.show = queue_nr_reclaimable_show,
+};
+
+static struct queue_sysfs_entry queue_writeback_entry = {
+	.attr = {.name = "writeback_kb", .mode = S_IRUGO },
+	.show = queue_nr_writeback_show,
+};
+
+static struct queue_sysfs_entry queue_cache_ratio_entry = {
+	.attr = {.name = "cache_ratio", .mode = S_IRUGO },
+	.show = queue_nr_cache_ratio_show,
+};
+
+static struct queue_sysfs_entry queue_cache_num_entry = {
+	.attr = {.name = "cache_num", .mode = S_IRUGO },
+	.show = queue_nr_cache_num_show,
+};
+
+static struct queue_sysfs_entry queue_cache_denom_entry = {
+	.attr = {.name = "cache_denom", .mode = S_IRUGO },
+	.show = queue_nr_cache_denom_show,
+};
+
+static struct queue_sysfs_entry queue_cache_size_entry = {
+	.attr = {.name = "cache_size", .mode = S_IRUGO },
+	.show = queue_nr_cache_size_show,
+};
+
+static struct queue_sysfs_entry queue_cache_total_entry = {
+	.attr = {.name = "cache_total", .mode = S_IRUGO },
+	.show = queue_nr_cache_total_show,
+};
+
 static struct queue_sysfs_entry queue_iosched_entry = {
 	.attr = {.name = "scheduler", .mode = S_IRUGO | S_IWUSR },
 	.show = elv_iosched_show,
@@ -4019,6 +4128,13 @@ static struct attribute *default_attrs[]
 	&queue_ra_entry.attr,
 	&queue_max_hw_sectors_entry.attr,
 	&queue_max_sectors_entry.attr,
+	&queue_reclaimable_entry.attr,
+	&queue_writeback_entry.attr,
+	&queue_cache_ratio_entry.attr,
+	&queue_cache_num_entry.attr,
+	&queue_cache_denom_entry.attr,
+	&queue_cache_size_entry.attr,
+	&queue_cache_total_entry.attr,
 	&queue_iosched_entry.attr,
 	NULL,
 };
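
As a worked example of the fixed-point arithmetic in
queue_nr_cache_ratio_show() above (numbers invented): a queue that completed
a quarter of all recent writeout gets numerator/denominator = 1/4 from
bdi_writeout_fraction(), so cache_ratio reads 1 * 1024 / 4 = 256, i.e. the
device's share in 1/1024 units; cache_num and cache_denom expose the raw
fraction, and cache_size/cache_total the resulting per-device and global
dirty limits in pages.
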
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -411,6 +411,9 @@ static void __exit rd_cleanup(void)
 		blk_cleanup_queue(rd_queue[i]);
 	}
 	unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
+
+	bdi_destroy(&rd_file_backing_dev_info);
+	bdi_destroy(&rd_backing_dev_info);
 }
 
 /*
@@ -419,7 +422,19 @@ static void __exit rd_cleanup(void)
 static int __init rd_init(void)
 {
 	int i;
-	int err = -ENOMEM;
+	int err;
+
+	err = bdi_init(&rd_backing_dev_info);
+	if (err)
+		goto out2;
+
+	err = bdi_init(&rd_file_backing_dev_info);
+	if (err) {
+		bdi_destroy(&rd_backing_dev_info);
+		goto out2;
+	}
+
+	err = -ENOMEM;
 
 	if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
 			(rd_blocksize & (rd_blocksize-1))) {
@@ -473,6 +488,9 @@ out:
 		put_disk(rd_disks[i]);
 		blk_cleanup_queue(rd_queue[i]);
 	}
+	bdi_destroy(&rd_backing_dev_info);
+	bdi_destroy(&rd_file_backing_dev_info);
+out2:
 	return err;
 }
 
Index: linux-2.6/drivers/char/mem.c
===================================================================
--- linux-2.6.orig/drivers/char/mem.c
+++ linux-2.6/drivers/char/mem.c
@@ -977,6 +977,11 @@ static struct class *mem_class;
 static int __init chr_dev_init(void)
 {
 	int i;
+	int err;
+
+	err = bdi_init(&zero_bdi);
+	if (err)
+		return err;
 
 	if (register_chrdev(MEM_MAJOR,"mem",&memory_fops))
 		printk("unable to get major %d for memory devs\n", MEM_MAJOR);
Index: linux-2.6/fs/char_dev.c
===================================================================
--- linux-2.6.orig/fs/char_dev.c
+++ linux-2.6/fs/char_dev.c
@@ -546,6 +546,7 @@ static struct kobject *base_probe(dev_t 
 void __init chrdev_init(void)
 {
 	cdev_map = kobj_map_init(base_probe, &chrdevs_lock);
+	bdi_init(&directly_mappable_cdev_bdi);
 }
 
 
Index: linux-2.6/fs/fuse/inode.c
===================================================================
--- linux-2.6.orig/fs/fuse/inode.c
+++ linux-2.6/fs/fuse/inode.c
@@ -401,6 +401,7 @@ static int fuse_show_options(struct seq_
 static struct fuse_conn *new_conn(void)
 {
 	struct fuse_conn *fc;
+	int err;
 
 	fc = kzalloc(sizeof(*fc), GFP_KERNEL);
 	if (fc) {
@@ -416,10 +417,17 @@ static struct fuse_conn *new_conn(void)
 		atomic_set(&fc->num_waiting, 0);
 		fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 		fc->bdi.unplug_io_fn = default_unplug_io_fn;
+		err = bdi_init(&fc->bdi);
+		if (err) {
+			kfree(fc);
+			fc = NULL;
+			goto out;
+		}
 		fc->reqctr = 0;
 		fc->blocked = 1;
 		get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key));
 	}
+out:
 	return fc;
 }
 
@@ -429,6 +437,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		if (fc->destroy_req)
 			fuse_request_free(fc->destroy_req);
 		mutex_destroy(&fc->inst_mutex);
+		bdi_destroy(&fc->bdi);
 		kfree(fc);
 	}
 }
Index: linux-2.6/fs/nfs/client.c
===================================================================
--- linux-2.6.orig/fs/nfs/client.c
+++ linux-2.6/fs/nfs/client.c
@@ -658,6 +658,7 @@ static void nfs_server_set_fsinfo(struct
 	if (server->rsize > NFS_MAX_FILE_IO_SIZE)
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
@@ -708,6 +709,10 @@ static int nfs_probe_fsinfo(struct nfs_s
 		goto out_error;
 
 	nfs_server_set_fsinfo(server, &fsinfo);
+	error = bdi_init(&server->backing_dev_info);
+	if (error)
+		goto out_error;
+
 
 	/* Get some general file system info */
 	if (server->namelen == 0) {
@@ -787,6 +792,7 @@ void nfs_free_server(struct nfs_server *
 	nfs_put_client(server->nfs_client);
 
 	nfs_free_iostats(server->io_stats);
+	bdi_destroy(&server->backing_dev_info);
 	kfree(server);
 	nfs_release_automount_timer();
 	dprintk("<-- nfs_free_server()\n");
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c
+++ linux-2.6/fs/hugetlbfs/inode.c
@@ -802,11 +802,15 @@ static int __init init_hugetlbfs_fs(void
 	int error;
 	struct vfsmount *vfsmount;
 
+	error = bdi_init(&hugetlbfs_backing_dev_info);
+	if (error)
+		return error;
+
 	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
 					sizeof(struct hugetlbfs_inode_info),
 					0, 0, init_once, NULL);
 	if (hugetlbfs_inode_cachep == NULL)
-		return -ENOMEM;
+		goto out2;
 
 	error = register_filesystem(&hugetlbfs_fs_type);
 	if (error)
@@ -824,6 +828,8 @@ static int __init init_hugetlbfs_fs(void
  out:
 	if (error)
 		kmem_cache_destroy(hugetlbfs_inode_cachep);
+ out2:
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 	return error;
 }
 
@@ -831,6 +837,7 @@ static void __exit exit_hugetlbfs_fs(voi
 {
 	kmem_cache_destroy(hugetlbfs_inode_cachep);
 	unregister_filesystem(&hugetlbfs_fs_type);
+	bdi_destroy(&hugetlbfs_backing_dev_info);
 }
 
 module_init(init_hugetlbfs_fs)
Index: linux-2.6/fs/ocfs2/dlm/dlmfs.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/dlm/dlmfs.c
+++ linux-2.6/fs/ocfs2/dlm/dlmfs.c
@@ -588,13 +588,17 @@ static int __init init_dlmfs_fs(void)
 
 	dlmfs_print_version();
 
+	status = bdi_init(&dlmfs_backing_dev_info);
+	if (status)
+		return status;
+
 	dlmfs_inode_cache = kmem_cache_create("dlmfs_inode_cache",
 				sizeof(struct dlmfs_inode_private),
 				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
 					SLAB_MEM_SPREAD),
 				dlmfs_init_once, NULL);
 	if (!dlmfs_inode_cache)
-		return -ENOMEM;
+		goto bail;
 	cleanup_inode = 1;
 
 	user_dlm_worker = create_singlethread_workqueue("user_dlm");
@@ -611,6 +615,7 @@ bail:
 			kmem_cache_destroy(dlmfs_inode_cache);
 		if (cleanup_worker)
 			destroy_workqueue(user_dlm_worker);
+		bdi_destroy(&dlmfs_backing_dev_info);
 	} else
 		printk("OCFS2 User DLM kernel interface loaded\n");
 	return status;
@@ -624,6 +629,8 @@ static void __exit exit_dlmfs_fs(void)
 	destroy_workqueue(user_dlm_worker);
 
 	kmem_cache_destroy(dlmfs_inode_cache);
+
+	bdi_destroy(&dlmfs_backing_dev_info);
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/configfs/configfs_internal.h
===================================================================
--- linux-2.6.orig/fs/configfs/configfs_internal.h
+++ linux-2.6/fs/configfs/configfs_internal.h
@@ -55,6 +55,8 @@ extern int configfs_is_root(struct confi
 
 extern struct inode * configfs_new_inode(mode_t mode, struct configfs_dirent *);
 extern int configfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+extern int configfs_inode_init(void);
+extern void configfs_inode_exit(void);
 
 extern int configfs_create_file(struct config_item *, const struct configfs_attribute *);
 extern int configfs_make_dirent(struct configfs_dirent *,
Index: linux-2.6/fs/configfs/inode.c
===================================================================
--- linux-2.6.orig/fs/configfs/inode.c
+++ linux-2.6/fs/configfs/inode.c
@@ -256,4 +256,12 @@ void configfs_hash_and_remove(struct den
 	mutex_unlock(&dir->d_inode->i_mutex);
 }
 
+int __init configfs_inode_init(void)
+{
+	return bdi_init(&configfs_backing_dev_info);
+}
 
+void __exit configfs_inode_exit(void)
+{
+	bdi_destroy(&configfs_backing_dev_info);
+}
Index: linux-2.6/fs/configfs/mount.c
===================================================================
--- linux-2.6.orig/fs/configfs/mount.c
+++ linux-2.6/fs/configfs/mount.c
@@ -154,8 +154,16 @@ static int __init configfs_init(void)
 		subsystem_unregister(&config_subsys);
 		kmem_cache_destroy(configfs_dir_cachep);
 		configfs_dir_cachep = NULL;
+		goto out;
 	}
 
+	err = configfs_inode_init();
+	if (err) {
+		unregister_filesystem(&configfs_fs_type);
+		subsystem_unregister(&config_subsys);
+		kmem_cache_destroy(configfs_dir_cachep);
+		configfs_dir_cachep = NULL;
+	}
 out:
 	return err;
 }
@@ -166,6 +174,7 @@ static void __exit configfs_exit(void)
 	subsystem_unregister(&config_subsys);
 	kmem_cache_destroy(configfs_dir_cachep);
 	configfs_dir_cachep = NULL;
+	configfs_inode_exit();
 }
 
 MODULE_AUTHOR("Oracle");
Index: linux-2.6/fs/ramfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ramfs/inode.c
+++ linux-2.6/fs/ramfs/inode.c
@@ -222,7 +222,17 @@ module_exit(exit_ramfs_fs)
 
 int __init init_rootfs(void)
 {
-	return register_filesystem(&rootfs_fs_type);
+	int err;
+
+	err = bdi_init(&ramfs_backing_dev_info);
+	if (err)
+		return err;
+
+	err = register_filesystem(&rootfs_fs_type);
+	if (err)
+		bdi_destroy(&ramfs_backing_dev_info);
+
+	return err;
 }
 
 MODULE_LICENSE("GPL");
Index: linux-2.6/fs/sysfs/inode.c
===================================================================
--- linux-2.6.orig/fs/sysfs/inode.c
+++ linux-2.6/fs/sysfs/inode.c
@@ -44,6 +44,11 @@ void sysfs_delete_inode(struct inode *in
 	return generic_delete_inode(inode);
 }
 
+int __init sysfs_inode_init(void)
+{
+	return bdi_init(&sysfs_backing_dev_info);
+}
+
 int sysfs_setattr(struct dentry * dentry, struct iattr * iattr)
 {
 	struct inode * inode = dentry->d_inode;
Index: linux-2.6/fs/sysfs/mount.c
===================================================================
--- linux-2.6.orig/fs/sysfs/mount.c
+++ linux-2.6/fs/sysfs/mount.c
@@ -98,6 +98,10 @@ int __init sysfs_init(void)
 	if (!sysfs_dir_cachep)
 		goto out;
 
+	err = sysfs_inode_init();
+	if (err)
+		goto out_err;
+
 	err = register_filesystem(&sysfs_fs_type);
 	if (!err) {
 		sysfs_mount = kern_mount(&sysfs_fs_type);
Index: linux-2.6/fs/sysfs/sysfs.h
===================================================================
--- linux-2.6.orig/fs/sysfs/sysfs.h
+++ linux-2.6/fs/sysfs/sysfs.h
@@ -17,6 +17,7 @@ extern struct kmem_cache *sysfs_dir_cach
 extern void sysfs_delete_inode(struct inode *inode);
 extern struct inode * sysfs_new_inode(mode_t mode, struct sysfs_dirent *);
 extern int sysfs_create(struct dentry *, int mode, int (*init)(struct inode *));
+extern int sysfs_inode_init(void);
 
 extern int sysfs_dirent_exist(struct sysfs_dirent *, const unsigned char *);
 extern int sysfs_make_dirent(struct sysfs_dirent *, struct dentry *, void *,
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c
+++ linux-2.6/mm/shmem.c
@@ -2490,6 +2490,10 @@ static int __init init_tmpfs(void)
 {
 	int error;
 
+	error = bdi_init(&shmem_backing_dev_info);
+	if (error)
+		goto out4;
+
 	error = init_inodecache();
 	if (error)
 		goto out3;
@@ -2514,6 +2518,8 @@ out1:
 out2:
 	destroy_inodecache();
 out3:
+	bdi_destroy(&shmem_backing_dev_info);
+out4:
 	shm_mnt = ERR_PTR(error);
 	return error;
 }
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -505,6 +505,10 @@ void __init swap_setup(void)
 {
 	unsigned long megs = num_physpages >> (20 - PAGE_SHIFT);
 
+#ifdef CONFIG_SWAP
+	bdi_init(swapper_space.backing_dev_info);
+#endif
+
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
 		page_cluster = 2;
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -75,6 +75,12 @@ static inline void ra_off(struct file_ra
 	return;
 }
 
+static int __init readahead_init(void)
+{
+	return bdi_init(&default_backing_dev_info);
+}
+subsys_initcall(readahead_init);
+
 /*
  * Set the initial window size, round to next power of 2 and square
  * for small size, x 4 for medium, and x 2 for large
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -726,6 +726,8 @@ int __set_page_dirty_buffers(struct page
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
+			__inc_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
 		radix_tree_tag_set(&mapping->page_tree,
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -2,6 +2,7 @@
  * mm/page-writeback.c
  *
  * Copyright (C) 2002, Linus Torvalds.
+ * Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
  *
  * Contains functions related to writing back dirty pages at the
  * address_space level.
@@ -49,8 +50,6 @@
  */
 static long ratelimit_pages = 32;
 
-static int dirty_exceeded __cacheline_aligned_in_smp;	/* Dirty mem may be over limit */
-
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
@@ -103,6 +102,141 @@ EXPORT_SYMBOL(laptop_mode);
 static void background_writeout(unsigned long _min_pages);
 
 /*
+ * Scale the writeback cache size proportional to the relative writeout speeds.
+ *
+ * We do this by keeping a floating proportion between BDIs, based on page
+ * writeback completions [end_page_writeback()]. Those devices that write out
+ * pages fastest will get the larger share, while the slower will get a smaller
+ * share.
+ *
+ * We use page writeout completions because we are interested in getting rid of
+ * dirty pages. Having them written out is the primary goal.
+ *
+ * We introduce a concept of time, a period over which we measure these events,
+ * because demand can/will vary over time. The length of this period itself is
+ * measured in page writeback completions.
+ *
+ */
+static struct prop_descriptor vm_completions;
+static struct prop_descriptor vm_dirties;
+
+static unsigned long determine_dirtyable_memory(void);
+
+/*
+ * couple the period to the dirty_ratio:
+ *
+ *   period/2 ~ roundup_pow_of_two(dirty limit)
+ */
+static int calc_period_shift(void)
+{
+	unsigned long dirty_total;
+
+	dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100;
+	return 2 + ilog2(dirty_total - 1);
+}
+
+/*
+ * update the period when the dirty ratio changes.
+ */
+int dirty_ratio_handler(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos)
+{
+	int old_ratio = vm_dirty_ratio;
+	int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos);
+	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
+		int shift = calc_period_shift();
+		prop_change_shift(&vm_completions, shift);
+		prop_change_shift(&vm_dirties, shift);
+	}
+	return ret;
+}
+
+/*
+ * Increment the BDI's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
+{
+	__prop_inc_percpu(&vm_completions, &bdi->completions);
+}
+
+static inline void task_dirty_inc(struct task_struct *tsk)
+{
+	prop_inc_single(&vm_dirties, &tsk->dirties);
+}
+
+/*
+ * Obtain an accurate fraction of the BDI's portion.
+ */
+void bdi_writeout_fraction(struct backing_dev_info *bdi,
+		long *numerator, long *denominator)
+{
+	if (bdi_cap_writeback_dirty(bdi)) {
+		prop_fraction_percpu(&vm_completions, &bdi->completions,
+				numerator, denominator);
+	} else {
+		*numerator = 0;
+		*denominator = 1;
+	}
+}
+
+/*
+ * Clip the earned share of dirty pages to that which is actually available.
+ * This avoids exceeding the total dirty_limit when the floating averages
+ * fluctuate too quickly.
+ */
+static void
+clip_bdi_dirty_limit(struct backing_dev_info *bdi, long dirty, long *pbdi_dirty)
+{
+	long avail_dirty;
+
+	avail_dirty = dirty -
+		(global_page_state(NR_FILE_DIRTY) +
+		 global_page_state(NR_WRITEBACK) +
+		 global_page_state(NR_UNSTABLE_NFS));
+
+	if (avail_dirty < 0)
+		avail_dirty = 0;
+
+	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
+		bdi_stat(bdi, BDI_WRITEBACK);
+
+	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
+}
+
+static inline void task_dirties_fraction(struct task_struct *tsk,
+		long *numerator, long *denominator)
+{
+	prop_fraction_single(&vm_dirties, &tsk->dirties,
+				numerator, denominator);
+}
+
+/*
+ * scale the dirty limit
+ *
+ * task specific dirty limit:
+ *
+ *   dirty -= (dirty/8) * p_{t}
+ */
+void task_dirty_limit(struct task_struct *tsk, long *pdirty)
+{
+	long numerator, denominator;
+	long dirty = *pdirty;
+	long long inv = dirty >> 3;
+
+	task_dirties_fraction(tsk, &numerator, &denominator);
+	inv *= numerator;
+	do_div(inv, denominator);
+
+	dirty -= inv;
+	if (dirty < *pdirty/2)
+		dirty = *pdirty/2;
+
+	*pdirty = dirty;
+}
+
+/*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
  *
@@ -157,9 +291,9 @@ static unsigned long determine_dirtyable
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-static void
-get_dirty_limits(long *pbackground, long *pdirty,
-					struct address_space *mapping)
+void
+get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty,
+		 struct backing_dev_info *bdi)
 {
 	int background_ratio;		/* Percentages */
 	int dirty_ratio;
@@ -193,6 +327,23 @@ get_dirty_limits(long *pbackground, long
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+
+	if (bdi) {
+		long long bdi_dirty = dirty;
+		long numerator, denominator;
+
+		/*
+		 * Calculate this BDI's share of the dirty ratio.
+		 */
+		bdi_writeout_fraction(bdi, &numerator, &denominator);
+
+		bdi_dirty *= numerator;
+		do_div(bdi_dirty, denominator);
+
+		*pbdi_dirty = bdi_dirty;
+		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
+		task_dirty_limit(current, pbdi_dirty);
+	}
 }
 
 /*
@@ -204,9 +355,11 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long nr_reclaimable;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
+	long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
 
@@ -221,15 +374,15 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, mapping);
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
-			dirty_thresh)
+		get_dirty_limits(&background_thresh, &dirty_thresh,
+				&bdi_thresh, bdi);
+		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 				break;
 
-		if (!dirty_exceeded)
-			dirty_exceeded = 1;
+		if (!bdi->dirty_exceeded)
+			bdi->dirty_exceeded = 1;
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
@@ -237,16 +390,37 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (bdi_nr_reclaimable) {
 			writeback_inodes(&wbc);
-			get_dirty_limits(&background_thresh,
-					 	&dirty_thresh, mapping);
-			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-			if (nr_reclaimable +
-				global_page_state(NR_WRITEBACK)
-					<= dirty_thresh)
-						break;
+
+			get_dirty_limits(&background_thresh, &dirty_thresh,
+				       &bdi_thresh, bdi);
+
+			/*
+			 * In order to avoid the stacked BDI deadlock we need
+			 * to ensure we accurately count the 'dirty' pages when
+			 * the threshold is low.
+			 *
+			 * Otherwise it would be possible to get thresh+n pages
+			 * reported dirty, even though there are thresh-m pages
+			 * actually dirty; with m+n sitting in the percpu
+			 * deltas.
+			 */
+			if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+				bdi_nr_reclaimable =
+					bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat_sum(bdi, BDI_WRITEBACK);
+			} else {
+				bdi_nr_reclaimable =
+					bdi_stat(bdi, BDI_RECLAIMABLE);
+				bdi_nr_writeback =
+					bdi_stat(bdi, BDI_WRITEBACK);
+			}
+
+			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+				break;
+
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
@@ -254,9 +428,9 @@ static void balance_dirty_pages(struct a
 		congestion_wait(WRITE, HZ/10);
 	}
 
-	if (nr_reclaimable + global_page_state(NR_WRITEBACK)
-		<= dirty_thresh && dirty_exceeded)
-			dirty_exceeded = 0;
+	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
+			bdi->dirty_exceeded)
+		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
 		return;		/* pdflush is already working this queue */
@@ -270,7 +444,9 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	     (!laptop_mode && (nr_reclaimable > background_thresh)))
+			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS)
+					  > background_thresh)))
 		pdflush_operation(background_writeout, 0);
 }
 
@@ -306,7 +482,7 @@ void balance_dirty_pages_ratelimited_nr(
 	unsigned long *p;
 
 	ratelimit = ratelimit_pages;
-	if (dirty_exceeded)
+	if (mapping->backing_dev_info->dirty_exceeded)
 		ratelimit = 8;
 
 	/*
@@ -342,7 +518,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	}
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
@@ -377,7 +553,7 @@ static void background_writeout(unsigned
 		long background_thresh;
 		long dirty_thresh;
 
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);
+		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 		if (global_page_state(NR_FILE_DIRTY) +
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
@@ -582,9 +758,15 @@ static struct notifier_block __cpuinitda
  */
 void __init page_writeback_init(void)
 {
+	int shift;
+
 	mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
+
+	shift = calc_period_shift();
+	prop_descriptor_init(&vm_completions, shift);
+	prop_descriptor_init(&vm_dirties, shift);
 }
 
 /**
@@ -828,6 +1010,8 @@ int __set_page_dirty_nobuffers(struct pa
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
+				__inc_bdi_stat(mapping->backing_dev_info,
+						BDI_RECLAIMABLE);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
 			radix_tree_tag_set(&mapping->page_tree,
@@ -860,7 +1044,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage
  * If the mapping doesn't provide a set_page_dirty a_op, then
  * just fall through and assume that it wants buffer_heads.
  */
-int fastcall set_page_dirty(struct page *page)
+static int __set_page_dirty(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
@@ -878,6 +1062,14 @@ int fastcall set_page_dirty(struct page 
 	}
 	return 0;
 }
+
+int fastcall set_page_dirty(struct page *page)
+{
+	int ret = __set_page_dirty(page);
+	if (ret)
+		task_dirty_inc(current);
+	return ret;
+}
 EXPORT_SYMBOL(set_page_dirty);
 
 /*
@@ -954,6 +1146,8 @@ int clear_page_dirty_for_io(struct page 
 			set_page_dirty(page);
 		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			return 1;
 		}
 		return 0;
@@ -968,14 +1162,20 @@ int test_clear_page_writeback(struct pag
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi)) {
+				__dec_bdi_stat(bdi, BDI_WRITEBACK);
+				__bdi_writeout_inc(bdi);
+			}
+		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -989,14 +1189,18 @@ int test_set_page_writeback(struct page 
 	int ret;
 
 	if (mapping) {
+		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		unsigned long flags;
 
 		write_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
+			if (bdi_cap_writeback_dirty(bdi))
+				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
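
To make the scaling in get_dirty_limits() and task_dirty_limit() concrete,
here is a hedged standalone sketch with invented numbers (userspace C, not
kernel code), mirroring the arithmetic above:

	#include <stdio.h>

	int main(void)
	{
		long dirty = 1000;	/* global dirty limit, in pages */

		/* this BDI completed 3/4 of the recent writeout */
		long num = 3, den = 4;
		long bdi_dirty = dirty * num / den;	/* 750 pages */

		/* task limit: dirty -= (dirty/8) * p_task, floor at dirty/2 */
		long tnum = 1, tden = 2;	/* task did half the dirtying */
		long inv = (bdi_dirty >> 3) * tnum / tden;	/* 46 */
		long task_dirty = bdi_dirty - inv;		/* 704 */
		if (task_dirty < bdi_dirty / 2)
			task_dirty = bdi_dirty / 2;

		printf("bdi share %ld, task limit %ld\n",
				bdi_dirty, task_dirty);
		return 0;
	}

A heavy dirtier thus sees its effective limit shrink towards half of the
BDI's share, while a light dirtier on the same device keeps nearly the full
share and is throttled later.
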
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -72,6 +72,8 @@ void cancel_dirty_page(struct page *page
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			dec_bdi_stat(mapping->backing_dev_info,
+					BDI_RECLAIMABLE);
 			if (account_size)
 				task_io_account_cancelled_write(account_size);
 		}
Index: linux-2.6/lib/proportions.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/proportions.c
@@ -0,0 +1,385 @@
+/*
+ * Floating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * Description:
+ *
+ * The floating proportion is a time derivative with an exponentially decaying
+ * history:
+ *
+ *   p_{j} = \Sum_{i=0} (dx_{j}/dt_{-i}) / 2^(1+i)
+ *
+ * Where j is an element from {prop_local}, x_{j} is j's number of events,
+ * and i the time period over which the differential is taken. So d/dt_{-i} is
+ * the differential over the i-th last period.
+ *
+ * The decaying history gives smooth transitions. The time differential carries
+ * the notion of speed.
+ *
+ * The denominator is 2^(1+i) because we want the series to be normalised, i.e.
+ *
+ *   \Sum_{i=0} 1/2^(1+i) = 1
+ *
+ * Furthermore, if we measure time (t) in the same events as x, so that:
+ *
+ *   t = \Sum_{j} x_{j}
+ *
+ * we get that:
+ *
+ *   \Sum_{j} p_{j} = 1
+ *
+ * Writing this in an iterative fashion we get (dropping the 'd's):
+ *
+ *   if (++x_{j}, ++t > period)
+ *     t /= 2;
+ *     for_each (j)
+ *       x_{j} /= 2;
+ *
+ * so that:
+ *
+ *   p_{j} = x_{j} / t;
+ *
+ * We optimize away the '/= 2' for the global time delta by noting that:
+ *
+ *   if (++t > period) t /= 2:
+ *
+ * Can be approximated by:
+ *
+ *   period/2 + (++t % period/2)
+ *
+ * [ Furthermore, when we choose period to be 2^n it can be written in terms of
+ *   binary operations and wraparound artefacts disappear. ]
+ *
+ * Also note that this yields a natural counter of the elapsed periods:
+ *
+ *   c = t / (period/2)
+ *
+ * [ Its monotonically increasing property can be applied to mitigate the wrap-
+ *   around issue. ]
+ *
+ * This allows us to do away with the loop over all prop_locals on each period
+ * expiration. By remembering the period count under which it was last accessed
+ * as c_{j}, we can obtain the number of 'missed' cycles from:
+ *
+ *   c - c_{j}
+ *
+ * We can then lazily catch up to the global period count every time we are
+ * going to use x_{j}, by doing:
+ *
+ *   x_{j} /= 2^(c - c_{j}), c_{j} = c
+ */
+
+#include <linux/proportions.h>
+#include <linux/rcupdate.h>
+
+/*
+ * Limit the time part in order to ensure there are some bits left for the
+ * cycle counter.
+ */
+#define PROP_MAX_SHIFT (3*BITS_PER_LONG/4)
+
+int prop_descriptor_init(struct prop_descriptor *pd, int shift)
+{
+	int err;
+
+	if (shift > PROP_MAX_SHIFT)
+		shift = PROP_MAX_SHIFT;
+
+	pd->index = 0;
+	pd->pg[0].shift = shift;
+	mutex_init(&pd->mutex);
+	err = percpu_counter_init_irq(&pd->pg[0].events, 0);
+	if (err)
+		goto out;
+
+	err = percpu_counter_init_irq(&pd->pg[1].events, 0);
+	if (err)
+		percpu_counter_destroy(&pd->pg[0].events);
+
+out:
+	return err;
+}
+
+/*
+ * We have two copies, and flip between them to make it seem like an atomic
+ * update. The update is not really atomic wrt the events counter, but
+ * it is internally consistent with the bit layout depending on shift.
+ *
+ * We copy the events count, move the bits around and flip the index.
+ */
+void prop_change_shift(struct prop_descriptor *pd, int shift)
+{
+	int index;
+	int offset;
+	u64 events;
+	unsigned long flags;
+
+	if (shift > PROP_MAX_SHIFT)
+		shift = PROP_MAX_SHIFT;
+
+	mutex_lock(&pd->mutex);
+
+	index = pd->index ^ 1;
+	offset = pd->pg[pd->index].shift - shift;
+	if (!offset)
+		goto out;
+
+	pd->pg[index].shift = shift;
+
+	local_irq_save(flags);
+	events = percpu_counter_sum(&pd->pg[pd->index].events);
+	if (offset < 0)
+		events <<= -offset;
+	else
+		events >>= offset;
+	percpu_counter_set(&pd->pg[index].events, events);
+
+	/*
+	 * ensure the new pg is fully written before the switch
+	 */
+	smp_wmb();
+	pd->index = index;
+	local_irq_restore(flags);
+
+	synchronize_rcu();
+
+out:
+	mutex_unlock(&pd->mutex);
+}
+
+/*
+ * wrap the access to the data in an rcu_read_lock() section;
+ * this is used to track the active references.
+ */
+static struct prop_global *prop_get_global(struct prop_descriptor *pd)
+{
+	int index;
+
+	rcu_read_lock();
+	index = pd->index;
+	/*
+	 * match the wmb in prop_change_shift()
+	 */
+	smp_rmb();
+	return &pd->pg[index];
+}
+
+static void prop_put_global(struct prop_descriptor *pd, struct prop_global *pg)
+{
+	rcu_read_unlock();
+}
+
+static void
+prop_adjust_shift(int *pl_shift, unsigned long *pl_period, int new_shift)
+{
+	int offset = *pl_shift - new_shift;
+
+	if (!offset)
+		return;
+
+	if (offset < 0)
+		*pl_period <<= -offset;
+	else
+		*pl_period >>= offset;
+
+	*pl_shift = new_shift;
+}
+
+/*
+ * PERCPU
+ */
+
+int prop_local_init_percpu(struct prop_local_percpu *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	return percpu_counter_init_irq(&pl->events, 0);
+}
+
+void prop_local_destroy_percpu(struct prop_local_percpu *pl)
+{
+	percpu_counter_destroy(&pl->events);
+}
+
+/*
+ * Catch up with missed period expirations.
+ *
+ *   until (c_{j} == c)
+ *     x_{j} -= x_{j}/2;
+ *     c_{j}++;
+ */
+static
+void prop_norm_percpu(struct prop_global *pg, struct prop_local_percpu *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(&pl->shift, &pl->period, pg->shift);
+	period = 1UL << (pg->shift - 1);
+	/*
+	 * For each missed period, we halve the local counter.
+	 * basically:
+	 *   pl->events >> ((global_period - pl->period) / period);
+	 *
+	 * but since the distributed nature of percpu counters makes division
+	 * rather hard, use a regular subtraction loop. This is safe, because
+	 * the events will only ever be incremented, hence the subtraction
+	 * can never result in a negative number.
+	 */
+	while (pl->period != global_period) {
+		unsigned long val = percpu_counter_read(&pl->events);
+		unsigned long half = (val + 1) >> 1;
+
+		/*
+		 * Half of zero won't be much less, break out.
+		 * This limits the loop to shift iterations, even
+		 * if we missed a million.
+		 */
+		if (!val)
+			break;
+
+		percpu_counter_add(&pl->events, -half);
+		pl->period += period;
+	}
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ *   ++x_{j}, ++t
+ */
+void __prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl)
+{
+	struct prop_global *pg = prop_get_global(pd);
+
+	prop_norm_percpu(pg, pl);
+	percpu_counter_add(&pl->events, 1);
+	percpu_counter_add(&pg->events, 1);
+	prop_put_global(pd, pg);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction_percpu(struct prop_descriptor *pd,
+		struct prop_local_percpu *pl,
+		long *numerator, long *denominator)
+{
+	struct prop_global *pg = prop_get_global(pd);
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm_percpu(pg, pl);
+	*numerator = percpu_counter_read_positive(&pl->events);
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+
+	prop_put_global(pd, pg);
+}
+
+/*
+ * SINGLE
+ */
+
+int prop_local_init_single(struct prop_local_single *pl)
+{
+	spin_lock_init(&pl->lock);
+	pl->shift = 0;
+	pl->period = 0;
+	pl->events = 0;
+	return 0;
+}
+
+void prop_local_destroy_single(struct prop_local_single *pl)
+{
+}
+
+/*
+ * Catch up with missed period expirations.
+ */
+static
+void prop_norm_single(struct prop_global *pg, struct prop_local_single *pl)
+{
+	unsigned long period = 1UL << (pg->shift - 1);
+	unsigned long period_mask = ~(period - 1);
+	unsigned long global_period;
+	unsigned long flags;
+
+	global_period = percpu_counter_read(&pg->events);
+	global_period &= period_mask;
+
+	/*
+	 * Fast path - check if the local and global period count still match
+	 * outside of the lock.
+	 */
+	if (pl->period == global_period)
+		return;
+
+	spin_lock_irqsave(&pl->lock, flags);
+	prop_adjust_shift(&pl->shift, &pl->period, pg->shift);
+	/*
+	 * For each missed period, we halve the local counter.
+	 */
+	period = (global_period - pl->period) >> (pg->shift - 1);
+	if (likely(period < BITS_PER_LONG))
+		pl->events >>= period;
+	else
+		pl->events = 0;
+	pl->period = global_period;
+	spin_unlock_irqrestore(&pl->lock, flags);
+}
+
+/*
+ *   ++x_{j}, ++t
+ */
+void __prop_inc_single(struct prop_descriptor *pd, struct prop_local_single *pl)
+{
+	struct prop_global *pg = prop_get_global(pd);
+
+	prop_norm_single(pg, pl);
+	pl->events++;
+	percpu_counter_add(&pg->events, 1);
+	prop_put_global(pd, pg);
+}
+
+/*
+ * Obtain a fraction of this proportion
+ *
+ *   p_{j} = x_{j} / (period/2 + t % period/2)
+ */
+void prop_fraction_single(struct prop_descriptor *pd,
+	       	struct prop_local_single *pl,
+		long *numerator, long *denominator)
+{
+	struct prop_global *pg = prop_get_global(pd);
+	unsigned long period_2 = 1UL << (pg->shift - 1);
+	unsigned long counter_mask = period_2 - 1;
+	unsigned long global_count;
+
+	prop_norm_single(pg, pl);
+	*numerator = pl->events;
+
+	global_count = percpu_counter_read(&pg->events);
+	*denominator = period_2 + (global_count & counter_mask);
+
+	prop_put_global(pd, pg);
+}
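
A hedged userspace sketch of the lazy catch-up in prop_norm_single() (values
invented; plain longs stand in for the percpu counters):

	#include <stdio.h>

	int main(void)
	{
		int shift = 4;				/* period = 2^4 events */
		unsigned long half = 1UL << (shift - 1);	/* 8 */
		unsigned long x = 32, x_period = 0;	/* one element's state */
		unsigned long t = 24;			/* global event count */

		unsigned long global_period = t & ~(half - 1);	/* 24 */
		unsigned long missed = (global_period - x_period) >> (shift - 1);

		/* halve once per missed half-period, as the comments describe */
		x = missed < 8 * sizeof(x) ? x >> missed : 0;
		x_period = global_period;

		printf("x decayed from 32 to %lu over %lu periods\n", x, missed);
		return 0;
	}

With shift = 4 the half-period is 8 events; 24 elapsed events make 3 missed
half-periods, so x decays from 32 to 32 >> 3 = 4.
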
Index: linux-2.6/include/linux/proportions.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/proportions.h
@@ -0,0 +1,119 @@
+/*
+ * Floating proportions
+ *
+ *  Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <pzijlstr@redhat.com>
+ *
+ * This file contains the public data structure and API definitions.
+ */
+
+#ifndef _LINUX_PROPORTIONS_H
+#define _LINUX_PROPORTIONS_H
+
+#include <linux/percpu_counter.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+
+struct prop_global {
+	/*
+	 * The period over which we differentiate
+	 *
+	 *   period = 2^shift
+	 */
+	int shift;
+	/*
+	 * The total event counter aka 'time'.
+	 *
+	 * Treated as an unsigned long; the lower 'shift - 1' bits are the
+	 * counter bits, the remaining upper bits the period counter.
+	 */
+	struct percpu_counter events;
+};
+
+/*
+ * global proportion descriptor
+ *
+ * this is needed to consistently flip prop_global structures.
+ */
+struct prop_descriptor {
+	int index;
+	struct prop_global pg[2];
+	struct mutex mutex;		/* serialize the prop_global switch */
+};
+
+int prop_descriptor_init(struct prop_descriptor *pd, int shift);
+void prop_change_shift(struct prop_descriptor *pd, int new_shift);
+
+/*
+ * ----- PERCPU ------
+ */
+
+struct prop_local_percpu {
+	/*
+	 * the local events counter
+	 */
+	struct percpu_counter events;
+
+	/*
+	 * snapshot of the last seen global state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+int prop_local_init_percpu(struct prop_local_percpu *pl);
+void prop_local_destroy_percpu(struct prop_local_percpu *pl);
+void __prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl);
+void prop_fraction_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl,
+		long *numerator, long *denominator);
+
+static inline
+void prop_inc_percpu(struct prop_descriptor *pd, struct prop_local_percpu *pl)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__prop_inc_percpu(pd, pl);
+	local_irq_restore(flags);
+}
+
+/*
+ * ----- SINGLE ------
+ */
+
+struct prop_local_single {
+	/*
+	 * the local events counter
+	 */
+	unsigned long events;
+
+	/*
+	 * snapshot of the last seen global state
+	 * and a lock protecting this state
+	 */
+	int shift;
+	unsigned long period;
+	spinlock_t lock;		/* protect the snapshot state */
+};
+
+#define INIT_PROP_LOCAL_SINGLE(name)			\
+{	.lock = __SPIN_LOCK_UNLOCKED(name.lock),	\
+}
+
+int prop_local_init_single(struct prop_local_single *pl);
+void prop_local_destroy_single(struct prop_local_single *pl);
+void __prop_inc_single(struct prop_descriptor *pd, struct prop_local_single *pl);
+void prop_fraction_single(struct prop_descriptor *pd, struct prop_local_single *pl,
+		long *numerator, long *denominator);
+
+static inline
+void prop_inc_single(struct prop_descriptor *pd, struct prop_local_single *pl)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__prop_inc_single(pd, pl);
+	local_irq_restore(flags);
+}
+
+#endif /* _LINUX_PROPORTIONS_H */
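
How a subsystem would wire this API up, in a minimal hedged sketch modelled
on what mm/page-writeback.c does for vm_completions (demo_prop and demo_local
are invented names):

	static struct prop_descriptor demo_prop;
	static struct prop_local_percpu demo_local;	/* one per tracked object */

	static int demo_setup(void)
	{
		int err = prop_descriptor_init(&demo_prop, 10);	/* 2^10 period */
		if (err)
			return err;
		return prop_local_init_percpu(&demo_local);
	}

	static void demo_event(void)
	{
		/* one event for this object, and one tick of 'time' */
		prop_inc_percpu(&demo_prop, &demo_local);
	}

	static void demo_share(long *num, long *den)
	{
		/* this object's recent share of all events, num/den <= 1 */
		prop_fraction_percpu(&demo_prop, &demo_local, num, den);
	}
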
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -5,7 +5,8 @@
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 rbtree.o radix-tree.o dump_stack.o \
 	 idr.o int_sqrt.o bitmap.o extable.o prio_tree.o \
-	 sha1.o irq_regs.o reciprocal_div.o
+	 sha1.o irq_regs.o reciprocal_div.o \
+	 proportions.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -160,6 +160,10 @@ extern ctl_table inotify_table[];
 int sysctl_legacy_va_layout;
 #endif
 
+extern int dirty_ratio_handler(ctl_table *table, int write,
+		struct file *filp, void __user *buffer, size_t *lenp,
+		loff_t *ppos);
+
 
 /* The default sysctl tables: */
 
@@ -675,7 +679,7 @@ static ctl_table vm_table[] = {
 		.data		= &vm_dirty_ratio,
 		.maxlen		= sizeof(vm_dirty_ratio),
 		.mode		= 0644,
-		.proc_handler	= &proc_dointvec_minmax,
+		.proc_handler	= &dirty_ratio_handler,
 		.strategy	= &sysctl_intvec,
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
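
The handler body is added elsewhere in this series (in
mm/page_writeback.c); the point of the hook is that a change of
vm_dirty_ratio must also rescale the proportion period. A hedged sketch
of that idea -- update_completion_period() is an assumed helper name,
not quoted from the patch:

	int dirty_ratio_handler(ctl_table *table, int write, struct file *filp,
			void __user *buffer, size_t *lenp, loff_t *ppos)
	{
		int old_ratio = vm_dirty_ratio;
		int ret;

		/* keep the usual 0..100 min/max clamping behaviour */
		ret = proc_dointvec_minmax(table, write, filp, buffer,
				lenp, ppos);
		if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
			/* the limit changed: pick a matching period */
			update_completion_period();
		}
		return ret;
	}
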
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -83,6 +83,7 @@ struct sched_param {
 #include <linux/timer.h>
 #include <linux/hrtimer.h>
 #include <linux/task_io_accounting.h>
+#include <linux/proportions.h>
 
 #include <asm/processor.h>
 
@@ -1076,6 +1077,7 @@ struct task_struct {
 #ifdef CONFIG_FAULT_INJECTION
 	int make_it_fail;
 #endif
+	struct prop_local_single dirties;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c
+++ linux-2.6/kernel/fork.c
@@ -106,6 +106,7 @@ static struct kmem_cache *mm_cachep;
 
 void free_task(struct task_struct *tsk)
 {
+	prop_local_destroy_single(&tsk->dirties);
 	free_thread_info(tsk->stack);
 	rt_mutex_debug_task_free(tsk);
 	free_task_struct(tsk);
@@ -162,6 +163,7 @@ static struct task_struct *dup_task_stru
 {
 	struct task_struct *tsk;
 	struct thread_info *ti;
+	int err;
 
 	prepare_to_copy(orig);
 
@@ -177,6 +179,14 @@ static struct task_struct *dup_task_stru
 
 	*tsk = *orig;
 	tsk->stack = ti;
+
+	err = prop_local_init_single(&tsk->dirties);
+	if (err) {
+		free_thread_info(ti);
+		free_task_struct(tsk);
+		return NULL;
+	}
+
 	setup_thread_stack(tsk, orig);
 
 #ifdef CONFIG_CC_STACKPROTECTOR
Index: linux-2.6/include/linux/init_task.h
===================================================================
--- linux-2.6.orig/include/linux/init_task.h
+++ linux-2.6/include/linux/init_task.h
@@ -167,6 +167,7 @@ extern struct group_info init_groups;
 		[PIDTYPE_PGID] = INIT_PID_LINK(PIDTYPE_PGID),		\
 		[PIDTYPE_SID]  = INIT_PID_LINK(PIDTYPE_SID),		\
 	},								\
+	.dirties = INIT_PROP_LOCAL_SINGLE(dirties),			\
 	INIT_TRACE_IRQFLAGS						\
 	INIT_LOCKDEP							\
 }
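
Tying the pieces together: every task now carries a SINGLE-flavour local
counter (tsk->dirties, initialised in dup_task_struct() above), and the
writeback path can scale a task's dirty limit by its recent share of
dirtying. A sketch under assumed names -- vm_dirties, task_dirty_inc()
and task_dirty_limit() are illustrations, not quotes from the series:

	static struct prop_descriptor vm_dirties;

	/* called whenever 'tsk' dirties a page */
	void task_dirty_inc(struct task_struct *tsk)
	{
		prop_inc_single(&vm_dirties, &tsk->dirties);
	}

	/*
	 * scale the limit down for tasks that dirtied a lot recently:
	 * a task owning the full recent share ends up at roughly half
	 * the limit, an idle task keeps the full limit
	 */
	unsigned long task_dirty_limit(struct task_struct *tsk,
				       unsigned long dirty)
	{
		long num, den;

		prop_fraction_single(&vm_dirties, &tsk->dirties, &num, &den);
		return dirty - (dirty / 2) * num / den;
	}
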

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 00/23] per device dirty throttling -v9
  2007-08-16 12:55 ` Peter Zijlstra
  2007-08-16 13:21   ` Martin Knoblauch
@ 2007-08-23 15:59   ` Martin Knoblauch
  2007-08-23 17:41     ` Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Martin Knoblauch @ 2007-08-23 15:59 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel


--- Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> 
> > Peter,
> > 
> >  any chance to get a rollup against 2.6.22-stable?
> > 
> >  The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
> 
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.
> 
Hi Peter,

 any progress on a version against 2.6.22.5? I have seen the very
positive report from Jeffrey W. Baker and would really love to test
your patch. But as I said, anything newer than 2.6.22.x might not be an
option due to the NFS changes.

Kind regards
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 00/23] per device dirty throttling -v9
  2007-08-16 12:55 ` Peter Zijlstra
@ 2007-08-16 13:21   ` Martin Knoblauch
  2007-08-23 15:59   ` Martin Knoblauch
  1 sibling, 0 replies; 43+ messages in thread
From: Martin Knoblauch @ 2007-08-16 13:21 UTC (permalink / raw)
  To: Peter Zijlstra, spamtrap; +Cc: linux-kernel


--- Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:
> 
> > Peter,
> > 
> >  any chance to get a rollup against 2.6.22-stable?
> > 
> >  The 2.6.23 series may not be usable for me due to the
> > nosharedcache changes for NFS (the new default will massively
> > disturb the user-space automounter).
> 
> I'll see what I can do, bit busy with other stuff atm, hopefully
> after the weekend.
> 
Hi Peter,

 that would be highly appreciated. Thanks a lot in advance.

Martin


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 00/23] per device dirty throttling -v9
  2007-08-16 12:49 Martin Knoblauch
@ 2007-08-16 12:55 ` Peter Zijlstra
  2007-08-16 13:21   ` Martin Knoblauch
  2007-08-23 15:59   ` Martin Knoblauch
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2007-08-16 12:55 UTC (permalink / raw)
  To: spamtrap; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 385 bytes --]

On Thu, 2007-08-16 at 05:49 -0700, Martin Knoblauch wrote:

> Peter,
> 
>  any chance to get a rollup against 2.6.22-stable?
> 
>  The 2.6.23 series may not be usable for me due to the
> nosharedcache changes for NFS (the new default will massively
> disturb the user-space automounter).

I'll see what I can do, bit busy with other stuff atm, hopefully after
the weekend.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [PATCH 00/23] per device dirty throttling -v9
@ 2007-08-16 12:49 Martin Knoblauch
  2007-08-16 12:55 ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Martin Knoblauch @ 2007-08-16 12:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra

>Per device dirty throttling patches
>
>These patches aim to improve balance_dirty_pages() and directly
>address three issues:
>1) inter device starvation
>2) stacked device deadlocks
>3) inter process starvation
>
>1 and 2 are a direct result of removing the global dirty
>limit and using per device dirty limits. By giving each device
>its own dirty limit, it will no longer starve another device,
>and the cyclic dependency on the dirty limit is broken.
>
>In order to efficiently distribute the dirty limit across
>the independent devices, a floating proportion is used; this
>will allocate a share of the total limit proportional to the
>device's recent activity.
>
>3 is done by also scaling the dirty limit proportional to the
>current task's recent dirty rate.
>
>Changes since -v8:
>- cleanup of the proportion code
>- fix percpu_counter_add(&counter, -(unsigned long))
>- fix per task dirty rate code
>- fwd port to .23-rc2-mm2

Peter,

 any chance to get a rollup against 2.6.22-stable?

 The 2.6.23 series may not be usable for me due to the
nosharedcache changes for NFS (the new default will massively
disturb the user-space automounter).

Cheers
Martin 


------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2007-08-24 10:47 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-16  7:45 [PATCH 00/23] per device dirty throttling -v9 Peter Zijlstra
2007-08-16  7:45 ` [PATCH 01/23] nfs: remove congestion_end() Peter Zijlstra
2007-08-16  7:45 ` [PATCH 02/23] lib: percpu_counter_add Peter Zijlstra
2007-08-17 15:48   ` Josef Sipek
2007-08-16  7:45 ` [PATCH 03/23] lib: percpu_counter_sub Peter Zijlstra
2007-08-16  7:45 ` [PATCH 04/23] lib: percpu_counter variable batch Peter Zijlstra
2007-08-16  7:45 ` [PATCH 05/23] lib: make percpu_counter_add take s64 Peter Zijlstra
2007-08-16  7:45 ` [PATCH 06/23] lib: percpu_counter_set Peter Zijlstra
2007-08-16  7:45 ` [PATCH 07/23] lib: percpu_counter_sum_positive Peter Zijlstra
2007-08-16  7:45 ` [PATCH 08/23] lib: percpu_count_sum() Peter Zijlstra
2007-08-16  7:45 ` [PATCH 09/23] lib: percpu_counter_init error handling Peter Zijlstra
2007-08-17 15:56   ` Josef Sipek
2007-08-17 16:03     ` Peter Zijlstra
2007-08-18  8:09     ` Peter Zijlstra
2007-08-23 18:24       ` Josef Sipek
2007-08-16  7:45 ` [PATCH 10/23] lib: percpu_counter_init_irq Peter Zijlstra
2007-08-16  7:45 ` [PATCH 11/23] mm: bdi init hooks Peter Zijlstra
2007-08-17 16:10   ` Josef Sipek
2007-08-17 16:15     ` Peter Zijlstra
2007-08-16  7:45 ` [PATCH 12/23] containers: " Peter Zijlstra
2007-08-16  7:45 ` [PATCH 13/23] mtd: " Peter Zijlstra
2007-08-16  7:45 ` [PATCH 14/23] mtd: clean up the backing_dev_info usage Peter Zijlstra
2007-08-16  7:45 ` [PATCH 15/23] mtd: give mtdconcat devices their own backing_dev_info Peter Zijlstra
2007-08-16  7:45 ` [PATCH 16/23] mm: scalable bdi statistics counters Peter Zijlstra
2007-08-17 16:20   ` Josef Sipek
2007-08-17 16:23     ` Peter Zijlstra
2007-08-16  7:45 ` [PATCH 17/23] mm: count reclaimable pages per BDI Peter Zijlstra
2007-08-17 16:23   ` Josef Sipek
2007-08-16  7:45 ` [PATCH 18/23] mm: count writeback " Peter Zijlstra
2007-08-16  7:45 ` [PATCH 19/23] mm: expose BDI statistics in sysfs Peter Zijlstra
2007-08-16  7:45 ` [PATCH 20/23] lib: floating proportions Peter Zijlstra
2007-08-16  7:45 ` [PATCH 21/23] mm: per device dirty threshold Peter Zijlstra
2007-08-16  7:45 ` [PATCH 22/23] mm: dirty balancing for tasks Peter Zijlstra
2007-08-16  7:45 ` [PATCH 23/23] debug: sysfs files for the current ratio/size/total Peter Zijlstra
2007-08-16 21:29 ` [PATCH 00/23] per device dirty throttling -v9 Christoph Lameter
2007-08-17  7:19   ` Peter Zijlstra
2007-08-17 20:37     ` Christoph Lameter
2007-08-16 12:49 Martin Knoblauch
2007-08-16 12:55 ` Peter Zijlstra
2007-08-16 13:21   ` Martin Knoblauch
2007-08-23 15:59   ` Martin Knoblauch
2007-08-23 17:41     ` Peter Zijlstra
2007-08-24 10:47       ` Martin Knoblauch
