linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] Per superblock shrinkers V2
@ 2010-05-25  8:53 Dave Chinner
  2010-05-25  8:53 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs


This series reworks the filesystem shrinkers. The current
implementation has a number of issues:

        1. There is a dependency between dentry and inode cache
           shrinking that is only implicitly defined by the order of
           shrinker registration.
        2. The shrinkers need to walk the superblock list and pin
           the superblock to avoid unmount races with the sb going
           away.
        3. The dentry cache uses per-superblock LRUs and proportions
           reclaim between all the superblocks, which means we are
           doing breadth-based reclaim. We touch every superblock on
           every shrinker call, and may reclaim only a single dentry
           at a time from a given superblock.
        4. The inode cache has a global LRU, so it has different
           reclaim patterns to the dentry cache, despite the fact
           that the dentry cache is generally the only thing that
           pins inodes in memory.
        5. Filesystems need to register their own shrinkers for
           caches and can't co-ordinate them with the dentry and
           inode cache shrinkers.

The series starts by converting the inode cache to per-superblock
LRUs and changes the shrinker to match the dentry cache (#4).

It then adds a context to the shrinker callouts by passing the
shrinker structure with the callout. With this, a shrinker structure
is added to the superblock structure and a per-superblock shrinker
is registered.  Both the inode and dentry caches are modified to be
shrunk via the superblock shrinker, and this directly encodes the
dcache/icache dependency inside the shrinker (#1).
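
As a rough sketch of the resulting pattern (the structure and helper
names below are illustrative only; prune_super() in patch 3 is the real
implementation), a cache owner embeds the shrinker in its own context
and recovers it in the callback via container_of():

struct my_cache {
	struct list_head	lru;
	int			nr_unused;
	struct shrinker		shrinker;	/* embedded per-cache context */
};

static int my_cache_shrink(struct shrinker *shrink, int nr_to_scan,
			   gfp_t gfp_mask)
{
	struct my_cache *cache;

	cache = container_of(shrink, struct my_cache, shrinker);
	if (!(gfp_mask & __GFP_FS))
		return -1;	/* don't recurse into filesystem code */
	if (nr_to_scan)
		my_cache_prune(cache, nr_to_scan);	/* hypothetical helper */
	return cache->nr_unused;
}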

This shrinker structure also avoids the need to pin the superblock
inside the shrinker because the shrinker is unregistered before the
superblock is freed (#2). Further, it pushes the proportioning of
reclaim between superblocks back up into the shrinker and batches
all the reclaim from a superblock into a tight call loop until the
shrink cycle for that superblock is complete. This effectively
converts reclaim to a depth-based reclaim mechanism which has a
smaller CPU cache footprint than the current mechanism (#3).

Then a pair of superblock operations that can be used to implement
filesystem specific cache reclaim is added. This is split into two
operations so that we don't need to overload the number of objects to
scan to indicate that a count should be returned.
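
For example, a filesystem would wire the two callouts up roughly as
follows (a sketch only, using hypothetical foofs_* helpers; the XFS
conversion in patch 5 is the real example):

static int foofs_nr_cached_objects(struct super_block *sb)
{
	/* cheap count of freeable fs-private objects */
	return foofs_count_reclaimable(FOOFS_SB(sb));
}

static int foofs_free_cached_objects(struct super_block *sb, int nr_to_scan)
{
	/* try to free nr_to_scan objects, return how many were attempted */
	return foofs_reclaim(FOOFS_SB(sb), nr_to_scan);
}

static const struct super_operations foofs_super_operations = {
	/* other methods omitted */
	.nr_cached_objects	= foofs_nr_cached_objects,
	.free_cached_objects	= foofs_free_cached_objects,
};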

Finally, the XFS inode cache shrinker is converted to use these
superblock operations, removing the need to register a shrinker,
keep a global list of XFS filesystems, or take locks to access the
per-filesystem caches. This fixes several new lockdep warnings the
XFS shrinker introduces because of the different contexts the
shrinker is called in, and allows reclaim to be correctly
proportioned between the dentry, inode and XFS inode caches on the
filesystem (#5).

Version 2:
- rebase on new superblock iterator code in 2.6.35
- move sb shrinker registration to sget() and unregister to
  deactivate_super() to avoid unregister_shrinker() being called
  inside a spinlock.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
@ 2010-05-25  8:53 ` Dave Chinner
  2010-05-26 16:17   ` Nick Piggin
  2010-05-27 20:32   ` Andrew Morton
  2010-05-25  8:53 ` [PATCH 2/5] mm: add context argument to shrinker callback Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.

To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.

The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c         |    2 +-
 fs/inode.c                |   87 +++++++++++++++++++++++++++++++++++++++-----
 fs/super.c                |    1 +
 include/linux/fs.h        |    4 ++
 include/linux/writeback.h |    1 -
 5 files changed, 83 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5c4161f..b1e76ef 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -565,7 +565,7 @@ select_queue:
 			/*
 			 * The inode is clean, unused
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_list, &inode->i_sb->s_inode_lru);
 		}
 	}
 	inode_sync_complete(inode);
diff --git a/fs/inode.c b/fs/inode.c
index 2bee20a..3caa758 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,6 +25,7 @@
 #include <linux/mount.h>
 #include <linux/async.h>
 #include <linux/posix_acl.h>
+#include "internal.h"
 
 /*
  * This is needed for the following functions:
@@ -74,7 +75,6 @@ static unsigned int i_hash_shift __read_mostly;
  */
 
 LIST_HEAD(inode_in_use);
-LIST_HEAD(inode_unused);
 static struct hlist_head *inode_hashtable __read_mostly;
 
 /*
@@ -292,6 +292,7 @@ void __iget(struct inode *inode)
 	if (!(inode->i_state & (I_DIRTY|I_SYNC)))
 		list_move(&inode->i_list, &inode_in_use);
 	inodes_stat.nr_unused--;
+	inode->i_sb->s_nr_inodes_unused--;
 }
 
 /**
@@ -386,6 +387,7 @@ static int invalidate_list(struct list_head *head, struct list_head *dispose)
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
 			list_move(&inode->i_list, dispose);
+			inode->i_sb->s_nr_inodes_unused--;
 			WARN_ON(inode->i_state & I_NEW);
 			inode->i_state |= I_FREEING;
 			count++;
@@ -444,32 +446,31 @@ static int can_unuse(struct inode *inode)
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
- * the front of the inode_unused list.  So look for it there and if the
+ * the front of the sb->s_inode_lru list.  So look for it there and if the
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
  * If the inode has metadata buffers attached to mapping->private_list then
  * try to remove them.
  */
-static void prune_icache(int nr_to_scan)
+static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
 	int nr_scanned;
 	unsigned long reap = 0;
 
-	down_read(&iprune_sem);
 	spin_lock(&inode_lock);
-	for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
+	for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
 		struct inode *inode;
 
-		if (list_empty(&inode_unused))
+		if (list_empty(&sb->s_inode_lru))
 			break;
 
-		inode = list_entry(inode_unused.prev, struct inode, i_list);
+		inode = list_entry(sb->s_inode_lru.prev, struct inode, i_list);
 
 		if (inode->i_state || atomic_read(&inode->i_count)) {
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_list, &sb->s_inode_lru);
 			continue;
 		}
 		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
@@ -481,7 +482,7 @@ static void prune_icache(int nr_to_scan)
 			iput(inode);
 			spin_lock(&inode_lock);
 
-			if (inode != list_entry(inode_unused.next,
+			if (inode != list_entry(sb->s_inode_lru.next,
 						struct inode, i_list))
 				continue;	/* wrong inode or list_empty */
 			if (!can_unuse(inode))
@@ -493,13 +494,77 @@ static void prune_icache(int nr_to_scan)
 		nr_pruned++;
 	}
 	inodes_stat.nr_unused -= nr_pruned;
+	sb->s_nr_inodes_unused -= nr_pruned;
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_INODESTEAL, reap);
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
+	*nr_to_scan = nr_scanned;
 
 	dispose_list(&freeable);
+}
+
+static void prune_icache(int count)
+{
+	struct super_block *sb, *n;
+	int w_count;
+	int unused = inodes_stat.nr_unused;
+	int prune_ratio;
+	int pruned;
+
+	if (unused == 0 || count == 0)
+		return;
+	down_read(&iprune_sem);
+	if (count >= unused)
+		prune_ratio = 1;
+	else
+		prune_ratio = unused / count;
+	spin_lock(&sb_lock);
+	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
+		if (list_empty(&sb->s_instances))
+			continue;
+		if (sb->s_nr_inodes_unused == 0)
+			continue;
+		sb->s_count++;
+		/* Now, we reclaim unused dentrins with fairness.
+		 * We reclaim them same percentage from each superblock.
+		 * We calculate number of dentries to scan on this sb
+		 * as follows, but the implementation is arranged to avoid
+		 * overflows:
+		 * number of dentries to scan on this sb =
+		 * count * (number of dentries on this sb /
+		 * number of dentries in the machine)
+		 */
+		spin_unlock(&sb_lock);
+		if (prune_ratio != 1)
+			w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
+		else
+			w_count = sb->s_nr_inodes_unused;
+		pruned = w_count;
+		/*
+		 * We need to be sure this filesystem isn't being unmounted,
+		 * otherwise we could race with generic_shutdown_super(), and
+		 * end up holding a reference to an inode while the filesystem
+		 * is unmounted.  So we try to get s_umount, and make sure
+		 * s_root isn't NULL.
+		 */
+		if (down_read_trylock(&sb->s_umount)) {
+			if ((sb->s_root != NULL) &&
+			    (!list_empty(&sb->s_inode_lru))) {
+				shrink_icache_sb(sb, &w_count);
+				pruned -= w_count;
+			}
+			up_read(&sb->s_umount);
+		}
+		spin_lock(&sb_lock);
+		count -= pruned;
+		__put_super(sb);
+		/* more work left to do? */
+		if (count <= 0)
+			break;
+	}
+	spin_unlock(&sb_lock);
 	up_read(&iprune_sem);
 }
 
@@ -1238,8 +1303,9 @@ int generic_detach_inode(struct inode *inode)
 
 	if (!hlist_unhashed(&inode->i_hash)) {
 		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
-			list_move(&inode->i_list, &inode_unused);
+			list_move(&inode->i_list, &sb->s_inode_lru);
 		inodes_stat.nr_unused++;
+		sb->s_nr_inodes_unused++;
 		if (sb->s_flags & MS_ACTIVE) {
 			spin_unlock(&inode_lock);
 			return 0;
@@ -1252,6 +1318,7 @@ int generic_detach_inode(struct inode *inode)
 		WARN_ON(inode->i_state & I_NEW);
 		inode->i_state &= ~I_WILL_FREE;
 		inodes_stat.nr_unused--;
+		sb->s_nr_inodes_unused--;
 		hlist_del_init(&inode->i_hash);
 	}
 	list_del_init(&inode->i_list);
diff --git a/fs/super.c b/fs/super.c
index 69688b1..c554c53 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -60,6 +60,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_HLIST_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
 		INIT_LIST_HEAD(&s->s_dentry_lru);
+		INIT_LIST_HEAD(&s->s_inode_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b336cb9..7b90c43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1346,6 +1346,10 @@ struct super_block {
 	struct list_head	s_dentry_lru;	/* unused dentry lru */
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
+	/* s_inode_lru and s_nr_inodes_unused are protected by inode_lock */
+	struct list_head	s_inode_lru;	/* unused inode lru */
+	int			s_nr_inodes_unused;	/* # of inodes on lru */
+
 	struct block_device	*s_bdev;
 	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index cc97d6c..a74837e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,7 +11,6 @@ struct backing_dev_info;
 
 extern spinlock_t inode_lock;
 extern struct list_head inode_in_use;
-extern struct list_head inode_unused;
 
 /*
  * fs/fs-writeback.c
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/5] mm: add context argument to shrinker callback
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
  2010-05-25  8:53 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
@ 2010-05-25  8:53 ` Dave Chinner
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 arch/x86/kvm/mmu.c              |    2 +-
 drivers/gpu/drm/i915/i915_gem.c |    2 +-
 fs/dcache.c                     |    2 +-
 fs/gfs2/glock.c                 |    2 +-
 fs/gfs2/quota.c                 |    2 +-
 fs/gfs2/quota.h                 |    2 +-
 fs/inode.c                      |    2 +-
 fs/mbcache.c                    |    5 +++--
 fs/nfs/dir.c                    |    2 +-
 fs/nfs/internal.h               |    3 ++-
 fs/quota/dquot.c                |    2 +-
 fs/ubifs/shrinker.c             |    2 +-
 fs/ubifs/ubifs.h                |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c      |    5 +++--
 fs/xfs/linux-2.6/xfs_sync.c     |    1 +
 fs/xfs/quota/xfs_qm.c           |    7 +++++--
 include/linux/mm.h              |    2 +-
 mm/vmscan.c                     |    8 +++++---
 18 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 81563e7..ac3d107 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2919,7 +2919,7 @@ static int kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm)
 	return kvm_mmu_zap_page(kvm, page) + 1;
 }
 
-static int mmu_shrink(int nr_to_scan, gfp_t gfp_mask)
+static int mmu_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	struct kvm *kvm;
 	struct kvm *kvm_freed = NULL;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 112699f..6cd2e7a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5216,7 +5216,7 @@ i915_gpu_is_active(struct drm_device *dev)
 }
 
 static int
-i915_gem_shrink(int nr_to_scan, gfp_t gfp_mask)
+i915_gem_shrink(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	drm_i915_private_t *dev_priv, *next_dev;
 	struct drm_i915_gem_object *obj_priv, *next_obj;
diff --git a/fs/dcache.c b/fs/dcache.c
index d96047b..dba6b6d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -894,7 +894,7 @@ EXPORT_SYMBOL(shrink_dcache_parent);
  *
  * In this case we return -1 to tell the caller that we baled.
  */
-static int shrink_dcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	if (nr) {
 		if (!(gfp_mask & __GFP_FS))
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index ddcdbf4..04b540c 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1348,7 +1348,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
 }
 
 
-static int gfs2_shrink_glock_memory(int nr, gfp_t gfp_mask)
+static int gfs2_shrink_glock_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	struct gfs2_glock *gl;
 	int may_demote;
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 49667d6..4ea548f 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -77,7 +77,7 @@ static LIST_HEAD(qd_lru_list);
 static atomic_t qd_lru_count = ATOMIC_INIT(0);
 static DEFINE_SPINLOCK(qd_lru_lock);
 
-int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask)
+int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	struct gfs2_quota_data *qd;
 	struct gfs2_sbd *sdp;
diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index 195f60c..e7d236c 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -51,7 +51,7 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
 	return ret;
 }
 
-extern int gfs2_shrink_qd_memory(int nr, gfp_t gfp_mask);
+extern int gfs2_shrink_qd_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask);
 extern const struct quotactl_ops gfs2_quotactl_ops;
 
 #endif /* __QUOTA_DOT_H__ */
diff --git a/fs/inode.c b/fs/inode.c
index 3caa758..1e44ec5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -577,7 +577,7 @@ static void prune_icache(int count)
  * This function is passed the number of inodes to scan, and it returns the
  * total number of remaining possibly-reclaimable inodes.
  */
-static int shrink_icache_memory(int nr, gfp_t gfp_mask)
+static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	if (nr) {
 		/*
diff --git a/fs/mbcache.c b/fs/mbcache.c
index ec88ff3..e28f21b 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -115,7 +115,7 @@ mb_cache_indexes(struct mb_cache *cache)
  * What the mbcache registers as to get shrunk dynamically.
  */
 
-static int mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask);
+static int mb_cache_shrink_fn(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask);
 
 static struct shrinker mb_cache_shrinker = {
 	.shrink = mb_cache_shrink_fn,
@@ -191,13 +191,14 @@ forget:
  * This function is called by the kernel memory management when memory
  * gets low.
  *
+ * @shrink: (ignored)
  * @nr_to_scan: Number of objects to scan
  * @gfp_mask: (ignored)
  *
  * Returns the number of objects which are present in the cache.
  */
 static int
-mb_cache_shrink_fn(int nr_to_scan, gfp_t gfp_mask)
+mb_cache_shrink_fn(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	LIST_HEAD(free_list);
 	struct list_head *l, *ltmp;
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ee9a179..3f33bc0 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1708,7 +1708,7 @@ static void nfs_access_free_list(struct list_head *head)
 	}
 }
 
-int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask)
+int nfs_access_cache_shrinker(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	LIST_HEAD(head);
 	struct nfs_inode *nfsi;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index d8bd619..e70f44b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -205,7 +205,8 @@ extern struct rpc_procinfo nfs4_procedures[];
 void nfs_close_context(struct nfs_open_context *ctx, int is_sync);
 
 /* dir.c */
-extern int nfs_access_cache_shrinker(int nr_to_scan, gfp_t gfp_mask);
+extern int nfs_access_cache_shrinker(struct shrinker *shrink,
+					int nr_to_scan, gfp_t gfp_mask);
 
 /* inode.c */
 extern struct workqueue_struct *nfsiod_workqueue;
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 655a4c5..cfd5437 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -697,7 +697,7 @@ static int dqstats_read(unsigned int type)
  * more memory
  */
 
-static int shrink_dqcache_memory(int nr, gfp_t gfp_mask)
+static int shrink_dqcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	if (nr) {
 		spin_lock(&dq_list_lock);
diff --git a/fs/ubifs/shrinker.c b/fs/ubifs/shrinker.c
index 02feb59..0b20111 100644
--- a/fs/ubifs/shrinker.c
+++ b/fs/ubifs/shrinker.c
@@ -277,7 +277,7 @@ static int kick_a_thread(void)
 	return 0;
 }
 
-int ubifs_shrinker(int nr, gfp_t gfp_mask)
+int ubifs_shrinker(struct shrinker *shrink, int nr, gfp_t gfp_mask)
 {
 	int freed, contention = 0;
 	long clean_zn_cnt = atomic_long_read(&ubifs_clean_zn_cnt);
diff --git a/fs/ubifs/ubifs.h b/fs/ubifs/ubifs.h
index bd2542d..5a92345 100644
--- a/fs/ubifs/ubifs.h
+++ b/fs/ubifs/ubifs.h
@@ -1575,7 +1575,7 @@ int ubifs_tnc_start_commit(struct ubifs_info *c, struct ubifs_zbranch *zroot);
 int ubifs_tnc_end_commit(struct ubifs_info *c);
 
 /* shrinker.c */
-int ubifs_shrinker(int nr_to_scan, gfp_t gfp_mask);
+int ubifs_shrinker(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask);
 
 /* commit.c */
 int ubifs_bg_thread(void *info);
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index f01de3c..fe8bf82 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -44,7 +44,7 @@
 
 static kmem_zone_t *xfs_buf_zone;
 STATIC int xfsbufd(void *);
-STATIC int xfsbufd_wakeup(int, gfp_t);
+STATIC int xfsbufd_wakeup(struct shrinker *, int, gfp_t);
 STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
 static struct shrinker xfs_buf_shake = {
 	.shrink = xfsbufd_wakeup,
@@ -339,7 +339,7 @@ _xfs_buf_lookup_pages(
 					__func__, gfp_mask);
 
 			XFS_STATS_INC(xb_page_retries);
-			xfsbufd_wakeup(0, gfp_mask);
+			xfsbufd_wakeup(NULL, 0, gfp_mask);
 			congestion_wait(BLK_RW_ASYNC, HZ/50);
 			goto retry;
 		}
@@ -1753,6 +1753,7 @@ xfs_buf_runall_queues(
 
 STATIC int
 xfsbufd_wakeup(
+	struct shrinker		*shrink,
 	int			priority,
 	gfp_t			mask)
 {
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 3884e20..c881a0c 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -842,6 +842,7 @@ static struct rw_semaphore xfs_mount_list_lock;
 
 static int
 xfs_reclaim_inode_shrink(
+	struct shrinker	*shrink,
 	int		nr_to_scan,
 	gfp_t		gfp_mask)
 {
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 38e7641..b8051aa 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -69,7 +69,7 @@ STATIC void	xfs_qm_list_destroy(xfs_dqlist_t *);
 
 STATIC int	xfs_qm_init_quotainos(xfs_mount_t *);
 STATIC int	xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int	xfs_qm_shake(int, gfp_t);
+STATIC int	xfs_qm_shake(struct shrinker *, int, gfp_t);
 
 static struct shrinker xfs_qm_shaker = {
 	.shrink = xfs_qm_shake,
@@ -2117,7 +2117,10 @@ xfs_qm_shake_freelist(
  */
 /* ARGSUSED */
 STATIC int
-xfs_qm_shake(int nr_to_scan, gfp_t gfp_mask)
+xfs_qm_shake(
+	struct shrinker	*shrink,
+	int		nr_to_scan,
+	gfp_t		gfp_mask)
 {
 	int	ndqused, nfree, n;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb19bb9..3d7eedc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -994,7 +994,7 @@ static inline void sync_mm_rss(struct task_struct *task, struct mm_struct *mm)
  * querying the cache size, so a fastpath for that case is appropriate.
  */
 struct shrinker {
-	int (*shrink)(int nr_to_scan, gfp_t gfp_mask);
+	int (*shrink)(struct shrinker *, int nr_to_scan, gfp_t gfp_mask);
 	int seeks;	/* seeks to recreate an obj */
 
 	/* These are for internal use */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..9d56aaf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -215,8 +215,9 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
 		unsigned long total_scan;
-		unsigned long max_pass = (*shrinker->shrink)(0, gfp_mask);
+		unsigned long max_pass;
 
+		max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask);
 		delta = (4 * scanned) / shrinker->seeks;
 		delta *= max_pass;
 		do_div(delta, lru_pages + 1);
@@ -244,8 +245,9 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 			int shrink_ret;
 			int nr_before;
 
-			nr_before = (*shrinker->shrink)(0, gfp_mask);
-			shrink_ret = (*shrinker->shrink)(this_scan, gfp_mask);
+			nr_before = (*shrinker->shrink)(shrinker, 0, gfp_mask);
+			shrink_ret = (*shrinker->shrink)(shrinker, this_scan,
+								gfp_mask);
 			if (shrink_ret == -1)
 				break;
 			if (shrink_ret < nr_before)
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
  2010-05-25  8:53 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
  2010-05-25  8:53 ` [PATCH 2/5] mm: add context argument to shrinker callback Dave Chinner
@ 2010-05-25  8:53 ` Dave Chinner
  2010-05-26 16:41   ` Nick Piggin
                     ` (2 more replies)
  2010-05-25  8:53 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split reclaim up into per-superblock operations via a coarse proportioning
method that does not batch very well.  The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code, as the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do fewer
superblock list traversals and less locking, and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/dcache.c        |  133 ++++++++--------------------------------------------
 fs/inode.c         |  109 +++---------------------------------------
 fs/super.c         |   53 +++++++++++++++++++++
 include/linux/fs.h |    7 +++
 4 files changed, 88 insertions(+), 214 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index dba6b6d..d7bd781 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
  * which flags are set. This means we don't need to maintain multiple
  * similar copies of this loop.
  */
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
 {
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
 	struct dentry *dentry;
-	int cnt = 0;
 
 	BUG_ON(!sb);
-	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
+	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
 	spin_lock(&dcache_lock);
-	if (count != NULL)
-		/* called from prune_dcache() and shrink_dcache_parent() */
-		cnt = *count;
-restart:
-	if (count == NULL)
+	if (count == -1)
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 	else {
 		while (!list_empty(&sb->s_dentry_lru)) {
@@ -492,13 +487,13 @@ restart:
 			} else {
 				list_move_tail(&dentry->d_lru, &tmp);
 				spin_unlock(&dentry->d_lock);
-				cnt--;
-				if (!cnt)
+				if (--count == 0)
 					break;
 			}
 			cond_resched_lock(&dcache_lock);
 		}
 	}
+prune_more:
 	while (!list_empty(&tmp)) {
 		dentry = list_entry(tmp.prev, struct dentry, d_lru);
 		dentry_lru_del_init(dentry);
@@ -516,88 +511,29 @@ restart:
 		/* dentry->d_lock was dropped in prune_one_dentry() */
 		cond_resched_lock(&dcache_lock);
 	}
-	if (count == NULL && !list_empty(&sb->s_dentry_lru))
-		goto restart;
-	if (count != NULL)
-		*count = cnt;
+	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
+		list_splice_init(&sb->s_dentry_lru, &tmp);
+		goto prune_more;
+	}
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lock);
 }
 
 /**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try to free
+ * prune_dcache_sb - shrink the dcache
+ * @nr_to_scan: number of entries to try to free
  *
- * Shrink the dcache. This is done when we need more memory, or simply when we
- * need to unmount something (at which point we need to unuse all dentries).
+ * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
+ * done when we need more memory and is called from the superblock shrinker
+ * function.
  *
- * This function may fail to free any resources if all the dentries are in use.
+ * This function may fail to free any resources if all the dentries are in
+ * use.
  */
-static void prune_dcache(int count)
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan)
 {
-	struct super_block *sb, *n;
-	int w_count;
-	int unused = dentry_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	spin_lock(&dcache_lock);
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_dentry_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_dentry_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_dentry_lru))) {
-				spin_unlock(&dcache_lock);
-				__shrink_dcache_sb(sb, &w_count,
-						DCACHE_REFERENCED);
-				pruned -= w_count;
-				spin_lock(&dcache_lock);
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		__put_super(sb);
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-	spin_unlock(&dcache_lock);
+	__shrink_dcache_sb(sb, nr_to_scan, DCACHE_REFERENCED);
 }
 
 /**
@@ -610,7 +546,7 @@ static void prune_dcache(int count)
  */
 void shrink_dcache_sb(struct super_block * sb)
 {
-	__shrink_dcache_sb(sb, NULL, 0);
+	__shrink_dcache_sb(sb, -1, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -878,37 +814,10 @@ void shrink_dcache_parent(struct dentry * parent)
 	int found;
 
 	while ((found = select_parent(parent)) != 0)
-		__shrink_dcache_sb(sb, &found, 0);
+		__shrink_dcache_sb(sb, found, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
-/*
- * Scan `nr' dentries and return the number which remain.
- *
- * We need to avoid reentering the filesystem if the caller is performing a
- * GFP_NOFS allocation attempt.  One example deadlock is:
- *
- * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
- * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
- * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
- */
-static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_dcache(nr);
-	}
-	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker dcache_shrinker = {
-	.shrink = shrink_dcache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 /**
  * d_alloc	-	allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -2316,8 +2225,6 @@ static void __init dcache_init(void)
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
-	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff --git a/fs/inode.c b/fs/inode.c
index 1e44ec5..5fb4a39 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,7 +25,6 @@
 #include <linux/mount.h>
 #include <linux/async.h>
 #include <linux/posix_acl.h>
-#include "internal.h"
 
 /*
  * This is needed for the following functions:
@@ -441,8 +440,10 @@ static int can_unuse(struct inode *inode)
 }
 
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
@@ -450,10 +451,10 @@ static int can_unuse(struct inode *inode)
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has metadata buffers attached to mapping->private_list then try
+ * to remove them.
  */
-static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
+void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -461,7 +462,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	unsigned long reap = 0;
 
 	spin_lock(&inode_lock);
-	for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
+	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
 		struct inode *inode;
 
 		if (list_empty(&sb->s_inode_lru))
@@ -500,103 +501,10 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
-	*nr_to_scan = nr_scanned;
 
 	dispose_list(&freeable);
 }
 
-static void prune_icache(int count)
-{
-	struct super_block *sb, *n;
-	int w_count;
-	int unused = inodes_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	down_read(&iprune_sem);
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_inodes_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_inodes_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_inode_lru))) {
-				shrink_icache_sb(sb, &w_count);
-				pruned -= w_count;
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		__put_super(sb);
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-	up_read(&iprune_sem);
-}
-
-/*
- * shrink_icache_memory() will attempt to reclaim some unused inodes.  Here,
- * "unused" means that no dentries are referring to the inodes: the files are
- * not open and the dcache references to those inodes have already been
- * reclaimed.
- *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
- */
-static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		/*
-		 * Nasty deadlock avoidance.  We may hold various FS locks,
-		 * and we don't want to recurse into the FS that called us
-		 * in clear_inode() and friends..
-		 */
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_icache(nr);
-	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker icache_shrinker = {
-	.shrink = shrink_icache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
@@ -1634,7 +1542,6 @@ void __init inode_init(void)
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once);
-	register_shrinker(&icache_shrinker);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff --git a/fs/super.c b/fs/super.c
index c554c53..07e22e3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -37,6 +37,50 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
+{
+	struct super_block *sb;
+	int count;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	/*
+	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
+	 * to recurse into the FS that called us in clear_inode() and friends..
+	 */
+	if (!(gfp_mask & __GFP_FS))
+		return -1;
+
+	/*
+	 * if we can't get the umount lock, then there's no point having the
+	 * shrinker try again because the sb is being torn down.
+	 */
+	if (!down_read_trylock(&sb->s_umount))
+		return -1;
+
+	if (!sb->s_root) {
+		up_read(&sb->s_umount);
+		return -1;
+	}
+
+	if (nr_to_scan) {
+		/* proportion the scan between the two caches */
+		int total;
+
+		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
+		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
+
+		/* prune dcache first as icache is pinned by it */
+		prune_dcache_sb(sb, count);
+		prune_icache_sb(sb, nr_to_scan - count);
+	}
+
+	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
+						* sysctl_vfs_cache_pressure;
+	up_read(&sb->s_umount);
+	return count;
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -99,6 +143,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+
+		/*
+		 * The shrinker is set up here but not registered until after
+		 * the superblock has been filled out successfully.
+		 */
+		s->s_shrink.shrink = prune_super;
+		s->s_shrink.seeks = DEFAULT_SEEKS;
 	}
 out:
 	return s;
@@ -162,6 +213,7 @@ void deactivate_locked_super(struct super_block *s)
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
 		vfs_dq_off(s, 0);
+		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
 		put_filesystem(fs);
 		put_super(s);
@@ -335,6 +387,7 @@ retry:
 	list_add_tail(&s->s_list, &super_blocks);
 	list_add(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
+	register_shrinker(&s->s_shrink);
 	get_filesystem(type);
 	return s;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7b90c43..5bff2dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -382,6 +382,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -1385,8 +1386,14 @@ struct super_block {
 	 * generic_show_options()
 	 */
 	char *s_options;
+
+	struct shrinker s_shrink;	/* per-sb shrinker handle */
 };
 
+/* superblock cache pruning functions */
+void prune_icache_sb(struct super_block *sb, int nr_to_scan);
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 4/5] superblock: add filesystem shrinker operations
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
                   ` (2 preceding siblings ...)
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
@ 2010-05-25  8:53 ` Dave Chinner
  2010-05-27 20:32   ` Andrew Morton
  2010-05-25  8:53 ` [PATCH 5/5] xfs: make use of new shrinker callout Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.

Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects are freeable in the
cache"), two operations will be added. The first will return the
number of objects that are freeable; the second is the actual
shrinker call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/super.c         |   43 +++++++++++++++++++++++++++++++------------
 include/linux/fs.h |   11 +++++++++++
 2 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 07e22e3..9410218 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -40,7 +40,8 @@ DEFINE_SPINLOCK(sb_lock);
 static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 {
 	struct super_block *sb;
-	int count;
+	int	fs_objects = 0;
+	int	total_objects;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
@@ -63,22 +64,40 @@ static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
 		return -1;
 	}
 
-	if (nr_to_scan) {
-		/* proportion the scan between the two caches */
-		int total;
-
-		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
-		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
+	if (sb->s_op && sb->s_op->nr_cached_objects)
+		fs_objects = sb->s_op->nr_cached_objects(sb);
 
-		/* prune dcache first as icache is pinned by it */
-		prune_dcache_sb(sb, count);
-		prune_icache_sb(sb, nr_to_scan - count);
+	total_objects = sb->s_nr_dentry_unused +
+			sb->s_nr_inodes_unused + fs_objects + 1;
+	if (nr_to_scan) {
+		int	dentries;
+		int	inodes;
+
+		/* proportion the scan between the caches */
+		dentries = (nr_to_scan * sb->s_nr_dentry_unused) /
+							total_objects;
+		inodes = (nr_to_scan * sb->s_nr_inodes_unused) /
+							total_objects;
+		if (fs_objects)
+			fs_objects = (nr_to_scan * fs_objects) /
+							total_objects;
+		/*
+		 * prune the dcache first as the icache is pinned by it, then
+		 * prune the icache, followed by the filesystem specific caches
+		 */
+		prune_dcache_sb(sb, dentries);
+		prune_icache_sb(sb, inodes);
+		if (sb->s_op && sb->s_op->free_cached_objects) {
+			sb->s_op->free_cached_objects(sb, fs_objects);
+			fs_objects = sb->s_op->nr_cached_objects(sb);
+		}
+		total_objects = sb->s_nr_dentry_unused +
+				sb->s_nr_inodes_unused + fs_objects;
 	}
 
-	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
-						* sysctl_vfs_cache_pressure;
+	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
 	up_read(&sb->s_umount);
-	return count;
+	return total_objects;
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5bff2dc..efcdcc6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1590,6 +1590,17 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+
+	/*
+	 * memory shrinker operations.
+	 * ->nr_cached_objects() should return the number of freeable cached
+	 * objects the filesystem holds.
+	 * ->free_cached_objects() should attempt to free the number of cached
+	 * objects indicated. It should return how many objects it attempted to
+	 * free.
+	 */
+	int (*nr_cached_objects)(struct super_block *);
+	int (*free_cached_objects)(struct super_block *, int);
 };
 
 /*
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/5] xfs: make use of new shrinker callout
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
                   ` (3 preceding siblings ...)
  2010-05-25  8:53 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
@ 2010-05-25  8:53 ` Dave Chinner
  2010-05-26 16:44 ` [PATCH 0/5] Per superblock shrinkers V2 Nick Piggin
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-25  8:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-fsdevel, linux-mm, xfs

From: Dave Chinner <dchinner@redhat.com>

Convert the inode reclaim shrinker to use the new per-sb shrinker
operations.  This fixes a bunch of lockdep warnings about the
xfs_mount_list_lock being taken in different reclaim contexts by
removing it, and allows the reclaim to be proportioned across
filesystems with no extra code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_super.c   |   23 ++++++--
 fs/xfs/linux-2.6/xfs_sync.c    |  124 +++++++++++-----------------------------
 fs/xfs/linux-2.6/xfs_sync.h    |   16 +++--
 fs/xfs/quota/xfs_qm_syscalls.c |    2 +-
 fs/xfs/xfs_mount.h             |    1 -
 5 files changed, 61 insertions(+), 105 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index f24dbe5..b59886a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1212,7 +1212,6 @@ xfs_fs_put_super(
 
 	xfs_unmountfs(mp);
 	xfs_freesb(mp);
-	xfs_inode_shrinker_unregister(mp);
 	xfs_icsb_destroy_counters(mp);
 	xfs_close_devices(mp);
 	xfs_dmops_put(mp);
@@ -1626,8 +1625,6 @@ xfs_fs_fill_super(
 	if (error)
 		goto fail_vnrele;
 
-	xfs_inode_shrinker_register(mp);
-
 	kfree(mtpt);
 	return 0;
 
@@ -1681,6 +1678,22 @@ xfs_fs_get_sb(
 			   mnt);
 }
 
+static int
+xfs_fs_nr_cached_objects(
+	struct super_block	*sb)
+{
+	return xfs_reclaim_inodes_count(XFS_M(sb));
+}
+
+static int
+xfs_fs_free_cached_objects(
+	struct super_block	*sb,
+	int			nr_to_scan)
+{
+	xfs_reclaim_inodes_nr(XFS_M(sb), 0, nr_to_scan);
+	return 0;
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1694,6 +1707,8 @@ static const struct super_operations xfs_super_operations = {
 	.statfs			= xfs_fs_statfs,
 	.remount_fs		= xfs_fs_remount,
 	.show_options		= xfs_fs_show_options,
+	.nr_cached_objects	= xfs_fs_nr_cached_objects,
+	.free_cached_objects	= xfs_fs_free_cached_objects,
 };
 
 static struct file_system_type xfs_fs_type = {
@@ -1873,7 +1888,6 @@ init_xfs_fs(void)
 		goto out_cleanup_procfs;
 
 	vfs_initquota();
-	xfs_inode_shrinker_init();
 
 	error = register_filesystem(&xfs_fs_type);
 	if (error)
@@ -1901,7 +1915,6 @@ exit_xfs_fs(void)
 {
 	vfs_exitquota();
 	unregister_filesystem(&xfs_fs_type);
-	xfs_inode_shrinker_destroy();
 	xfs_sysctl_unregister();
 	xfs_cleanup_procfs();
 	xfs_buf_terminate();
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index c881a0c..6d74a0d 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -137,7 +137,7 @@ restart:
 
 	} while ((*nr_to_scan)--);
 
-	if (skipped) {
+	if (skipped && *nr_to_scan > 0) {
 		delay(1);
 		goto restart;
 	}
@@ -152,14 +152,14 @@ xfs_inode_ag_iterator(
 	int			flags,
 	int			tag,
 	int			exclusive,
-	int			*nr_to_scan)
+	int			nr_to_scan)
 {
 	int			error = 0;
 	int			last_error = 0;
 	xfs_agnumber_t		ag;
-	int			nr;
 
-	nr = nr_to_scan ? *nr_to_scan : INT_MAX;
+	if (nr_to_scan <= 0)
+		nr_to_scan = INT_MAX;
 	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
 		struct xfs_perag	*pag;
 
@@ -169,18 +169,16 @@ xfs_inode_ag_iterator(
 			continue;
 		}
 		error = xfs_inode_ag_walk(mp, pag, execute, flags, tag,
-						exclusive, &nr);
+						exclusive, &nr_to_scan);
 		xfs_perag_put(pag);
 		if (error) {
 			last_error = error;
 			if (error == EFSCORRUPTED)
 				break;
 		}
-		if (nr <= 0)
+		if (nr_to_scan <= 0)
 			break;
 	}
-	if (nr_to_scan)
-		*nr_to_scan = nr;
 	return XFS_ERROR(last_error);
 }
 
@@ -299,7 +297,7 @@ xfs_sync_data(
 	ASSERT((flags & ~(SYNC_TRYLOCK|SYNC_WAIT)) == 0);
 
 	error = xfs_inode_ag_iterator(mp, xfs_sync_inode_data, flags,
-				      XFS_ICI_NO_TAG, 0, NULL);
+				      XFS_ICI_NO_TAG, 0, 0);
 	if (error)
 		return XFS_ERROR(error);
 
@@ -318,7 +316,7 @@ xfs_sync_attr(
 	ASSERT((flags & ~SYNC_WAIT) == 0);
 
 	return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags,
-				     XFS_ICI_NO_TAG, 0, NULL);
+				     XFS_ICI_NO_TAG, 0, 0);
 }
 
 STATIC int
@@ -821,100 +819,44 @@ reclaim:
 
 }
 
+/*
+ * Scan a certain number of inodes for reclaim. nr_to_scan <= 0 means reclaim
+ * every inode that has the reclaim tag set.
+ */
 int
-xfs_reclaim_inodes(
+xfs_reclaim_inodes_nr(
 	xfs_mount_t	*mp,
-	int		mode)
+	int		mode,
+	int		nr_to_scan)
 {
 	return xfs_inode_ag_iterator(mp, xfs_reclaim_inode, mode,
-					XFS_ICI_RECLAIM_TAG, 1, NULL);
+					XFS_ICI_RECLAIM_TAG, 1, nr_to_scan);
 }
 
 /*
- * Shrinker infrastructure.
+ * Return the number of reclaimable inodes in the filesystem for
+ * the shrinker to determine how much to reclaim.
  *
- * This is all far more complex than it needs to be. It adds a global list of
- * mounts because the shrinkers can only call a global context. We need to make
- * the shrinkers pass a context to avoid the need for global state.
+ * Because the inode cache may not have any reclaimable inodes in it, but will
+ * be populated as part of the higher level cleaning, we need to count all
+ * those inodes as reclaimable here as well.
  */
-static LIST_HEAD(xfs_mount_list);
-static struct rw_semaphore xfs_mount_list_lock;
-
-static int
-xfs_reclaim_inode_shrink(
-	struct shrinker	*shrink,
-	int		nr_to_scan,
-	gfp_t		gfp_mask)
+int
+xfs_reclaim_inodes_count(
+	xfs_mount_t	*mp)
 {
-	struct xfs_mount *mp;
-	struct xfs_perag *pag;
-	xfs_agnumber_t	ag;
-	int		reclaimable = 0;
-
-	if (nr_to_scan) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-
-		down_read(&xfs_mount_list_lock);
-		list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
-			xfs_inode_ag_iterator(mp, xfs_reclaim_inode, 0,
-					XFS_ICI_RECLAIM_TAG, 1, &nr_to_scan);
-			if (nr_to_scan <= 0)
-				break;
-		}
-		up_read(&xfs_mount_list_lock);
-	}
-
-	down_read(&xfs_mount_list_lock);
-	list_for_each_entry(mp, &xfs_mount_list, m_mplist) {
-		for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+	xfs_agnumber_t		ag;
+	int			reclaimable = 0;
 
-			pag = xfs_perag_get(mp, ag);
-			if (!pag->pag_ici_init) {
-				xfs_perag_put(pag);
-				continue;
-			}
-			reclaimable += pag->pag_ici_reclaimable;
+	for (ag = 0; ag < mp->m_sb.sb_agcount; ag++) {
+		struct xfs_perag *pag = xfs_perag_get(mp, ag);
+		if (!pag->pag_ici_init) {
 			xfs_perag_put(pag);
+			continue;
 		}
+		reclaimable += pag->pag_ici_reclaimable;
+		xfs_perag_put(pag);
 	}
-	up_read(&xfs_mount_list_lock);
-	return reclaimable;
-}
-
-static struct shrinker xfs_inode_shrinker = {
-	.shrink = xfs_reclaim_inode_shrink,
-	.seeks = DEFAULT_SEEKS,
-};
-
-void __init
-xfs_inode_shrinker_init(void)
-{
-	init_rwsem(&xfs_mount_list_lock);
-	register_shrinker(&xfs_inode_shrinker);
-}
-
-void
-xfs_inode_shrinker_destroy(void)
-{
-	ASSERT(list_empty(&xfs_mount_list));
-	unregister_shrinker(&xfs_inode_shrinker);
-}
-
-void
-xfs_inode_shrinker_register(
-	struct xfs_mount	*mp)
-{
-	down_write(&xfs_mount_list_lock);
-	list_add_tail(&mp->m_mplist, &xfs_mount_list);
-	up_write(&xfs_mount_list_lock);
+	return reclaimable + mp->m_super->s_nr_inodes_unused;
 }
 
-void
-xfs_inode_shrinker_unregister(
-	struct xfs_mount	*mp)
-{
-	down_write(&xfs_mount_list_lock);
-	list_del(&mp->m_mplist);
-	up_write(&xfs_mount_list_lock);
-}
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index cdcbaac..c55f645 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -43,7 +43,14 @@ void xfs_quiesce_attr(struct xfs_mount *mp);
 
 void xfs_flush_inodes(struct xfs_inode *ip);
 
-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+int xfs_reclaim_inodes_count(struct xfs_mount *mp);
+int xfs_reclaim_inodes_nr(struct xfs_mount *mp, int mode, int nr_to_scan);
+
+static inline int
+xfs_reclaim_inodes(struct xfs_mount *mp, int mode)
+{
+	return xfs_reclaim_inodes_nr(mp, mode, 0);
+}
 
 void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
 void __xfs_inode_set_reclaim_tag(struct xfs_perag *pag, struct xfs_inode *ip);
@@ -53,11 +60,6 @@ void __xfs_inode_clear_reclaim_tag(struct xfs_mount *mp, struct xfs_perag *pag,
 int xfs_sync_inode_valid(struct xfs_inode *ip, struct xfs_perag *pag);
 int xfs_inode_ag_iterator(struct xfs_mount *mp,
 	int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags),
-	int flags, int tag, int write_lock, int *nr_to_scan);
-
-void xfs_inode_shrinker_init(void);
-void xfs_inode_shrinker_destroy(void);
-void xfs_inode_shrinker_register(struct xfs_mount *mp);
-void xfs_inode_shrinker_unregister(struct xfs_mount *mp);
+	int flags, int tag, int write_lock, int nr_to_scan);
 
 #endif
diff --git a/fs/xfs/quota/xfs_qm_syscalls.c b/fs/xfs/quota/xfs_qm_syscalls.c
index 92b002f..f5b0e4e 100644
--- a/fs/xfs/quota/xfs_qm_syscalls.c
+++ b/fs/xfs/quota/xfs_qm_syscalls.c
@@ -894,7 +894,7 @@ xfs_qm_dqrele_all_inodes(
 {
 	ASSERT(mp->m_quotainfo);
 	xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags,
-				XFS_ICI_NO_TAG, 0, NULL);
+				XFS_ICI_NO_TAG, 0, 0);
 }
 
 /*------------------------------------------------------------------------*/
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 9ff48a1..4fa0bc7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -259,7 +259,6 @@ typedef struct xfs_mount {
 	wait_queue_head_t	m_wait_single_sync_task;
 	__int64_t		m_update_flags;	/* sb flags we need to update
 						   on the next remount,rw */
-	struct list_head	m_mplist;	/* inode shrinker mount list */
 } xfs_mount_t;
 
 /*
-- 
1.5.6.5


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-25  8:53 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
@ 2010-05-26 16:17   ` Nick Piggin
  2010-05-26 23:01     ` Dave Chinner
  2010-05-27 20:32   ` Andrew Morton
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2010-05-26 16:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, May 25, 2010 at 06:53:04PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode unused list is currently a global LRU. This does not match
> the other global filesystem cache - the dentry cache - which uses
> per-superblock LRU lists. Hence we have related filesystem object
> types using different LRU reclaimatin schemes.

Is this an improvement I wonder? The dcache is using per sb lists
because it specifically requires sb traversal.

What allocation/reclaim really wants (for good scalability and NUMA
characteristics) is per-zone lists for these things. It's easy to
convert a single list into per-zone lists.

It is much harder to convert per-sb lists into per-sb x per-zone lists.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
@ 2010-05-26 16:41   ` Nick Piggin
  2010-05-26 23:12     ` Dave Chinner
  2010-05-27  6:35   ` Nick Piggin
  2010-05-27 20:32   ` Andrew Morton
  2 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2010-05-26 16:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
>   * which flags are set. This means we don't need to maintain multiple
>   * similar copies of this loop.
>   */
> -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
>  {
>  	LIST_HEAD(referenced);
>  	LIST_HEAD(tmp);
>  	struct dentry *dentry;
> -	int cnt = 0;
>  
>  	BUG_ON(!sb);
> -	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
> +	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
>  	spin_lock(&dcache_lock);
> -	if (count != NULL)
> -		/* called from prune_dcache() and shrink_dcache_parent() */
> -		cnt = *count;
> -restart:
> -	if (count == NULL)
> +	if (count == -1)
>  		list_splice_init(&sb->s_dentry_lru, &tmp);
>  	else {
>  		while (!list_empty(&sb->s_dentry_lru)) {
> @@ -492,13 +487,13 @@ restart:
>  			} else {
>  				list_move_tail(&dentry->d_lru, &tmp);
>  				spin_unlock(&dentry->d_lock);
> -				cnt--;
> -				if (!cnt)
> +				if (--count == 0)
>  					break;
>  			}
>  			cond_resched_lock(&dcache_lock);
>  		}
>  	}
> +prune_more:
>  	while (!list_empty(&tmp)) {
>  		dentry = list_entry(tmp.prev, struct dentry, d_lru);
>  		dentry_lru_del_init(dentry);
> @@ -516,88 +511,29 @@ restart:
>  		/* dentry->d_lock was dropped in prune_one_dentry() */
>  		cond_resched_lock(&dcache_lock);
>  	}
> -	if (count == NULL && !list_empty(&sb->s_dentry_lru))
> -		goto restart;
> -	if (count != NULL)
> -		*count = cnt;
> +	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
> +		list_splice_init(&sb->s_dentry_lru, &tmp);
> +		goto prune_more;
> +	}

Nitpick, but I prefer just the restart label where it was previously. This
is moving setup for the next iteration into the "error" case.


> +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> +{
> +	struct super_block *sb;
> +	int count;
> +
> +	sb = container_of(shrink, struct super_block, s_shrink);
> +
> +	/*
> +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> +	 * to recurse into the FS that called us in clear_inode() and friends..
> +	 */
> +	if (!(gfp_mask & __GFP_FS))
> +		return -1;
> +
> +	/*
> +	 * if we can't get the umount lock, then there's no point having the
> +	 * shrinker try again because the sb is being torn down.
> +	 */
> +	if (!down_read_trylock(&sb->s_umount))
> +		return -1;

Would you just elaborate on the lock order problem somewhere? (the
comment makes it look like we *could* take the mutex if we wanted
to).


> +
> +	if (!sb->s_root) {
> +		up_read(&sb->s_umount);
> +		return -1;
> +	}
> +
> +	if (nr_to_scan) {
> +		/* proportion the scan between the two cacheѕ */
> +		int total;
> +
> +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> +
> +		/* prune dcache first as icache is pinned by it */
> +		prune_dcache_sb(sb, count);
> +		prune_icache_sb(sb, nr_to_scan - count);
> +	}
> +
> +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> +						* sysctl_vfs_cache_pressure;

Do you think truncating in the divisions is at all a problem? It
probably doesn't matter much I suppose.

> @@ -162,6 +213,7 @@ void deactivate_locked_super(struct super_block *s)
>  	struct file_system_type *fs = s->s_type;
>  	if (atomic_dec_and_test(&s->s_active)) {
>  		vfs_dq_off(s, 0);
> +		unregister_shrinker(&s->s_shrink);
>  		fs->kill_sb(s);
>  		put_filesystem(fs);
>  		put_super(s);
> @@ -335,6 +387,7 @@ retry:
>  	list_add_tail(&s->s_list, &super_blocks);
>  	list_add(&s->s_instances, &type->fs_supers);
>  	spin_unlock(&sb_lock);
> +	register_shrinker(&s->s_shrink);
>  	get_filesystem(type);
>  	return s;
>  }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7b90c43..5bff2dc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -382,6 +382,7 @@ struct inodes_stat_t {
>  #include <linux/capability.h>
>  #include <linux/semaphore.h>
>  #include <linux/fiemap.h>
> +#include <linux/mm.h>
>  
>  #include <asm/atomic.h>
>  #include <asm/byteorder.h>
> @@ -1385,8 +1386,14 @@ struct super_block {
>  	 * generic_show_options()
>  	 */
>  	char *s_options;
> +
> +	struct shrinker s_shrink;	/* per-sb shrinker handle */
>  };

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
                   ` (4 preceding siblings ...)
  2010-05-25  8:53 ` [PATCH 5/5] xfs: make use of new shrinker callout Dave Chinner
@ 2010-05-26 16:44 ` Nick Piggin
  2010-05-27 20:32 ` Andrew Morton
  2010-07-02 12:13 ` Christoph Hellwig
  7 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2010-05-26 16:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, May 25, 2010 at 06:53:03PM +1000, Dave Chinner wrote:
> 
> This series reworks the filesystem shrinkers. We currently have a
> set of issues with the current filesystem shrinkers:
> 
>         1. There is an dependency between dentry and inode cache
>            shrinking that is only implicitly defined by the order of
>            shrinker registration.
>         2. The shrinkers need to walk the superblock list and pin
>            the superblock to avoid unmount races with the sb going
>            away.
>         3. The dentry cache uses per-superblock LRUs and proportions
>            reclaim between all the superblocks which means we are
>            doing breadth based reclaim. This means we touch every
>            superblock for every shrinker call, and may only reclaim
>            a single dentry at a time from a given superblock.
>         4. The inode cache has a global LRU, so it has different
>            reclaim patterns to the dentry cache, despite the fact
>            that the dentry cache is generally the only thing that
>            pins inodes in memory.
>         5. Filesystems need to register their own shrinkers for
>            caches and can't co-ordinate them with the dentry and
>            inode cache shrinkers.

Seems like a fairly good approach overall. Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-26 16:17   ` Nick Piggin
@ 2010-05-26 23:01     ` Dave Chinner
  2010-05-27  2:04       ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-26 23:01 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 02:17:33AM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 06:53:04PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The inode unused list is currently a global LRU. This does not match
> > the other global filesystem cache - the dentry cache - which uses
> > per-superblock LRU lists. Hence we have related filesystem object
> > types using different LRU reclaimatin schemes.
> 
> Is this an improvement I wonder? The dcache is using per sb lists
> because it specifically requires sb traversal.

Right - I originally implemented the per-sb dentry lists for
scalability purposes. i.e. to avoid monopolising the dentry_lock
during unmount looking for dentries on a specific sb and hanging the
system for several minutes.

However, the reason for doing this to the inode cache is not for
scalability, it's because we have a tight relationship between the
dentry and inode caches. That is, reclaim from the dentry LRU grows
the inode LRU.  Like the registration of the shrinkers, this is kind
of an implicit, undocumented behaviour of the current shrinker
implementation.

What this patch series does is take that implicit relationship and
make it explicit.  It also allows other filesystem caches to tie
into the relationship if they need to (e.g. the XFS inode cache).
What it _doesn't do_ is change the macro level behaviour of the
shrinkers...

> What allocation/reclaim really wants (for good scalability and NUMA
> characteristics) is per-zone lists for these things. It's easy to
> convert a single list into per-zone lists.
>
> It is much harder to convert per-sb lists into per-sb x per-zone lists.

No it's not. Just convert the s_{dentry,inode}_lru lists on each
superblock and call the shrinker with a new zone mask field to pick
the correct LRU. That's no harder than converting a global LRU.
Anyway, you'd still have to do per-sb x per-zone lists for the dentry LRUs,
so changing the inode cache to per-sb makes no difference.
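
Something like this, to make it concrete (a sketch only - the per-zone
arrays, the zone_idx argument and the modified prune_*_sb() signatures
are all hypothetical, nothing here is part of this series):

	/* per-sb LRUs and counters become per-sb x per-zone arrays */
	struct super_block {
		/* ... */
		struct list_head s_dentry_lru[MAX_NR_ZONES];
		long		 s_nr_dentry_unused[MAX_NR_ZONES];
		struct list_head s_inode_lru[MAX_NR_ZONES];
		long		 s_nr_inodes_unused[MAX_NR_ZONES];
		/* ... */
	};

	/* the callout gains a zone index to select which LRU to walk */
	static int prune_super(struct shrinker *shrink, int zone_idx,
			       int nr_to_scan, gfp_t gfp_mask)
	{
		struct super_block *sb;
		int total, count;

		sb = container_of(shrink, struct super_block, s_shrink);
		/* same __GFP_FS, s_umount trylock and s_root checks as now */

		total = sb->s_nr_dentry_unused[zone_idx] +
			sb->s_nr_inodes_unused[zone_idx] + 1;
		count = (nr_to_scan * sb->s_nr_dentry_unused[zone_idx]) / total;

		/* dcache first, icache second, exactly as in this series */
		prune_dcache_sb(sb, zone_idx, count);
		prune_icache_sb(sb, zone_idx, nr_to_scan - count);
		/* ... */
		return 0;
	}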

However, this is a moot point because we don't have per-zone shrinker
interfaces. That's an entirely separate discussion because of the
macro-level behavioural changes it implies....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-26 16:41   ` Nick Piggin
@ 2010-05-26 23:12     ` Dave Chinner
  2010-05-27  1:53       ` [PATCH 3/5 v2] " Dave Chinner
  2010-05-27  2:19       ` [PATCH 3/5] " Nick Piggin
  0 siblings, 2 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-26 23:12 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> >   * which flags are set. This means we don't need to maintain multiple
> >   * similar copies of this loop.
> >   */
> > -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> > +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
> >  {
> >  	LIST_HEAD(referenced);
> >  	LIST_HEAD(tmp);
> >  	struct dentry *dentry;
> > -	int cnt = 0;
> >  
> >  	BUG_ON(!sb);
> > -	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
> > +	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
> >  	spin_lock(&dcache_lock);
> > -	if (count != NULL)
> > -		/* called from prune_dcache() and shrink_dcache_parent() */
> > -		cnt = *count;
> > -restart:
> > -	if (count == NULL)
> > +	if (count == -1)
> >  		list_splice_init(&sb->s_dentry_lru, &tmp);
> >  	else {
> >  		while (!list_empty(&sb->s_dentry_lru)) {
> > @@ -492,13 +487,13 @@ restart:
> >  			} else {
> >  				list_move_tail(&dentry->d_lru, &tmp);
> >  				spin_unlock(&dentry->d_lock);
> > -				cnt--;
> > -				if (!cnt)
> > +				if (--count == 0)
> >  					break;
> >  			}
> >  			cond_resched_lock(&dcache_lock);
> >  		}
> >  	}
> > +prune_more:
> >  	while (!list_empty(&tmp)) {
> >  		dentry = list_entry(tmp.prev, struct dentry, d_lru);
> >  		dentry_lru_del_init(dentry);
> > @@ -516,88 +511,29 @@ restart:
> >  		/* dentry->d_lock was dropped in prune_one_dentry() */
> >  		cond_resched_lock(&dcache_lock);
> >  	}
> > -	if (count == NULL && !list_empty(&sb->s_dentry_lru))
> > -		goto restart;
> > -	if (count != NULL)
> > -		*count = cnt;
> > +	if (count == -1 && !list_empty(&sb->s_dentry_lru)) {
> > +		list_splice_init(&sb->s_dentry_lru, &tmp);
> > +		goto prune_more;
> > +	}
> 
> Nitpick but I prefer just the restart label wher it is previously. This
> is moving setup for the next iteration into the "error" case.

Ok, will fix.

> > +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> > +{
> > +	struct super_block *sb;
> > +	int count;
> > +
> > +	sb = container_of(shrink, struct super_block, s_shrink);
> > +
> > +	/*
> > +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> > +	 * to recurse into the FS that called us in clear_inode() and friends..
> > +	 */
> > +	if (!(gfp_mask & __GFP_FS))
> > +		return -1;
> > +
> > +	/*
> > +	 * if we can't get the umount lock, then there's no point having the
> > +	 * shrinker try again because the sb is being torn down.
> > +	 */
> > +	if (!down_read_trylock(&sb->s_umount))
> > +		return -1;
> 
> Would you just elaborate on the lock order problem somewhere? (the
> comment makes it look like we *could* take the mutex if we wanted
> to).

The shrinker is unregistered in deactivate_locked_super() which is
just before ->kill_sb is called. The sb->s_umount lock is held at
this point. Hence if the shrinker is operating, we will deadlock if
we try to lock it like this:

	unmount:			shrinker:
					down_read(&shrinker_lock);
	down_write(&sb->s_umount)
	unregister_shrinker()
	down_write(&shrinker_lock)
					prune_super()
					  down_read(&sb->s_umount);
					  (deadlock)

hence if we can't get the sb->s_umount lock in prune_super(), then
the superblock must be being unmounted and the shrinker should abort
as the ->kill_sb method will clean up everything after the shrinker
is unregistered. Hence the down_read_trylock().


> > +	if (!sb->s_root) {
> > +		up_read(&sb->s_umount);
> > +		return -1;
> > +	}
> > +
> > +	if (nr_to_scan) {
> > +		/* proportion the scan between the two cacheѕ */
> > +		int total;
> > +
> > +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > +
> > +		/* prune dcache first as icache is pinned by it */
> > +		prune_dcache_sb(sb, count);
> > +		prune_icache_sb(sb, nr_to_scan - count);
> > +	}
> > +
> > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > +						* sysctl_vfs_cache_pressure;
> 
> Do you think truncating in the divisions is at all a problem? It
> probably doesn't matter much I suppose.

Same code as currently exists. IIRC, the reasoning is that if we've
got less than 100 objects to reclaim, then we're unlikely to be able
to free up any memory from the caches, anyway.
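
To put concrete numbers on it (default vfs_cache_pressure of 100, the
figures are purely illustrative):

	/* 99 unused objects in total: nothing is ever reported */
	count = (99 / 100) * 100;	/* == 0 */

	/* 250 unused objects: we report 200, i.e. up to 99 objects of
	 * rounding error stay invisible to the shrinker */
	count = (250 / 100) * 100;	/* == 200 */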

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 3/5 v2] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-26 23:12     ` Dave Chinner
@ 2010-05-27  1:53       ` Dave Chinner
  2010-05-27  4:01         ` Al Viro
  2010-05-27  2:19       ` [PATCH 3/5] " Nick Piggin
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-27  1:53 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
....
> > Nitpick but I prefer just the restart label wher it is previously. This
> > is moving setup for the next iteration into the "error" case.
> 
> Ok, will fix.
....
> > Would you just elaborate on the lock order problem somewhere? (the
> > comment makes it look like we *could* take the mutex if we wanted
> > to).
> 
> The shrinker is unregistered in deactivate_locked_super() which is
> just before ->kill_sb is called. The sb->s_umount lock is held at
> this point. hence is the shrinker is operating, we will deadlock if
> we try to lock it like this:
> 
> 	unmount:			shrinker:
> 					down_read(&shrinker_lock);
> 	down_write(&sb->s_umount)
> 	unregister_shrinker()
> 	down_write(&shrinker_lock)
> 					prune_super()
> 					  down_read(&sb->s_umount);
> 					  (deadlock)
> 
> hence if we can't get the sb->s_umount lock in prune_super(), then
> the superblock must be being unmounted and the shrinker should abort
> as the ->kill_sb method will clean up everything after the shrinker
> is unregistered. Hence the down_read_trylock().

Updated patch below with these issues fixed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

superblock: introduce per-sb cache shrinker infrastructure

From: Dave Chinner <dchinner@redhat.com>

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well.  The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code as the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
Version 2:
- change loop restart in __shrink_dcache_sb() to match previous
  restart semantics
- add a better comment in prune_super() to explain the deadlock we
  are avoiding by using down_read_trylock(&sb->s_umount) before
  starting any shrinking.

 fs/dcache.c        |  127 +++++++---------------------------------------------
 fs/inode.c         |  109 +++-----------------------------------------
 fs/super.c         |   58 ++++++++++++++++++++++++
 include/linux/fs.h |    7 +++
 4 files changed, 89 insertions(+), 212 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index dba6b6d..a7cd335 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -456,21 +456,17 @@ static void prune_one_dentry(struct dentry * dentry)
  * which flags are set. This means we don't need to maintain multiple
  * similar copies of this loop.
  */
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
 {
 	LIST_HEAD(referenced);
 	LIST_HEAD(tmp);
 	struct dentry *dentry;
-	int cnt = 0;
 
 	BUG_ON(!sb);
-	BUG_ON((flags & DCACHE_REFERENCED) && count == NULL);
+	BUG_ON((flags & DCACHE_REFERENCED) && count == -1);
 	spin_lock(&dcache_lock);
-	if (count != NULL)
-		/* called from prune_dcache() and shrink_dcache_parent() */
-		cnt = *count;
 restart:
-	if (count == NULL)
+	if (count == -1)
 		list_splice_init(&sb->s_dentry_lru, &tmp);
 	else {
 		while (!list_empty(&sb->s_dentry_lru)) {
@@ -492,8 +488,7 @@ restart:
 			} else {
 				list_move_tail(&dentry->d_lru, &tmp);
 				spin_unlock(&dentry->d_lock);
-				cnt--;
-				if (!cnt)
+				if (--count == 0)
 					break;
 			}
 			cond_resched_lock(&dcache_lock);
@@ -516,88 +511,27 @@ restart:
 		/* dentry->d_lock was dropped in prune_one_dentry() */
 		cond_resched_lock(&dcache_lock);
 	}
-	if (count == NULL && !list_empty(&sb->s_dentry_lru))
+	if (count == -1 && !list_empty(&sb->s_dentry_lru))
 		goto restart;
-	if (count != NULL)
-		*count = cnt;
 	if (!list_empty(&referenced))
 		list_splice(&referenced, &sb->s_dentry_lru);
 	spin_unlock(&dcache_lock);
 }
 
 /**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try to free
+ * prune_dcache_sb - shrink the dcache
+ * @nr_to_scan: number of entries to try to free
  *
- * Shrink the dcache. This is done when we need more memory, or simply when we
- * need to unmount something (at which point we need to unuse all dentries).
+ * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
+ * done when we need more memory and is called from the superblock shrinker
+ * function.
  *
- * This function may fail to free any resources if all the dentries are in use.
+ * This function may fail to free any resources if all the dentries are in
+ * use.
  */
-static void prune_dcache(int count)
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan)
 {
-	struct super_block *sb, *n;
-	int w_count;
-	int unused = dentry_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	spin_lock(&dcache_lock);
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_dentry_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_dentry_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_dentry_lru))) {
-				spin_unlock(&dcache_lock);
-				__shrink_dcache_sb(sb, &w_count,
-						DCACHE_REFERENCED);
-				pruned -= w_count;
-				spin_lock(&dcache_lock);
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		__put_super(sb);
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-	spin_unlock(&dcache_lock);
+	__shrink_dcache_sb(sb, nr_to_scan, DCACHE_REFERENCED);
 }
 
 /**
@@ -610,7 +544,7 @@ static void prune_dcache(int count)
  */
 void shrink_dcache_sb(struct super_block * sb)
 {
-	__shrink_dcache_sb(sb, NULL, 0);
+	__shrink_dcache_sb(sb, -1, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_sb);
 
@@ -878,37 +812,10 @@ void shrink_dcache_parent(struct dentry * parent)
 	int found;
 
 	while ((found = select_parent(parent)) != 0)
-		__shrink_dcache_sb(sb, &found, 0);
+		__shrink_dcache_sb(sb, found, 0);
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
-/*
- * Scan `nr' dentries and return the number which remain.
- *
- * We need to avoid reentering the filesystem if the caller is performing a
- * GFP_NOFS allocation attempt.  One example deadlock is:
- *
- * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
- * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
- * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
- */
-static int shrink_dcache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_dcache(nr);
-	}
-	return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker dcache_shrinker = {
-	.shrink = shrink_dcache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 /**
  * d_alloc	-	allocate a dcache entry
  * @parent: parent of entry to allocate
@@ -2316,8 +2223,6 @@ static void __init dcache_init(void)
 	 */
 	dentry_cache = KMEM_CACHE(dentry,
 		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-	
-	register_shrinker(&dcache_shrinker);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
diff --git a/fs/inode.c b/fs/inode.c
index 1e44ec5..5fb4a39 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -25,7 +25,6 @@
 #include <linux/mount.h>
 #include <linux/async.h>
 #include <linux/posix_acl.h>
-#include "internal.h"
 
 /*
  * This is needed for the following functions:
@@ -441,8 +440,10 @@ static int can_unuse(struct inode *inode)
 }
 
 /*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to
- * a temporary list and then are freed outside inode_lock by dispose_list().
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
  *
  * Any inodes which are pinned purely because of attached pagecache have their
  * pagecache removed.  We expect the final iput() on that inode to add it to
@@ -450,10 +451,10 @@ static int can_unuse(struct inode *inode)
  * inode is still freeable, proceed.  The right inode is found 99.9% of the
  * time in testing on a 4-way.
  *
- * If the inode has metadata buffers attached to mapping->private_list then
- * try to remove them.
+ * If the inode has metadata buffers attached to mapping->private_list then try
+ * to remove them.
  */
-static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
+void prune_icache_sb(struct super_block *sb, int nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	int nr_pruned = 0;
@@ -461,7 +462,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	unsigned long reap = 0;
 
 	spin_lock(&inode_lock);
-	for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
+	for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
 		struct inode *inode;
 
 		if (list_empty(&sb->s_inode_lru))
@@ -500,103 +501,10 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
 	else
 		__count_vm_events(PGINODESTEAL, reap);
 	spin_unlock(&inode_lock);
-	*nr_to_scan = nr_scanned;
 
 	dispose_list(&freeable);
 }
 
-static void prune_icache(int count)
-{
-	struct super_block *sb, *n;
-	int w_count;
-	int unused = inodes_stat.nr_unused;
-	int prune_ratio;
-	int pruned;
-
-	if (unused == 0 || count == 0)
-		return;
-	down_read(&iprune_sem);
-	if (count >= unused)
-		prune_ratio = 1;
-	else
-		prune_ratio = unused / count;
-	spin_lock(&sb_lock);
-	list_for_each_entry_safe(sb, n, &super_blocks, s_list) {
-		if (list_empty(&sb->s_instances))
-			continue;
-		if (sb->s_nr_inodes_unused == 0)
-			continue;
-		sb->s_count++;
-		/* Now, we reclaim unused dentrins with fairness.
-		 * We reclaim them same percentage from each superblock.
-		 * We calculate number of dentries to scan on this sb
-		 * as follows, but the implementation is arranged to avoid
-		 * overflows:
-		 * number of dentries to scan on this sb =
-		 * count * (number of dentries on this sb /
-		 * number of dentries in the machine)
-		 */
-		spin_unlock(&sb_lock);
-		if (prune_ratio != 1)
-			w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
-		else
-			w_count = sb->s_nr_inodes_unused;
-		pruned = w_count;
-		/*
-		 * We need to be sure this filesystem isn't being unmounted,
-		 * otherwise we could race with generic_shutdown_super(), and
-		 * end up holding a reference to an inode while the filesystem
-		 * is unmounted.  So we try to get s_umount, and make sure
-		 * s_root isn't NULL.
-		 */
-		if (down_read_trylock(&sb->s_umount)) {
-			if ((sb->s_root != NULL) &&
-			    (!list_empty(&sb->s_inode_lru))) {
-				shrink_icache_sb(sb, &w_count);
-				pruned -= w_count;
-			}
-			up_read(&sb->s_umount);
-		}
-		spin_lock(&sb_lock);
-		count -= pruned;
-		__put_super(sb);
-		/* more work left to do? */
-		if (count <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-	up_read(&iprune_sem);
-}
-
-/*
- * shrink_icache_memory() will attempt to reclaim some unused inodes.  Here,
- * "unused" means that no dentries are referring to the inodes: the files are
- * not open and the dcache references to those inodes have already been
- * reclaimed.
- *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
- */
-static int shrink_icache_memory(struct shrinker *shrink, int nr, gfp_t gfp_mask)
-{
-	if (nr) {
-		/*
-		 * Nasty deadlock avoidance.  We may hold various FS locks,
-		 * and we don't want to recurse into the FS that called us
-		 * in clear_inode() and friends..
-		 */
-		if (!(gfp_mask & __GFP_FS))
-			return -1;
-		prune_icache(nr);
-	}
-	return (inodes_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker icache_shrinker = {
-	.shrink = shrink_icache_memory,
-	.seeks = DEFAULT_SEEKS,
-};
-
 static void __wait_on_freeing_inode(struct inode *inode);
 /*
  * Called with the inode lock held.
@@ -1634,7 +1542,6 @@ void __init inode_init(void)
 					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
 					 SLAB_MEM_SPREAD),
 					 init_once);
-	register_shrinker(&icache_shrinker);
 
 	/* Hash may have been set up in inode_init_early */
 	if (!hashdist)
diff --git a/fs/super.c b/fs/super.c
index c554c53..613339b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -37,6 +37,55 @@
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
+{
+	struct super_block *sb;
+	int count;
+
+	sb = container_of(shrink, struct super_block, s_shrink);
+
+	/*
+	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
+	 * to recurse into the FS that called us in clear_inode() and friends..
+	 */
+	if (!(gfp_mask & __GFP_FS))
+		return -1;
+
+	/*
+	 * If we can't get the umount lock, then it's because the sb is being
+	 * unmounted. If we get here, then the unmount is likely stuck trying
+	 * to unregister the shrinker, so we must not block trying to get the
+	 * sb->s_umount otherwise we deadlock. Hence if we fail to get the
+	 * sb_umount lock, abort shrinking the sb by telling the shrinker not
+	 * to call us again and the unmount process will clean up the cache for
+	 * us after it has unregistered the shrinker.
+	 */
+	if (!down_read_trylock(&sb->s_umount))
+		return -1;
+
+	if (!sb->s_root) {
+		up_read(&sb->s_umount);
+		return -1;
+	}
+
+	if (nr_to_scan) {
+		/* proportion the scan between the two caches */
+		int total;
+
+		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
+		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
+
+		/* prune dcache first as icache is pinned by it */
+		prune_dcache_sb(sb, count);
+		prune_icache_sb(sb, nr_to_scan - count);
+	}
+
+	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
+						* sysctl_vfs_cache_pressure;
+	up_read(&sb->s_umount);
+	return count;
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -99,6 +148,13 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		s->s_qcop = sb_quotactl_ops;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
+
+		/*
+		 * The shrinker is set up here but not registered until after
+		 * the superblock has been filled out successfully.
+		 */
+		s->s_shrink.shrink = prune_super;
+		s->s_shrink.seeks = DEFAULT_SEEKS;
 	}
 out:
 	return s;
@@ -162,6 +218,7 @@ void deactivate_locked_super(struct super_block *s)
 	struct file_system_type *fs = s->s_type;
 	if (atomic_dec_and_test(&s->s_active)) {
 		vfs_dq_off(s, 0);
+		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
 		put_filesystem(fs);
 		put_super(s);
@@ -335,6 +392,7 @@ retry:
 	list_add_tail(&s->s_list, &super_blocks);
 	list_add(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
+	register_shrinker(&s->s_shrink);
 	get_filesystem(type);
 	return s;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7b90c43..5bff2dc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -382,6 +382,7 @@ struct inodes_stat_t {
 #include <linux/capability.h>
 #include <linux/semaphore.h>
 #include <linux/fiemap.h>
+#include <linux/mm.h>
 
 #include <asm/atomic.h>
 #include <asm/byteorder.h>
@@ -1385,8 +1386,14 @@ struct super_block {
 	 * generic_show_options()
 	 */
 	char *s_options;
+
+	struct shrinker s_shrink;	/* per-sb shrinker handle */
 };
 
+/* superblock cache pruning functions */
+void prune_icache_sb(struct super_block *sb, int nr_to_scan);
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
+
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-26 23:01     ` Dave Chinner
@ 2010-05-27  2:04       ` Nick Piggin
  2010-05-27  4:02         ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2010-05-27  2:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 09:01:29AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 02:17:33AM +1000, Nick Piggin wrote:
> > On Tue, May 25, 2010 at 06:53:04PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The inode unused list is currently a global LRU. This does not match
> > > the other global filesystem cache - the dentry cache - which uses
> > > per-superblock LRU lists. Hence we have related filesystem object
> > > types using different LRU reclaimatin schemes.
> > 
> > Is this an improvement I wonder? The dcache is using per sb lists
> > because it specifically requires sb traversal.
> 
> Right - I originally implemented the per-sb dentry lists for
> scalability purposes. i.e. to avoid monopolising the dentry_lock
> during unmount looking for dentries on a specific sb and hanging the
> system for several minutes.
> 
> However, the reason for doing this to the inode cache is not for
> scalability, it's because we have a tight relationship between the
> dentry and inode cacheѕ. That is, reclaim from the dentry LRU grows
> the inode LRU.  Like the registration of the shrinkers, this is kind
> of an implicit, undocumented behavour of the current shrinker
> implemenation.

Right, that's why I wonder whether it is an improvement. It would
be interesting to see some tests (showing at least parity).

 
> What this patch series does is take that implicit relationship and
> make it explicit.  It also allows other filesystem caches to tie
> into the relationship if they need to (e.g. the XFS inode cache).
> What it _doesn't do_ is change the macro level behaviour of the
> shrinkers...
> 
> > What allocation/reclaim really wants (for good scalability and NUMA
> > characteristics) is per-zone lists for these things. It's easy to
> > convert a single list into per-zone lists.
> >
> > It is much harder to convert per-sb lists into per-sb x per-zone lists.
> 
> No it's not. Just convert the s_{dentry,inode}_lru lists on each
> superblock and call the shrinker with a new zone mask field to pick
> the correct LRU. That's no harder than converting a global LRU.
> Anyway, you'd still have to do per-sb x per-zone lists for the dentry LRUs,
> so changing the inode cache to per-sb makes no difference.

Right, it just makes it harder to do. By much harder, I did mostly mean
the extra memory overhead. If there is *no* benefit from doing per-sb
icache then I would question whether we should.

 
> However, this is a moot point because we don't have per-zone shrinker
> interfaces. That's an entirely separate discussion because of the
> macro-level behavioural changes it implies....

Yep. I have some patches for it, but they're currently behind the other
fine grained locking stuff. But it's something that really needs to be
implemented, IMO.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-26 23:12     ` Dave Chinner
  2010-05-27  1:53       ` [PATCH 3/5 v2] " Dave Chinner
@ 2010-05-27  2:19       ` Nick Piggin
  2010-05-27  4:07         ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2010-05-27  2:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> > > +	/*
> > > +	 * if we can't get the umount lock, then there's no point having the
> > > +	 * shrinker try again because the sb is being torn down.
> > > +	 */
> > > +	if (!down_read_trylock(&sb->s_umount))
> > > +		return -1;
> > 
> > Would you just elaborate on the lock order problem somewhere? (the
> > comment makes it look like we *could* take the mutex if we wanted
> > to).
> 
> The shrinker is unregistered in deactivate_locked_super() which is
> just before ->kill_sb is called. The sb->s_umount lock is held at
> this point. hence is the shrinker is operating, we will deadlock if
> we try to lock it like this:
> 
> 	unmount:			shrinker:
> 					down_read(&shrinker_lock);
> 	down_write(&sb->s_umount)
> 	unregister_shrinker()
> 	down_write(&shrinker_lock)
> 					prune_super()
> 					  down_read(&sb->s_umount);
> 					  (deadlock)
> 
> hence if we can't get the sb->s_umount lock in prune_super(), then
> the superblock must be being unmounted and the shrinker should abort
> as the ->kill_sb method will clean up everything after the shrinker
> is unregistered. Hence the down_read_trylock().

You added it to the comment in your updated patch, that was the main
thing I wanted. Thanks.


> > > +	if (!sb->s_root) {
> > > +		up_read(&sb->s_umount);
> > > +		return -1;
> > > +	}
> > > +
> > > +	if (nr_to_scan) {
> > > +		/* proportion the scan between the two cacheѕ */
> > > +		int total;
> > > +
> > > +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > > +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > > +
> > > +		/* prune dcache first as icache is pinned by it */
> > > +		prune_dcache_sb(sb, count);
> > > +		prune_icache_sb(sb, nr_to_scan - count);
> > > +	}
> > > +
> > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > +						* sysctl_vfs_cache_pressure;
> > 
> > Do you think truncating in the divisions is at all a problem? It
> > probably doesn't matter much I suppose.
> 
> Same code as currently exists. IIRC, the reasoning is that if we've
> got less that 100 objects to reclaim, then we're unlikely to be able
> to free up any memory from the caches, anyway.

Yeah, which is why I stop short of saying you should change it in
this patch.

But I think we should ensure things can get reclaimed eventually.
100 objects could be 100 slabs, which could be anything from
half a meg to half a dozen. Multiplied by each of the caches.
Could be significant in small systems.
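
To put rough, purely illustrative numbers on that: a slab is at least a
page, so 100 stranded objects that each happen to pin a different slab
hold onto at least 100 * 4K = ~400K; with multi-page slabs (say 32K)
the same 100 objects pin ~3MB. Do that for both the dentry and inode
caches on every superblock and it is real memory on a small box.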

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5 v2] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  1:53       ` [PATCH 3/5 v2] " Dave Chinner
@ 2010-05-27  4:01         ` Al Viro
  2010-05-27  6:17           ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2010-05-27  4:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Nick Piggin, linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 11:53:35AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> ....
> > > Nitpick but I prefer just the restart label wher it is previously. This
> > > is moving setup for the next iteration into the "error" case.
> > 
> > Ok, will fix.
> ....
> > > Would you just elaborate on the lock order problem somewhere? (the
> > > comment makes it look like we *could* take the mutex if we wanted
> > > to).
> > 
> > The shrinker is unregistered in deactivate_locked_super() which is
> > just before ->kill_sb is called. The sb->s_umount lock is held at
> > this point. hence is the shrinker is operating, we will deadlock if
> > we try to lock it like this:
> > 
> > 	unmount:			shrinker:
> > 					down_read(&shrinker_lock);
> > 	down_write(&sb->s_umount)
> > 	unregister_shrinker()
> > 	down_write(&shrinker_lock)
> > 					prune_super()
> > 					  down_read(&sb->s_umount);
> > 					  (deadlock)
> > 
> > hence if we can't get the sb->s_umount lock in prune_super(), then
> > the superblock must be being unmounted and the shrinker should abort
> > as the ->kill_sb method will clean up everything after the shrinker
> > is unregistered. Hence the down_read_trylock().

Um...  Maybe I'm dumb, but what's wrong with doing unregistration from
deactivate_locked_super(), right after the call of ->kill_sb()?  At that
point ->s_umount is already dropped, so we won't deadlock at all.
Shrinker rwsem will make sure that all shrinkers-in-progress will run
to completion, so we won't get a superblock freed under prune_super().
I don't particularly mind the down_read_trylock() in prune_super(), but why not
make life obviously safer?
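
Concretely, I mean something like this (just a sketch of where the call
would move to):

	if (atomic_dec_and_test(&s->s_active)) {
		vfs_dq_off(s, 0);
		fs->kill_sb(s);		/* ->s_umount is dropped in here */
		/*
		 * No s_umount deadlock is possible now, and the shrinker
		 * rwsem makes unregister_shrinker() wait for any in-flight
		 * prune_super() calls to complete.
		 */
		unregister_shrinker(&s->s_shrink);
		put_filesystem(fs);
		put_super(s);
	}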

Am I missing something here?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-27  2:04       ` Nick Piggin
@ 2010-05-27  4:02         ` Dave Chinner
  2010-05-27  4:23           ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-27  4:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 12:04:45PM +1000, Nick Piggin wrote:
> On Thu, May 27, 2010 at 09:01:29AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 02:17:33AM +1000, Nick Piggin wrote:
> > > On Tue, May 25, 2010 at 06:53:04PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > The inode unused list is currently a global LRU. This does not match
> > > > the other global filesystem cache - the dentry cache - which uses
> > > > per-superblock LRU lists. Hence we have related filesystem object
> > > > types using different LRU reclaimatin schemes.
> > > 
> > > Is this an improvement I wonder? The dcache is using per sb lists
> > > because it specifically requires sb traversal.
> > 
> > Right - I originally implemented the per-sb dentry lists for
> > scalability purposes. i.e. to avoid monopolising the dentry_lock
> > during unmount looking for dentries on a specific sb and hanging the
> > system for several minutes.
> > 
> > However, the reason for doing this to the inode cache is not for
> > scalability, it's because we have a tight relationship between the
> > dentry and inode cacheѕ. That is, reclaim from the dentry LRU grows
> > the inode LRU.  Like the registration of the shrinkers, this is kind
> > of an implicit, undocumented behavour of the current shrinker
> > implemenation.
> 
> Right, that's why I wonder whether it is an improvement. It would
> be interesting to see some tests (showing at least parity).

I've done some testing showing parity. They've been along the lines
of:
	- populate cache with 1m dentries + inodes
	- run 'time echo 2 > /proc/sys/vm/drop_caches'

I've used different methods of populating the caches to have them
non-sequential in the LRU (i.e. trigger fragmentation), have dirty
backing inodes (e.g. the VFS inode clean, the xfs inode dirty
because transactions haven't completed), etc.

The variation on the test is around +-10%, with the per-sb shrinkers
averaging about 5% lower time to reclaim. This is within the error
margin of the test, so it's not really a conclusive win, but it
certainly shows that it does not slow anything down. If you've got a
better way to test it, then I'm all ears....

> > What this patch series does is take that implicit relationship and
> > make it explicit.  It also allows other filesystem caches to tie
> > into the relationship if they need to (e.g. the XFS inode cache).
> > What it _doesn't do_ is change the macro level behaviour of the
> > shrinkers...
> > 
> > > What allocation/reclaim really wants (for good scalability and NUMA
> > > characteristics) is per-zone lists for these things. It's easy to
> > > convert a single list into per-zone lists.
> > >
> > > It is much harder to convert per-sb lists into per-sb x per-zone lists.
> > 
> > No it's not. Just convert the s_{dentry,inode}_lru lists on each
> > superblock and call the shrinker with a new zone mask field to pick
> > the correct LRU. That's no harder than converting a global LRU.
> > Anyway, you'd still have to do per-sb x per-zone lists for the dentry LRUs,
> > so changing the inode cache to per-sb makes no difference.
> 
> Right, it just makes it harder to do. By much harder, I did mostly mean
> the extra memory overhead.

You've still got to allocate that extra memory on the per-sb dentry
LRUs so it's not really a valid argument. IOWs, if it's too much
memory for per-sb inode LRUs, then it's too much memory for the
per-sb dentry LRUs as well...

> If there is *no* benefit from doing per-sb
> icache then I would question whether we should.

The same vague questions about the benefit of per-sb
dentry LRUs were raised when I first proposed them years ago, and
look where we are now.  Besides, focussing on whether this one patch
is a benefit or not is really missing the point because it's the
benefits of this patchset as a whole that need to be considered....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  2:19       ` [PATCH 3/5] " Nick Piggin
@ 2010-05-27  4:07         ` Dave Chinner
  2010-05-27  4:24           ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-27  4:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 12:19:05PM +1000, Nick Piggin wrote:
> On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > > +						* sysctl_vfs_cache_pressure;
> > > 
> > > Do you think truncating in the divisions is at all a problem? It
> > > probably doesn't matter much I suppose.
> > 
> > Same code as currently exists. IIRC, the reasoning is that if we've
> > got less that 100 objects to reclaim, then we're unlikely to be able
> > to free up any memory from the caches, anyway.
> 
> Yeah, which is why I stop short of saying you should change it in
> this patch.
> 
> But I think we should ensure things can get reclaimed eventually.
> 100 objects could be 100 slabs, which could be anything from
> half a meg to half a dozen. Multiplied by each of the caches.
> Could be significant in small systems.

True, but usually there are busy objects in the dentry and inode
slabs, so it shouldn't be a significant issue. We can probably
address such problems if they can be demonstrated to be an issue in
a separate patch set....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-27  4:02         ` Dave Chinner
@ 2010-05-27  4:23           ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2010-05-27  4:23 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 02:02:10PM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 12:04:45PM +1000, Nick Piggin wrote:
> > On Thu, May 27, 2010 at 09:01:29AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 02:17:33AM +1000, Nick Piggin wrote:
> > > > On Tue, May 25, 2010 at 06:53:04PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > The inode unused list is currently a global LRU. This does not match
> > > > > the other global filesystem cache - the dentry cache - which uses
> > > > > per-superblock LRU lists. Hence we have related filesystem object
> > > > > types using different LRU reclaimatin schemes.
> > > > 
> > > > Is this an improvement I wonder? The dcache is using per sb lists
> > > > because it specifically requires sb traversal.
> > > 
> > > Right - I originally implemented the per-sb dentry lists for
> > > scalability purposes. i.e. to avoid monopolising the dentry_lock
> > > during unmount looking for dentries on a specific sb and hanging the
> > > system for several minutes.
> > > 
> > > However, the reason for doing this to the inode cache is not for
> > > scalability, it's because we have a tight relationship between the
> > > dentry and inode cacheѕ. That is, reclaim from the dentry LRU grows
> > > the inode LRU.  Like the registration of the shrinkers, this is kind
> > > of an implicit, undocumented behavour of the current shrinker
> > > implemenation.
> > 
> > Right, that's why I wonder whether it is an improvement. It would
> > be interesting to see some tests (showing at least parity).
> 
> I've done some testing showing parity. They've been along the lines
> of:
> 	- populate cache with 1m dentries + inodes
> 	- run 'time echo 2 > /proc/sys/vm/drop_caches'
> 
> I've used different methods of populating the caches to have them
> non-sequential in the LRU (i.e. trigger fragmentation), have dirty
> backing inodes (e.g. the VFS inode clean, the xfs inode dirty
> because transactions haven't completed), etc.
> 
> The variation on the test is around +-10%, with the per-sb shrinkers
> averaging about 5% lower time to reclaim. This is within the error
> margin of the test, so it's not really a conclusive win, but it is
> certainly shows that it does not slow anything down. If you've got a
> better way to test it, then I'm all ears....

I guess the problem is that inode LRU cache isn't very useful as
long as there are dentries in the way (which is most of the time,
isn't it?). I think nfsd will exercise them better? Don't know of
any other cases.


> > Right, it just makes it harder to do. By much harder, I did mostly mean
> > the extra memory overhead.
> 
> You've still got to allocate that extra memory on the per-sb dentry
> LRUs so it's not really a valid argument.

Well it would be per-zone, per-sb lists, but I don't think that
makes it an invalid point.


> IOWs, if it's too much
> memory for per-sb inode LRUs, then it's too much memory for the
> per-sb dentry LRUs as well...

It's not about how much is too much, it's about more cost or memory
usage for what benefit? I guess it isn't a lot more memory though.

 
> > If there is *no* benefit from doing per-sb
> > icache then I would question whether we should.
> 
> The same vague questions wondering about the benefit of per-sb
> dentry LRUs were raised when I first proposed them years ago, and
> look where we are now.

To be fair that is because there were specific needs to do per-sb
pruning. This isn't the case with icache.


>  Besides, focussing on whether this one patch
> is a benefit or not is really missing the point because it's the
> benefits of this patchset as a whole that need to be considered....

I would indeed like to focus on the benefits of the patchset as a
whole. Leaving aside the xfs changes, it would be interesting to
have at least a few numbers for dcache/icache heavy workloads.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  4:07         ` Dave Chinner
@ 2010-05-27  4:24           ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2010-05-27  4:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 02:07:04PM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 12:19:05PM +1000, Nick Piggin wrote:
> > On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > > > > +	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> > > > > +						* sysctl_vfs_cache_pressure;
> > > > 
> > > > Do you think truncating in the divisions is at all a problem? It
> > > > probably doesn't matter much I suppose.
> > > 
> > > Same code as currently exists. IIRC, the reasoning is that if we've
> > > got less that 100 objects to reclaim, then we're unlikely to be able
> > > to free up any memory from the caches, anyway.
> > 
> > Yeah, which is why I stop short of saying you should change it in
> > this patch.
> > 
> > But I think we should ensure things can get reclaimed eventually.
> > 100 objects could be 100 slabs, which could be anything from
> > half a meg to half a dozen. Multiplied by each of the caches.
> > Could be significant in small systems.
> 
> True, but usually there are busy objects in the dentry and inode
> slabs, so it shouldn't be a significant issue. We can probably
> address such problems if they can be demonstrated to be an issue in
> a separate patch set....

I didn't want to say it is a problem with your patchset, I just
thought of it when reviewing.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5 v2] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  4:01         ` Al Viro
@ 2010-05-27  6:17           ` Dave Chinner
  2010-05-27  6:46             ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-27  6:17 UTC (permalink / raw)
  To: Al Viro; +Cc: Nick Piggin, linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 05:01:20AM +0100, Al Viro wrote:
> On Thu, May 27, 2010 at 11:53:35AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 09:12:14AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 02:41:16AM +1000, Nick Piggin wrote:
> > ....
> > > > Nitpick but I prefer just the restart label wher it is previously. This
> > > > is moving setup for the next iteration into the "error" case.
> > > 
> > > Ok, will fix.
> > ....
> > > > Would you just elaborate on the lock order problem somewhere? (the
> > > > comment makes it look like we *could* take the mutex if we wanted
> > > > to).
> > > 
> > > The shrinker is unregistered in deactivate_locked_super() which is
> > > just before ->kill_sb is called. The sb->s_umount lock is held at
> > > this point. hence is the shrinker is operating, we will deadlock if
> > > we try to lock it like this:
> > > 
> > > 	unmount:			shrinker:
> > > 					down_read(&shrinker_lock);
> > > 	down_write(&sb->s_umount)
> > > 	unregister_shrinker()
> > > 	down_write(&shrinker_lock)
> > > 					prune_super()
> > > 					  down_read(&sb->s_umount);
> > > 					  (deadlock)
> > > 
> > > hence if we can't get the sb->s_umount lock in prune_super(), then
> > > the superblock must be being unmounted and the shrinker should abort
> > > as the ->kill_sb method will clean up everything after the shrinker
> > > is unregistered. Hence the down_read_trylock().
> 
> Um...  Maybe I'm dumb, but what's wrong with doing unregistration from
> deactivate_locked_super(), right after the call of ->kill_sb()?  At that
> point ->s_umount is already dropped, so we won't deadlock at all.
> Shrinker rwsem will make sure that all shrinkers-in-progress will run
> to completion, so we won't get a superblock freed under prune_super().
> I don't particulary mind down_try_read() in prune_super(), but why not
> make life obviously safer?
> 
> Am I missing something here?

I was worried about memory allocation in the ->kill_sb path
deadlocking on the s_umount lock if it enters reclaim. e.g.  XFS
inodes can still be dirty even after the VFS has disposed of them,
and writing them back can require page cache allocation for the
backing buffers. If allocation recurses back into the shrinker, we
can deadlock on the s_umount lock.  This doesn't seem like an XFS
specific problem, so I used a trylock to avoid that whole class of
problems (same way the current shrinkers do).
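
i.e. the recursion I'm worried about, if the shrinker is still
registered while ->kill_sb runs, looks roughly like this (the call
chain is illustrative, not an exact trace):

	unmount, holding sb->s_umount (write)
	  ->kill_sb()
	    write back dirty XFS inodes
	      allocate pages for the backing buffers
	        direct reclaim -> shrink_slab()
	          prune_super()
	            down_read(&sb->s_umount)	<-- deadlock without the trylock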

From there, we can unregister the shrinker before calling ->kill_sb
as per above. That, in turn, means that the unmount
invalidate_inodes() vs shrinker race goes away and the iprune_sem is
not needed in the new prune_icache_sb() function.  I'm pretty sure
that I can now remove the iprune_sem, but I haven't written the
patch to do that yet.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
  2010-05-26 16:41   ` Nick Piggin
@ 2010-05-27  6:35   ` Nick Piggin
  2010-05-27 22:40     ` Dave Chinner
  2010-05-27 20:32   ` Andrew Morton
  2 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2010-05-27  6:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -37,6 +37,50 @@
>  LIST_HEAD(super_blocks);
>  DEFINE_SPINLOCK(sb_lock);
>  
> +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> +{
> +	struct super_block *sb;
> +	int count;
> +
> +	sb = container_of(shrink, struct super_block, s_shrink);
> +
> +	/*
> +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> +	 * to recurse into the FS that called us in clear_inode() and friends..
> +	 */
> +	if (!(gfp_mask & __GFP_FS))
> +		return -1;
> +
> +	/*
> +	 * if we can't get the umount lock, then there's no point having the
> +	 * shrinker try again because the sb is being torn down.
> +	 */
> +	if (!down_read_trylock(&sb->s_umount))
> +		return -1;
> +
> +	if (!sb->s_root) {
> +		up_read(&sb->s_umount);
> +		return -1;
> +	}
> +
> +	if (nr_to_scan) {
> +		/* proportion the scan between the two cacheѕ */
> +		int total;
> +
> +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> +
> +		/* prune dcache first as icache is pinned by it */
> +		prune_dcache_sb(sb, count);
> +		prune_icache_sb(sb, nr_to_scan - count);

Hmm, an interesting dynamic that you've changed is that previously
we'd scan dcache LRU proportionately to pagecache, and then scan
inode LRU in proportion to the current number of unused inodes.

But we can think of inodes that are only in use by unused (and aged)
dentries as effectively unused themselves. So this sequence
underestimates how many inodes to scan. This could bias pressure
against the dcache, I'd think, especially considering inodes are far
larger than dentries. Maybe require 2 passes to get at the inodes
made unused in the first pass.

Part of the problem is the funny shrinker API.

The right way to do it is to change the shrinker API so that it passes
down the lru_pages and scanned into the callback. From there, the
shrinkers can calculate the appropriate ratio of objects to scan.
No need for 2-call scheme, no need for shrinker->seeks, and the
ability to calculate an appropriate ratio first for dcache, and *then*
for icache.

A helper of course can do the calculation (considering that every
driver and their dog will do the wrong thing if we let them :)).

unsigned long shrinker_scan(unsigned long lru_pages,
			unsigned long lru_scanned,
			unsigned long nr_objects,
			unsigned long scan_ratio)
{
	unsigned long long tmp = nr_objects;

	tmp *= lru_scanned * 100;
	do_div(tmp, (lru_pages * scan_ratio) + 1);

	return (unsigned long)tmp;
}

Then the shrinker callback will go:
	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
				sb->s_nr_dentry_unused,
				vfs_cache_pressure * SEEKS_PER_DENTRY);
	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
		prune_dcache()

	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
				sb->s_nr_inodes_unused,
				vfs_cache_pressure * SEEKS_PER_INODE);
	...

What do you think of that? Seeing as we're changing the shrinker API
anyway, I'd think it is high time to do something like this.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5 v2] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  6:17           ` Dave Chinner
@ 2010-05-27  6:46             ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2010-05-27  6:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al Viro, linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 04:17:51PM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 05:01:20AM +0100, Al Viro wrote:
> > Um...  Maybe I'm dumb, but what's wrong with doing unregistration from
> > deactivate_locked_super(), right after the call of ->kill_sb()?  At that
> > point ->s_umount is already dropped, so we won't deadlock at all.
> > Shrinker rwsem will make sure that all shrinkers-in-progress will run
> > to completion, so we won't get a superblock freed under prune_super().
> > I don't particulary mind down_try_read() in prune_super(), but why not
> > make life obviously safer?
> > 
> > Am I missing something here?
> 
> I was worried about memory allocation in the ->kill_sb path
> deadlocking on the s_umount lock if it enters reclaim. e.g.  XFS
> inodes can still be dirty even after the VFS has disposed of them,
> and writing them back can require page cache allocation for the
> backing buffers. If allocation recurses back into the shrinker, we
> can deadlock on the s_umount lock.  This doesn't seem like an XFS
> specific problem, so I used a trylock to avoid that whole class of
> problems (same way the current shrinkers do).

If GFP_FS is not set, we wouldn't touch the locks. It is a concern
though, if __GFP_FS allocations were previously permitted under
the exclusive lock.

 
> From there, we can unregister the shrinker before calling ->kill_sb
> as per above. That, in turn, means that the unmount
> invalidate_inodes() vs shrinker race goes away and the iprune_sem is
> not needed in the new prune_icache_sb() function.  I'm pretty sure
> that I can now remove the iprune_sem, but I haven't written the
> patch to do that yet.

I do really like that aspect of your patch. It's nice to have the
shrinker always only operating against active supers. So I would
be in favour of your current scheme.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
                   ` (5 preceding siblings ...)
  2010-05-26 16:44 ` [PATCH 0/5] Per superblock shrinkers V2 Nick Piggin
@ 2010-05-27 20:32 ` Andrew Morton
  2010-05-28  0:30   ` Dave Chinner
  2010-05-28  7:42   ` Artem Bityutskiy
  2010-07-02 12:13 ` Christoph Hellwig
  7 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2010-05-27 20:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, 25 May 2010 18:53:03 +1000
Dave Chinner <david@fromorbit.com> wrote:

> This series reworks the filesystem shrinkers. We currently have a
> set of issues with the current filesystem shrinkers:
> 
>         1. There is an dependency between dentry and inode cache
>            shrinking that is only implicitly defined by the order of
>            shrinker registration.
>         2. The shrinkers need to walk the superblock list and pin
>            the superblock to avoid unmount races with the sb going
>            away.
>         3. The dentry cache uses per-superblock LRUs and proportions
>            reclaim between all the superblocks which means we are
>            doing breadth based reclaim. This means we touch every
>            superblock for every shrinker call, and may only reclaim
>            a single dentry at a time from a given superblock.
>         4. The inode cache has a global LRU, so it has different
>            reclaim patterns to the dentry cache, despite the fact
>            that the dentry cache is generally the only thing that
>            pins inodes in memory.
>         5. Filesystems need to register their own shrinkers for
>            caches and can't co-ordinate them with the dentry and
>            inode cache shrinkers.

Nice description, but...  it never actually told us what the benefits of
the changes are.  Presumably some undescribed workload had some
undescribed user-visible problem.  But what was that workload, and what
was the user-visible problem, and how does the patch affect all this?

Stuff like that.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-25  8:53 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
  2010-05-26 16:17   ` Nick Piggin
@ 2010-05-27 20:32   ` Andrew Morton
  2010-05-27 22:54     ` Dave Chinner
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2010-05-27 20:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, 25 May 2010 18:53:04 +1000
Dave Chinner <david@fromorbit.com> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode unused list is currently a global LRU. This does not match
> the other global filesystem cache - the dentry cache - which uses
> per-superblock LRU lists. Hence we have related filesystem object
> types using different LRU reclaimatin schemes.
> 
> To enable a per-superblock filesystem cache shrinker, both of these
> caches need to have per-sb unused object LRU lists. Hence this patch
> converts the global inode LRU to per-sb LRUs.
> 
> The patch only does rudimentary per-sb propotioning in the shrinker
> infrastructure, as this gets removed when the per-sb shrinker
> callouts are introduced later on.
> 
> ...
>
> +			list_move(&inode->i_list, &inode->i_sb->s_inode_lru);

It's a shame that s_inode_lru is still protected by inode_lock.  One
day we're going to get in trouble over that lock.  Migrating to a
per-sb lock would be logical and might help.

Did you look into this?  I expect we'd end up taking both inode_lock
and the new sb->lru_lock in several places, which wouldn't be of any
help, at least in the interim.  Long-term, the locking for
fs-writeback.c should move to the per-superblock one also, at which
time this problem largely goes away I think.  Unfortunately the
writeback inode lists got moved into the backing_dev_info, which messes
things up a bit.

>  	inodes_stat.nr_unused--;
> +	inode->i_sb->s_nr_inodes_unused--;

It's regrettable to be counting the same thing twice.  Did you look
into removing (or no longer using) inodes_stat.nr_unused?


> +		/* Now, we reclaim unused dentrins with fairness.

May as well fix the typo while we're there.

Please review all these comments to ensure that they are still accurate
and complete.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
  2010-05-26 16:41   ` Nick Piggin
  2010-05-27  6:35   ` Nick Piggin
@ 2010-05-27 20:32   ` Andrew Morton
  2010-05-27 23:01     ` Dave Chinner
  2 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2010-05-27 20:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, 25 May 2010 18:53:06 +1000
Dave Chinner <david@fromorbit.com> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> With context based shrinkers, we can implement a per-superblock
> shrinker that shrinks the caches attached to the superblock. We
> currently have global shrinkers for the inode and dentry caches that
> split up into per-superblock operations via a coarse proportioning
> method that does not batch very well.  The global shrinkers also
> have a dependency - dentries pin inodes - so we have to be very
> careful about how we register the global shrinkers so that the
> implicit call order is always correct.
> 
> With a per-sb shrinker callout, we can encode this dependency
> directly into the per-sb shrinker, hence avoiding the need for
> strictly ordering shrinker registrations. We also have no need for
> any proportioning code for the shrinker subsystem already provides
> this functionality across all shrinkers. Allowing the shrinker to
> operate on a single superblock at a time means that we do less
> superblock list traversals and locking and reclaim should batch more
> effectively. This should result in less CPU overhead for reclaim and
> potentially faster reclaim of items from each filesystem.
> 

I go all tingly when a changelog contains the word "should".

OK, it _should_ do X.  But _does_ it actually do X?

>  fs/super.c         |   53 +++++++++++++++++++++
>  include/linux/fs.h |    7 +++
>  4 files changed, 88 insertions(+), 214 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index dba6b6d..d7bd781 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
>   * which flags are set. This means we don't need to maintain multiple
>   * similar copies of this loop.
>   */
> -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)

Forgot to update the kerneldoc description of `count'.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/5] superblock: add filesystem shrinker operations
  2010-05-25  8:53 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
@ 2010-05-27 20:32   ` Andrew Morton
  0 siblings, 0 replies; 38+ messages in thread
From: Andrew Morton @ 2010-05-27 20:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Tue, 25 May 2010 18:53:07 +1000
Dave Chinner <david@fromorbit.com> wrote:

> From: Dave Chinner <dchinner@redhat.com>
> 
> Now we have a per-superblock shrinker implementation, we can add a
> filesystem specific callout to it to allow filesystem internal
> caches to be shrunk by the superblock shrinker.
> 
> Rather than perpetuate the multipurpose shrinker callback API (i.e.
> nr_to_scan == 0 meaning "tell me how many objects freeable in the
> cache), two operations will be added. The first will return the
> number of objects that are freeable, the second is the actual
> shrinker call.
> 
>
> ...
>
>  static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>  {
>  	struct super_block *sb;
> -	int count;
> +	int	fs_objects = 0;
> +	int	total_objects;
>  
>  	sb = container_of(shrink, struct super_block, s_shrink);
>  
> @@ -63,22 +64,40 @@ static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
>  		return -1;
>  	}
>  
> -	if (nr_to_scan) {
> -		/* proportion the scan between the two cache__ */
> -		int total;
> -
> -		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> -		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> +	if (sb->s_op && sb->s_op->nr_cached_objects)
> +		fs_objects = sb->s_op->nr_cached_objects(sb);
>  
> -		/* prune dcache first as icache is pinned by it */
> -		prune_dcache_sb(sb, count);
> -		prune_icache_sb(sb, nr_to_scan - count);
> +	total_objects = sb->s_nr_dentry_unused +
> +			sb->s_nr_inodes_unused + fs_objects + 1;
> +	if (nr_to_scan) {
> +		int	dentries;
> +		int	inodes;
> +
> +		/* proportion the scan between the cache__ */
> +		dentries = (nr_to_scan * sb->s_nr_dentry_unused) /
> +							total_objects;
> +		inodes = (nr_to_scan * sb->s_nr_inodes_unused) /
> +							total_objects;
> +		if (fs_objects)
> +			fs_objects = (nr_to_scan * fs_objects) /
> +							total_objects;
> +		/*
> +		 * prune the dcache first as the icache is pinned by it, then
> +		 * prune the icache, followed by the filesystem specific caches
> +		 */
> +		prune_dcache_sb(sb, dentries);
> +		prune_icache_sb(sb, inodes);
> +		if (sb->s_op && sb->s_op->free_cached_objects) {

Under which circumstances is a NULL ->free_cached_objects valid?

> +			sb->s_op->free_cached_objects(sb, fs_objects);
> +			fs_objects = sb->s_op->nr_cached_objects(sb);
> +		}
> +		total_objects = sb->s_nr_dentry_unused +
> +				sb->s_nr_inodes_unused + fs_objects;
>  	}

The return value from ->free_cached_objects() doesn't actually get
used.  Instead the code calls ->nr_cached_objects() twice.


> -	count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
> -						* sysctl_vfs_cache_pressure;
> +	total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
>  	up_read(&sb->s_umount);
> -	return count;
> +	return total_objects;
>  }
>  
>  /**
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5bff2dc..efcdcc6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1590,6 +1590,17 @@ struct super_operations {
>  	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
>  #endif
>  	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
> +
> +	/*
> +	 * memory shrinker operations.
> +	 * ->nr_cached_objects() should return the number of freeable cached
> +	 * objects the filesystem holds.
> +	 * ->free_cache_objects() should attempt to free the number of cached
> +	 * objects indicated. It should return how many objects it attempted to
> +	 * free.
> +	 */

I'd have thought that ->free_cache_objects() would always return the
number which it was passed.  Unless someone asked it to scan more
objects than exist, perhaps.

> +	int (*nr_cached_objects)(struct super_block *);
> +	int (*free_cached_objects)(struct super_block *, int);
>  };
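
(For illustration, a filesystem hooking into these operations might
look like the sketch below. Everything apart from the two operation
signatures above is hypothetical - examplefs_prune_cache() is assumed
to walk a fs-private LRU and free up to the requested number of
objects, returning how many it freed.)

static int examplefs_nr_cached_objects(struct super_block *sb)
{
        struct examplefs_sb_info *sbi = sb->s_fs_info;

        /* freeable objects currently sitting in the private cache */
        return sbi->nr_reclaimable;
}

static int examplefs_free_cached_objects(struct super_block *sb, int nr)
{
        struct examplefs_sb_info *sbi = sb->s_fs_info;

        return examplefs_prune_cache(sbi, nr);
}

static const struct super_operations examplefs_super_ops = {
        /* ... the usual operations ... */
        .nr_cached_objects      = examplefs_nr_cached_objects,
        .free_cached_objects    = examplefs_free_cached_objects,
};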


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27  6:35   ` Nick Piggin
@ 2010-05-27 22:40     ` Dave Chinner
  2010-05-28  5:19       ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-27 22:40 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> On Tue, May 25, 2010 at 06:53:06PM +1000, Dave Chinner wrote:
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -37,6 +37,50 @@
> >  LIST_HEAD(super_blocks);
> >  DEFINE_SPINLOCK(sb_lock);
> >  
> > +static int prune_super(struct shrinker *shrink, int nr_to_scan, gfp_t gfp_mask)
> > +{
> > +	struct super_block *sb;
> > +	int count;
> > +
> > +	sb = container_of(shrink, struct super_block, s_shrink);
> > +
> > +	/*
> > +	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
> > +	 * to recurse into the FS that called us in clear_inode() and friends..
> > +	 */
> > +	if (!(gfp_mask & __GFP_FS))
> > +		return -1;
> > +
> > +	/*
> > +	 * if we can't get the umount lock, then there's no point having the
> > +	 * shrinker try again because the sb is being torn down.
> > +	 */
> > +	if (!down_read_trylock(&sb->s_umount))
> > +		return -1;
> > +
> > +	if (!sb->s_root) {
> > +		up_read(&sb->s_umount);
> > +		return -1;
> > +	}
> > +
> > +	if (nr_to_scan) {
> > +		/* proportion the scan between the two cacheѕ */
> > +		int total;
> > +
> > +		total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
> > +		count = (nr_to_scan * sb->s_nr_dentry_unused) / total;
> > +
> > +		/* prune dcache first as icache is pinned by it */
> > +		prune_dcache_sb(sb, count);
> > +		prune_icache_sb(sb, nr_to_scan - count);
> 
> Hmm, an interesting dynamic that you've changed is that previously
> we'd scan dcache LRU proportionately to pagecache, and then scan
> inode LRU in proportion to the current number of unused inodes.
> 
> But we can think of inodes that are only in use by unused (and aged)
> dentries as effectively unused themselves. So this sequence under
> estimates how many inodes to scan. This could bias pressure against
> dcache I'd think, especially considering inodes are far larger than
> dentries. Maybe require 2 passes to get the inodes unused inthe
> first pass.

It's self-balancing - it trends towards an equal number of unused
dentries and inodes in the caches. Yes, it will tear down more
dentries at first, but we need to do that to be able to reclaim
inodes. As reclaim progresses the proportion of inodes increases, so
the amount of inodes reclaimed increases.

Basically this is a recognition that the important cache for
avoiding IO is the inode cache, not the dentry cache. Once the inode
cache is freed, we need to do IO to repopulate it, but
rebuilding dentries from the inode cache only costs CPU time. Hence
under light reclaim, inodes are mostly left in cache but we free up
memory that only costs CPU to rebuild. Under heavy, sustained
reclaim, we trend towards freeing equal amounts of objects from both
caches.

This is pretty much what the current code attempts to do - free a
lot of dentries, then free a smaller amount of the inodes that were
used by the freed dentries. Once again it is a direct encoding of
what is currently an implicit design feature - it makes it *obvious*
how we are trying to balance the caches.

Another reason for this is that the calculation changes again to
allow filesystem caches to modify this proportioning in the next
patch....

FWIW, this also makes workloads that generate hundreds of thousands
of never-to-be-used again negative dentries free dcache memory really
quickly on memory pressure...

> Part of the problem is the funny shrinker API.
> 
> The right way to do it is to change the shrinker API so that it passes
> down the lru_pages and scanned into the callback. From there, the
> shrinkers can calculate the appropriate ratio of objects to scan.
> No need for 2-call scheme, no need for shrinker->seeks, and the
> ability to calculate an appropriate ratio first for dcache, and *then*
> for icache.

My only concern about this is that it exposes the inner workings of the
shrinker and mm subsystem to code that simply doesn't need to know
about it.


> A helper of course can do the calculation (considering that every
> driver and their dog will do the wrong thing if we let them :)).
> 
> unsigned long shrinker_scan(unsigned long lru_pages,
> 			unsigned long lru_scanned,
> 			unsigned long nr_objects,
> 			unsigned long scan_ratio)
> {
> 	unsigned long long tmp = nr_objects;
> 
> 	tmp *= lru_scanned * 100;
> 	do_div(tmp, (lru_pages * scan_ratio) + 1);
> 
> 	return (unsigned long)tmp;
> }
> 
> Then the shrinker callback will go:
> 	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
> 				sb->s_nr_dentry_unused,
> 				vfs_cache_pressure * SEEKS_PER_DENTRY);
> 	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
> 		prune_dcache()
> 
> 	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
> 				sb->s_nr_inodes_unused,
> 				vfs_cache_pressure * SEEKS_PER_INODE);
> 	...
> 
> What do you think of that? Seeing as we're changing the shrinker API
> anyway, I'd think it is high time to do somthing like this.

Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two
call API that matches the current behaviour, leaving the calculation
of how much to reclaim in shrink_slab(). Encoding it this way makes
it more difficult to change the high level behaviour e.g. if we want
to modify the amount of slab reclaim based on reclaim priority, we'd
have to change every shrinker instead of just shrink_slab().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-27 20:32   ` Andrew Morton
@ 2010-05-27 22:54     ` Dave Chinner
  2010-05-28 10:07       ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-27 22:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 01:32:30PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 18:53:04 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The inode unused list is currently a global LRU. This does not match
> > the other global filesystem cache - the dentry cache - which uses
> > per-superblock LRU lists. Hence we have related filesystem object
> > types using different LRU reclaimatin schemes.
> > 
> > To enable a per-superblock filesystem cache shrinker, both of these
> > caches need to have per-sb unused object LRU lists. Hence this patch
> > converts the global inode LRU to per-sb LRUs.
> > 
> > The patch only does rudimentary per-sb propotioning in the shrinker
> > infrastructure, as this gets removed when the per-sb shrinker
> > callouts are introduced later on.
> > 
> > ...
> >
> > +			list_move(&inode->i_list, &inode->i_sb->s_inode_lru);
> 
> It's a shape that s_inode_lru is still protected by inode_lock.  One
> day we're going to get in trouble over that lock.  Migrating to a
> per-sb lock would be logical and might help.
> 
> Did you look into this? 

Yes, I have. Yes, it's possible.  It's solving a different problem,
so I figured it can be done in a different patch set.

> I expect we'd end up taking both inode_lock
> and the new sb->lru_lock in several places, which wouldn't be of any
> help, at least in the interim.  Long-term, the locking for
> fs-writeback.c should move to the per-superblock one also, at which
> time this problem largely goes away I think.  Unfortunately the
> writeback inode lists got moved into the backing_dev_info, whcih messes
> things up a bit.

*nod*

> 
> >  	inodes_stat.nr_unused--;
> > +	inode->i_sb->s_nr_inodes_unused--;
> 
> It's regrettable to be counting the same thing twice.  Did you look
> into removing (or no longer using) inodes_stat.nr_unused?

Sort of. The complexity is that the stats are userspace visible, so they
can't just be removed. Replacing the current stats means that when
they are read from /proc we would need to walk all the superblocks
to aggregate them. The bit I haven't looked at yet is whether
walking superblocks is allowed in a proc handler.

So in the mean time, I just copied what was done for the
dentry_stats. If it's ok to do this walk, then we can change both
the dentry and inode stats at the same time.
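
For reference, with the 2.6.35 superblock iterator the aggregation
itself is small - the open question is only whether doing it from a
/proc handler is acceptable. A sketch, using the s_nr_inodes_unused
field from this series (the helper names are made up):

static void nr_inodes_unused_sb(struct super_block *sb, void *arg)
{
        long *count = arg;

        *count += sb->s_nr_inodes_unused;
}

static long get_nr_inodes_unused(void)
{
        long count = 0;

        /* pins each sb and calls us with s_umount held shared */
        iterate_supers(nr_inodes_unused_sb, &count);
        return count;
}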

> > +		/* Now, we reclaim unused dentrins with fairness.
> 
> May as well fix the typo while we're there.
> 
> Please review all these comments to ensure that they are still accurate
> and complete.

Will do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27 20:32   ` Andrew Morton
@ 2010-05-27 23:01     ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-27 23:01 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 01:32:34PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 18:53:06 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > With context based shrinkers, we can implement a per-superblock
> > shrinker that shrinks the caches attached to the superblock. We
> > currently have global shrinkers for the inode and dentry caches that
> > split up into per-superblock operations via a coarse proportioning
> > method that does not batch very well.  The global shrinkers also
> > have a dependency - dentries pin inodes - so we have to be very
> > careful about how we register the global shrinkers so that the
> > implicit call order is always correct.
> > 
> > With a per-sb shrinker callout, we can encode this dependency
> > directly into the per-sb shrinker, hence avoiding the need for
> > strictly ordering shrinker registrations. We also have no need for
> > any proportioning code for the shrinker subsystem already provides
> > this functionality across all shrinkers. Allowing the shrinker to
> > operate on a single superblock at a time means that we do less
> > superblock list traversals and locking and reclaim should batch more
> > effectively. This should result in less CPU overhead for reclaim and
> > potentially faster reclaim of items from each filesystem.
> > 
> 
> I go all tingly when a changelog contains the word "should".
> 
> OK, it _should_ do X.  But _does_ it actually do X?

As I said to Nick - the tests I ran showed an average improvement of
5%, but the accuracy of the benchmark was +/-10%. Hence it's hard to
draw any firm conclusions from that. It appears to be slightly
faster on an otherwise idle system, but...

As it is, the XFS shrinker that gets integrated into this structure
in a later patch peaks at a higher rate - 150k inodes/s vs 90k
inodes/s with the current shrinker - but still it's hard to quantify
qualitatively. I'm going to run more benchmarks to try to get better
numbers.

> >  fs/super.c         |   53 +++++++++++++++++++++
> >  include/linux/fs.h |    7 +++
> >  4 files changed, 88 insertions(+), 214 deletions(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index dba6b6d..d7bd781 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -456,21 +456,16 @@ static void prune_one_dentry(struct dentry * dentry)
> >   * which flags are set. This means we don't need to maintain multiple
> >   * similar copies of this loop.
> >   */
> > -static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
> > +static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
> 
> Forgot to update the kerneldoc description of `count'.

Will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-05-27 20:32 ` Andrew Morton
@ 2010-05-28  0:30   ` Dave Chinner
  2010-05-28  7:42   ` Artem Bityutskiy
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2010-05-28  0:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, May 27, 2010 at 01:32:23PM -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 18:53:03 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > This series reworks the filesystem shrinkers. We currently have a
> > set of issues with the current filesystem shrinkers:
> > 
> >         1. There is an dependency between dentry and inode cache
> >            shrinking that is only implicitly defined by the order of
> >            shrinker registration.
> >         2. The shrinkers need to walk the superblock list and pin
> >            the superblock to avoid unmount races with the sb going
> >            away.
> >         3. The dentry cache uses per-superblock LRUs and proportions
> >            reclaim between all the superblocks which means we are
> >            doing breadth based reclaim. This means we touch every
> >            superblock for every shrinker call, and may only reclaim
> >            a single dentry at a time from a given superblock.
> >         4. The inode cache has a global LRU, so it has different
> >            reclaim patterns to the dentry cache, despite the fact
> >            that the dentry cache is generally the only thing that
> >            pins inodes in memory.
> >         5. Filesystems need to register their own shrinkers for
> >            caches and can't co-ordinate them with the dentry and
> >            inode cache shrinkers.
> 
> Nice description, but...  it never actually told us what the benefit of
> the changes are. 

The first patch I wrote was a small patch to introduce context to
the shrinker callback and a per-XFS-filesystem shrinker to solve OOM
problems introduced by background reclaim of XFS inodes.  It was
simple, it worked but Nick refused to allow it because of #1 listed
above. He wanted some <handwaves> guarantee that context based
shrinkers would not break the implicit registration dependency
between the dentry and inode cache shrinkers.

We needed a fix for 2.6.34 for XFS, so I was forced to write a
global shrinker which is what introduced all the lockdep problems.
XFS does not have global inode caches, and the lock required to
manage the list of XFS mounts was what caused all the new lockdep
problems.  There are also other lockdep false positive problems w/ XFS
and shrinkers (e.g. iprune_sem and the unmount path) that need to be
fixed.

That's what this patchset tries to address. It results in simpler
code, less code, removal of implicit, undocumented dependencies,
fewer locking shenanigans, no superblock list traversals, provides
filesystems with hooks for cache reclaim without needing shrinker
registration, and fixes all the false positive lockdep
problems XFS has with the current shrinker infrastructure.

If this is all too much, then I'm quite happy to go back to just the
context based shrinker patch and leave everything else alone - the
context based shrinkers are the change we *really* need.  Everything
else in this set of changes is just trying to address objections
raised (that I still don't really understand) against that simple
change.

> Presumably some undescribed workload had some
> undescribed user-visible problem.

$ find . -inum 11111

on a filesystem with more inodes in it than can be held in memory
caused OOM panics.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-27 22:40     ` Dave Chinner
@ 2010-05-28  5:19       ` Nick Piggin
  2010-05-31  6:39         ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Nick Piggin @ 2010-05-28  5:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > But we can think of inodes that are only in use by unused (and aged)
> > dentries as effectively unused themselves. So this sequence under
> > estimates how many inodes to scan. This could bias pressure against
> > dcache I'd think, especially considering inodes are far larger than
> > dentries. Maybe require 2 passes to get the inodes unused inthe
> > first pass.
> 
> It's self-balancing - it trends towards an equal number of unused
> dentries and inodes in the caches. Yes, it will tear down more
> dentries at first, but we need to do that to be able to reclaim
> inodes.

But then it doesn't scan enough inodes on the inode pass.


> As reclaim progresses the proportion of inodes increases, so
> the amount of inodes reclaimed increases. 
> 
> Basically this is a recognition that the important cache for
> avoiding IO is the inode cache, not he dentry cache. Once the inode

You can bias against the dcache using multipliers.


> cache is freed that we need to do IO to repopulate it, but
> rebuilding dentries fromteh inode cache only costs CPU time. Hence
> under light reclaim, inodes are mostly left in cache but we free up
> memory that only costs CPU to rebuild. Under heavy, sustained
> reclaim, we trend towards freeing equal amounts of objects from both
> caches.

I don't know if you've got numbers or patterns to justify that.
My point is that things should stay as close to the old code as
possible without good reason.

 
> This is pretty much what the current code attempts to do - free a
> lot of dentries, then free a smaller amount of the inodes that were
> used by the freed dentries. Once again it is a direct encoding of
> what is currently an implicit design feature - it makes it *obvious*
> how we are trying to balance the caches.

With your patches, if there are no inodes free you would need to take
2 passes at freeing the dentry cache. My suggestion is closer to the
current code.
 

> Another reason for this is that the calculation changes again to
> allow filesystem caches to modiy this proportioning in the next
> patch....
> 
> FWIW, this also makes workloads that generate hundreds of thousands
> of never-to-be-used again negative dentries free dcache memory really
> quickly on memory pressure...

That would still be the case because used inodes aren't getting their
dentries freed so little inode scanning will occur.

> 
> > Part of the problem is the funny shrinker API.
> > 
> > The right way to do it is to change the shrinker API so that it passes
> > down the lru_pages and scanned into the callback. From there, the
> > shrinkers can calculate the appropriate ratio of objects to scan.
> > No need for 2-call scheme, no need for shrinker->seeks, and the
> > ability to calculate an appropriate ratio first for dcache, and *then*
> > for icache.
> 
> My only concern about this is that exposes the inner workings of the
> shrinker and mm subsystem to code that simply doesn't need to know
> about it.

It's just providing a ratio. The shrinkers already know they are
scanning based on a ratio of pagecache scanned.


> > A helper of course can do the calculation (considering that every
> > driver and their dog will do the wrong thing if we let them :)).
> > 
> > unsigned long shrinker_scan(unsigned long lru_pages,
> > 			unsigned long lru_scanned,
> > 			unsigned long nr_objects,
> > 			unsigned long scan_ratio)
> > {
> > 	unsigned long long tmp = nr_objects;
> > 
> > 	tmp *= lru_scanned * 100;
> > 	do_div(tmp, (lru_pages * scan_ratio) + 1);
> > 
> > 	return (unsigned long)tmp;
> > }
> > 
> > Then the shrinker callback will go:
> > 	sb->s_nr_dentry_scan += shrinker_scan(lru_pages, lru_scanned,
> > 				sb->s_nr_dentry_unused,
> > 				vfs_cache_pressure * SEEKS_PER_DENTRY);
> > 	if (sb->s_nr_dentry_scan > SHRINK_BATCH)
> > 		prune_dcache()
> > 
> > 	sb->s_nr_inode_scan += shrinker_scan(lru_pages, lru_scanned,
> > 				sb->s_nr_inodes_unused,
> > 				vfs_cache_pressure * SEEKS_PER_INODE);
> > 	...
> > 
> > What do you think of that? Seeing as we're changing the shrinker API
> > anyway, I'd think it is high time to do somthing like this.
> 
> Ignoring the dcache/icache reclaim ratio issues, I'd prefer a two

Well if it is an issue, it should be changed in a different patch
I think (with numbers).


> call API that matches the current behaviour, leaving the caclulation
> of how much to reclaim in shrink_slab(). Encoding it this way makes
> it more difficult to change the high level behaviour e.g. if we want
> to modify the amount of slab reclaim based on reclaim priority, we'd
> have to cahnge every shrinker instead of just shrink_slab().

We can modify the ratios before calling if needed, or have a default
ratio define to multiply with as well.

But shrinkers are very subsystem specific.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-05-27 20:32 ` Andrew Morton
  2010-05-28  0:30   ` Dave Chinner
@ 2010-05-28  7:42   ` Artem Bityutskiy
  1 sibling, 0 replies; 38+ messages in thread
From: Artem Bityutskiy @ 2010-05-28  7:42 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-mm, xfs

On Thu, 2010-05-27 at 13:32 -0700, Andrew Morton wrote:
> On Tue, 25 May 2010 18:53:03 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > This series reworks the filesystem shrinkers. We currently have a
> > set of issues with the current filesystem shrinkers:
> > 
> >         1. There is an dependency between dentry and inode cache
> >            shrinking that is only implicitly defined by the order of
> >            shrinker registration.
> >         2. The shrinkers need to walk the superblock list and pin
> >            the superblock to avoid unmount races with the sb going
> >            away.
> >         3. The dentry cache uses per-superblock LRUs and proportions
> >            reclaim between all the superblocks which means we are
> >            doing breadth based reclaim. This means we touch every
> >            superblock for every shrinker call, and may only reclaim
> >            a single dentry at a time from a given superblock.
> >         4. The inode cache has a global LRU, so it has different
> >            reclaim patterns to the dentry cache, despite the fact
> >            that the dentry cache is generally the only thing that
> >            pins inodes in memory.
> >         5. Filesystems need to register their own shrinkers for
> >            caches and can't co-ordinate them with the dentry and
> >            inode cache shrinkers.
> 
> Nice description, but...  it never actually told us what the benefit of
> the changes are.  Presumably some undescribed workload had some
> undescribed user-visible problem.  But what was that workload, and what
> was the user-visible problem, and how does the patch affect all this?

For UBIFS it will give a benefit in terms of simpler UBIFS code - we
currently have to keep our own list of UBIFS superblocks, provide
locking for it, and maintain it. This is just extra burden. So item 2
above will be useful for UBIFS.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/5] inode: Make unused inode LRU per superblock
  2010-05-27 22:54     ` Dave Chinner
@ 2010-05-28 10:07       ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2010-05-28 10:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm, xfs

On Fri, May 28, 2010 at 08:54:18AM +1000, Dave Chinner wrote:
> On Thu, May 27, 2010 at 01:32:30PM -0700, Andrew Morton wrote:
> > On Tue, 25 May 2010 18:53:04 +1000
> > Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The inode unused list is currently a global LRU. This does not match
> > > the other global filesystem cache - the dentry cache - which uses
> > > per-superblock LRU lists. Hence we have related filesystem object
> > > types using different LRU reclaimatin schemes.
> > > 
> > > To enable a per-superblock filesystem cache shrinker, both of these
> > > caches need to have per-sb unused object LRU lists. Hence this patch
> > > converts the global inode LRU to per-sb LRUs.
> > > 
> > > The patch only does rudimentary per-sb propotioning in the shrinker
> > > infrastructure, as this gets removed when the per-sb shrinker
> > > callouts are introduced later on.
> > > 
> > > ...
> > >
> > > +			list_move(&inode->i_list, &inode->i_sb->s_inode_lru);
> > 
> > It's a shape that s_inode_lru is still protected by inode_lock.  One
> > day we're going to get in trouble over that lock.  Migrating to a
> > per-sb lock would be logical and might help.
> > 
> > Did you look into this? 
> 
> Yes, I have. Yes, it's possible.  It's solving a different problem,
> so I figured it can be done in a different patch set.

It almost all goes away in my inode lock splitup patches. Inode lru
and dirty lists were the last things protected by the global lock
there.

I am actually going to do per-zone lrus for these guys and per-zone
locks (which is actually better than per-sb because it gives NUMA
scalability within a single sb).

The dirty/writeback lists should probably be per-bdi locked.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-28  5:19       ` Nick Piggin
@ 2010-05-31  6:39         ` Dave Chinner
  2010-05-31  7:28           ` Nick Piggin
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-05-31  6:39 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Fri, May 28, 2010 at 03:19:24PM +1000, Nick Piggin wrote:
> On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > > But we can think of inodes that are only in use by unused (and aged)
> > > dentries as effectively unused themselves. So this sequence under
> > > estimates how many inodes to scan. This could bias pressure against
> > > dcache I'd think, especially considering inodes are far larger than
> > > dentries. Maybe require 2 passes to get the inodes unused inthe
> > > first pass.
> > 
> > It's self-balancing - it trends towards an equal number of unused
> > dentries and inodes in the caches. Yes, it will tear down more
> > dentries at first, but we need to do that to be able to reclaim
> > inodes.
> 
> But then it doesn't scan enough inodes on the inode pass.

We don't get a single shrinker call - we get batches of them for each
shrinker.  The shrinker determines how many objects to scan in the
cache, and that only changes for each shrinker call based on
the shrinker->seeks and the number of objects in the cache.  Once
the number is decided, then the shrinker gets batches of reclaim to
operate on.

Fundamentally, the current shrinker will ask for the same percentage
of each cache to be scanned - if it decides to scan 20% of the
dentry cache, it will also decide to scan 20% of the inode cache.
Hence what the inode shrinker is doing is scanning 20% of the inodes
freed by the dcache shrinker.

In rough numbers, say we have 100k dentries, and the shrinker
calculates it needs to scan 20% of the caches to reclaim them, the
current code will end up with:

		unused dentries		unused inodes
before		 100k			    0
after dentry	  80k			  20k
after inode	  80k			  16k

So we get 20k dentries freed and 4k inodes freed on that shrink_slab
pass.

To contrast this against the code I proposed, I'll make a couple of
simplifications to avoid hurting my brain. That is, I'll assume
SHRINK_BATCH=100 (rather than 128) and forget about rounding
errors. With this, the algorithm I encoded gives roughly the
following for a 20% object reclaim:

number of batches = 20k / 100 = 200

				Unused
		dentries+inodes		dentries	inodes
before		  100k			100k		    0
batch 1		  100k			99900		  100
batch 2		  100k			99800		  200
....
batch 10	  100k			99000		 1000
batch 20	  99990			98010		 1990
batch 30	  99980			97030		 2950
batch 50	  99910			95100		 4810
batch 60	  99860			94150		 5710
.....
batch 200	  98100			81900		16200

And so (roughly) we see that the number of inodes being reclaimed per
set of 10 batches roughly equals the (batch number - 10). Hence over
200 batches, we can expect to see roughly 190 + 180 + ... + 10
inodes reclaimed. That is 1900 inodes.  Similarly for dentries, we
get roughly 1000 + 990 + 980 + ... 810 dentries reclaimed - 18,100
in total.

In other words, we have roughly 18k dentries and 1.9k inodes
reclaimed with the new algorithm I wrote. That does mean it
initially attempts to reclaim dentries faster than the current code, but
as the number of unused inodes increases, this comes back to parity
with the current code and we end up with a 1:1 reclaim ratio.

This is good behaviour - dentries are cheap to reconstruct from the
inode cache, and we should hold onto the inode cache as much as
possible. i.e. we should reclaim them more aggressively only if
there is sustained pressure on the superblock and that is what the
above algorithm does.
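
The trend is easy to reproduce with a throwaway userspace simulation
of the proportioning above. This assumes every freed dentry releases
exactly one inode onto the unused list and ignores the rounding that
was waved away above:

#include <stdio.h>

int main(void)
{
        long dentries = 100000, inodes = 0;     /* unused object counts */
        int batch;

        for (batch = 1; batch <= 200; batch++) {
                long total = dentries + inodes + 1;
                long scan_d = 100 * dentries / total;
                long scan_i = 100 * inodes / total;

                /* freeing a dentry moves its (now unused) inode to the LRU */
                dentries -= scan_d;
                inodes += scan_d;
                /* freeing an unused inode drops it from the cache */
                inodes -= scan_i;

                if (batch == 1 || batch % 50 == 0)
                        printf("batch %3d: %6ld unused dentries, %6ld unused inodes\n",
                               batch, dentries, inodes);
        }
        return 0;
}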

> > As reclaim progresses the proportion of inodes increases, so
> > the amount of inodes reclaimed increases. 
> > 
> > Basically this is a recognition that the important cache for
> > avoiding IO is the inode cache, not he dentry cache. Once the inode
> 
> You can bias against the dcache using multipliers.

Multipliers are not self-balancing, and generally just amplify any
imbalance an algorithm tends towards. The vfs_cache_pressure
multiplier is a shining example of this kind of utterly useless
knob...

> > > Part of the problem is the funny shrinker API.
> > > 
> > > The right way to do it is to change the shrinker API so that it passes
> > > down the lru_pages and scanned into the callback. From there, the
> > > shrinkers can calculate the appropriate ratio of objects to scan.
> > > No need for 2-call scheme, no need for shrinker->seeks, and the
> > > ability to calculate an appropriate ratio first for dcache, and *then*
> > > for icache.
> > 
> > My only concern about this is that exposes the inner workings of the
> > shrinker and mm subsystem to code that simply doesn't need to know
> > about it.
> 
> It's just providing a ratio. The shrinkers allready know they are
> scanning based on a ratio of pagecache scanned.

Sure, but the shrinkers are just a simple mechanism for implementing
VM policy decisions. IMO reclaim policy decisions should not be
pushed down and replicated in every one of these reclaim mechanisms.

> But shrinkers are very subsystem specific.

And as such should concentrate on getting their subsystem reclaim
correct, not have to worry about implementing VM policy
calculations...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure
  2010-05-31  6:39         ` Dave Chinner
@ 2010-05-31  7:28           ` Nick Piggin
  0 siblings, 0 replies; 38+ messages in thread
From: Nick Piggin @ 2010-05-31  7:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Mon, May 31, 2010 at 04:39:38PM +1000, Dave Chinner wrote:
> On Fri, May 28, 2010 at 03:19:24PM +1000, Nick Piggin wrote:
> > On Fri, May 28, 2010 at 08:40:34AM +1000, Dave Chinner wrote:
> > > On Thu, May 27, 2010 at 04:35:23PM +1000, Nick Piggin wrote:
> > > > But we can think of inodes that are only in use by unused (and aged)
> > > > dentries as effectively unused themselves. So this sequence under
> > > > estimates how many inodes to scan. This could bias pressure against
> > > > dcache I'd think, especially considering inodes are far larger than
> > > > dentries. Maybe require 2 passes to get the inodes unused inthe
> > > > first pass.
> > > 
> > > It's self-balancing - it trends towards an equal number of unused
> > > dentries and inodes in the caches. Yes, it will tear down more
> > > dentries at first, but we need to do that to be able to reclaim
> > > inodes.
> > 
> > But then it doesn't scan enough inodes on the inode pass.
> 
> We don't get a single shrinker call - we get batches of them fo each
> shrinker.

OK fair point. However

[...]

> In other words, we have roughly 18k dentries and 1.9k inodes
> reclaimed for the code I wrote new algorithm. That does mean it
> initially attempts to reclaim dentries faster than the current code, but
> as the number of unused inodes increases, this comes back to parity
> with the current code and we end up with a 1:1 reclaim ratio.
> 
> This is good behaviour - dentries are cheap to reconstruct from the
> inode cache, and we should hold onto the inode cache as much as
> possible. i.e. we should reclaim them more aggressively only if
> there is sustained pressure on the superblock and that is what the
> above algorithm does.

I prefer just to keep changes to a minimum and split into separate
patches (each with at least basic test or two showing no regression).

As-is you're already changing global inode/dentry passes into per
sb inode and dentry passes. I think it can only be a good thing
for that changeset if other changes are minimised.

Then if it is so obviously good behaviour to reduce dcache pressure,
it should be easy to justify that too.

 
> > > As reclaim progresses the proportion of inodes increases, so
> > > the amount of inodes reclaimed increases. 
> > > 
> > > Basically this is a recognition that the important cache for
> > > avoiding IO is the inode cache, not he dentry cache. Once the inode
> > 
> > You can bias against the dcache using multipliers.
> 
> Multipliers are not self-balancing, and generally just amplify any
> imbalance an algorithm tends towards. The vfs_cache_pressure
> multiplier is a shining example of this kind of utterly useless
> knob...

Well you can also bias against the dcache with any other means,
including the change you've made here. My main point I guess is
that it should not be in this patchset (or at least should be an
individual patch).

 
> > > > Part of the problem is the funny shrinker API.
> > > > 
> > > > The right way to do it is to change the shrinker API so that it passes
> > > > down the lru_pages and scanned into the callback. From there, the
> > > > shrinkers can calculate the appropriate ratio of objects to scan.
> > > > No need for 2-call scheme, no need for shrinker->seeks, and the
> > > > ability to calculate an appropriate ratio first for dcache, and *then*
> > > > for icache.
> > > 
> > > My only concern about this is that exposes the inner workings of the
> > > shrinker and mm subsystem to code that simply doesn't need to know
> > > about it.
> > 
> > It's just providing a ratio. The shrinkers allready know they are
> > scanning based on a ratio of pagecache scanned.
> 
> Sure, but the shrinkers are just a simple mechanism for implementing
> VM policy decisions. IMO reclaim policy decisions should not be
> pushed down and replicated in every one of these reclaim mechanisms.

Not really. The VM doesn't know about any of those. They are just
told to provide a ratio and some scanning based on some abstract cost.

The VM doesn't know anything about usage patterns, inuse vs unused
objects, exactly how their LRU algorithms are supposed to work, etc.

There is very little policy decision by the VM in the shrinkers.

 
> > But shrinkers are very subsystem specific.
> 
> And as such should concentrate on getting their subsystem reclaim
> correct, not have to worry about implementing VM policy
> calculations...

Clearly they wouldn't with what I was proposing. And the result would
be much more flexible and also gives the shrinkers more information.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
                   ` (6 preceding siblings ...)
  2010-05-27 20:32 ` Andrew Morton
@ 2010-07-02 12:13 ` Christoph Hellwig
  2010-07-12  2:41   ` Dave Chinner
  7 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2010-07-02 12:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

Did you plan to resubmit this with the few review comments addressed?
I'd really hate to not see this in 2.6.36.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-07-02 12:13 ` Christoph Hellwig
@ 2010-07-12  2:41   ` Dave Chinner
  2010-07-12  2:52     ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2010-07-12  2:41 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, linux-mm, xfs

On Fri, Jul 02, 2010 at 08:13:04AM -0400, Christoph Hellwig wrote:
> Did you plan to resubmit this with the few review comments addressed?
> I'd really hate to not see this in 2.6.36.

I've been doing some more testing on it, and while I can get a 25%
reduction in the time to create and remove 10 million inodes with
per-sb shrinker, I can't get the reclaim pattern stable enough for
my liking.

At this point in the cycle, I'd much prefer just to go with adding a
context to the shrinker API to fix the XFS locking issues (i.e.  the
original patches I sent) and spend a bit more time working out which
combination of Nick's and my bits improves reclaim speed whilst
retaining the stability of the current code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 0/5] Per superblock shrinkers V2
  2010-07-12  2:41   ` Dave Chinner
@ 2010-07-12  2:52     ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2010-07-12  2:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm, xfs

On Mon, Jul 12, 2010 at 12:41:04PM +1000, Dave Chinner wrote:
> At this point in the cycle, I'd much prefer just to go with adding a
> context to the shrinker API to fix the XFS locking issues (i.e.  the
> original patches I sent) and spend a bit more time working out which
> combination of Nick's and my bits that improves reclaim speed whilst
> retaining the stability of the courrent code....

That approach sounds good to me.


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2010-07-12  2:52 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-25  8:53 [PATCH 0/5] Per superblock shrinkers V2 Dave Chinner
2010-05-25  8:53 ` [PATCH 1/5] inode: Make unused inode LRU per superblock Dave Chinner
2010-05-26 16:17   ` Nick Piggin
2010-05-26 23:01     ` Dave Chinner
2010-05-27  2:04       ` Nick Piggin
2010-05-27  4:02         ` Dave Chinner
2010-05-27  4:23           ` Nick Piggin
2010-05-27 20:32   ` Andrew Morton
2010-05-27 22:54     ` Dave Chinner
2010-05-28 10:07       ` Nick Piggin
2010-05-25  8:53 ` [PATCH 2/5] mm: add context argument to shrinker callback Dave Chinner
2010-05-25  8:53 ` [PATCH 3/5] superblock: introduce per-sb cache shrinker infrastructure Dave Chinner
2010-05-26 16:41   ` Nick Piggin
2010-05-26 23:12     ` Dave Chinner
2010-05-27  1:53       ` [PATCH 3/5 v2] " Dave Chinner
2010-05-27  4:01         ` Al Viro
2010-05-27  6:17           ` Dave Chinner
2010-05-27  6:46             ` Nick Piggin
2010-05-27  2:19       ` [PATCH 3/5] " Nick Piggin
2010-05-27  4:07         ` Dave Chinner
2010-05-27  4:24           ` Nick Piggin
2010-05-27  6:35   ` Nick Piggin
2010-05-27 22:40     ` Dave Chinner
2010-05-28  5:19       ` Nick Piggin
2010-05-31  6:39         ` Dave Chinner
2010-05-31  7:28           ` Nick Piggin
2010-05-27 20:32   ` Andrew Morton
2010-05-27 23:01     ` Dave Chinner
2010-05-25  8:53 ` [PATCH 4/5] superblock: add filesystem shrinker operations Dave Chinner
2010-05-27 20:32   ` Andrew Morton
2010-05-25  8:53 ` [PATCH 5/5] xfs: make use of new shrinker callout Dave Chinner
2010-05-26 16:44 ` [PATCH 0/5] Per superblock shrinkers V2 Nick Piggin
2010-05-27 20:32 ` Andrew Morton
2010-05-28  0:30   ` Dave Chinner
2010-05-28  7:42   ` Artem Bityutskiy
2010-07-02 12:13 ` Christoph Hellwig
2010-07-12  2:41   ` Dave Chinner
2010-07-12  2:52     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).