linux-kernel.vger.kernel.org archive mirror
* fs: inode rcu freeing and hash walking V2
@ 2010-11-08  7:28 Dave Chinner
  2010-11-08  7:28 ` [PATCH 1/4] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Dave Chinner @ 2010-11-08  7:28 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, eric.dumazet, paulmck

Hi Al,

The following four patches implement the extra functionality you
wanted on top of the inode lock breakup. They lift the i_lock up out
of writeback_single_inode(), implement RCU freeing of inodes via
SLAB_DESTROY_BY_RCU, and convert inode hash lookup operations to use
RCU list walks. Comments are welcome.

I'm undecided whether passing the bucket identifier along with the
hash head is the best way to detect when an RCU traversal has jumped
to a different hash chain. It might be cleaner to just push the hash
chain calculation all the way in.

Most of the changes in this version result from Eric Dumazet's
initial review.

Version 2:
- split the inode slabcache config changes into a separate patch.
- convert inode hash lists to use hlist_nulls variant to enable
  simple checking of whether RCU hash list traversals jumped chains.
- recheck i_ino and i_sb after gaining i_lock during RCU lookups to
  ensure we've got a hold of the correct inode.
- clear i_ino when destroying an inode to allow hash lookups to
  discover freed inodes without needing to lock the inode.
- moved inode->i_lock initialisation to the slab constructor so that
  it can be safely used during RCU traversals.
- introduced hlist_nulls_add_fake() and converted filesystems to use
  it.


* [PATCH 1/4] fs: pull inode->i_lock up out of writeback_single_inode
  2010-11-08  7:28 fs: inode rcu freeing and hash walking V2 Dave Chinner
@ 2010-11-08  7:28 ` Dave Chinner
  2010-11-08  7:28 ` [PATCH 2/4] fs: Convert inode slab creation to use consistent parameters Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2010-11-08  7:28 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, eric.dumazet, paulmck

From: Dave Chinner <dchinner@redhat.com>

The first thing we do in writeback_single_inode() is take the i_lock,
and the last thing we do is drop it. The caller already holds the
i_lock, so pull the i_lock up out of writeback_single_inode() to
reduce the round trips on this lock during inode writeback.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 35669f2..567c14c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -318,9 +318,9 @@ static void inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock.  Either
- * the caller has an active reference on the inode or the inode has I_WILL_FREE
- * set.
+ * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * inode->i_lock.  Either the caller has an active reference on the inode or
+ * the inode has I_WILL_FREE set.
  *
  * If `wait' is set, wait on the writeout.
  *
@@ -335,7 +335,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
-	spin_lock(&inode->i_lock);
 	if (!atomic_read(&inode->i_count))
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
@@ -351,7 +350,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
 			return 0;
 		}
@@ -442,7 +440,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		}
 	}
 	inode_sync_complete(inode);
-	spin_unlock(&inode->i_lock);
 	return ret;
 }
 
@@ -530,7 +527,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		}
 
 		__iget(inode);
-		spin_unlock(&inode->i_lock);
 
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
@@ -541,6 +537,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			 */
 			redirty_tail(inode);
 		}
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_wb_list_lock);
 		iput(inode);
 		cond_resched();
@@ -1212,7 +1209,9 @@ int write_inode_now(struct inode *inode, int sync)
 
 	might_sleep();
 	spin_lock(&inode_wb_list_lock);
+	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, &wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_wb_list_lock);
 	if (sync)
 		inode_sync_wait(inode);
@@ -1236,7 +1235,9 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret;
 
 	spin_lock(&inode_wb_list_lock);
+	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_wb_list_lock);
 	return ret;
 }
-- 
1.7.2.3



* [PATCH 2/4] fs: Convert inode slab creation to use consistent parameters
  2010-11-08  7:28 fs: inode rcu freeing and hash walking V2 Dave Chinner
  2010-11-08  7:28 ` [PATCH 1/4] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
@ 2010-11-08  7:28 ` Dave Chinner
  2011-04-25 17:46   ` Christoph Hellwig
  2010-11-08  7:28 ` [PATCH 3/4] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
  2010-11-08  7:28 ` [PATCH 4/4] fs: rcu protect inode hash lookups Dave Chinner
  3 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2010-11-08  7:28 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, eric.dumazet, paulmck

From: Dave Chinner <dchinner@redhat.com>

The inode slab creation functions in different filesystems don't all set up
their slab caches consistently. Add a new flag, SLAB_INODE_CACHE, that sets
all the necessary flags for an inode slab cache, and convert all the callers
to use it. This makes it easy to change the behaviour of all the inode slab
caches in one go.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 Documentation/filesystems/porting |    5 +++++
 fs/adfs/super.c                   |    3 +--
 fs/affs/super.c                   |    3 +--
 fs/afs/super.c                    |    7 +++----
 fs/befs/linuxvfs.c                |    3 +--
 fs/bfs/inode.c                    |    3 +--
 fs/btrfs/inode.c                  |    2 +-
 fs/ceph/super.c                   |    3 +--
 fs/cifs/cifsfs.c                  |    3 +--
 fs/coda/inode.c                   |    3 +--
 fs/efs/super.c                    |    3 +--
 fs/exofs/super.c                  |    3 +--
 fs/ext2/super.c                   |    3 +--
 fs/ext3/super.c                   |    3 +--
 fs/ext4/super.c                   |    3 +--
 fs/fat/inode.c                    |    3 +--
 fs/freevxfs/vxfs_super.c          |    2 +-
 fs/fuse/inode.c                   |    6 +++---
 fs/gfs2/main.c                    |    3 +--
 fs/hfs/super.c                    |    3 ++-
 fs/hfsplus/super.c                |    3 ++-
 fs/hpfs/super.c                   |    3 +--
 fs/hugetlbfs/inode.c              |    2 +-
 fs/inode.c                        |    4 +---
 fs/isofs/inode.c                  |    4 +---
 fs/jffs2/super.c                  |    3 +--
 fs/jfs/super.c                    |    3 +--
 fs/logfs/inode.c                  |    2 +-
 fs/minix/inode.c                  |    3 +--
 fs/ncpfs/inode.c                  |    3 +--
 fs/nfs/inode.c                    |    3 +--
 fs/nilfs2/super.c                 |    4 ++--
 fs/ntfs/super.c                   |    3 ++-
 fs/ocfs2/dlmfs/dlmfs.c            |    3 +--
 fs/ocfs2/super.c                  |    4 +---
 fs/openpromfs/inode.c             |    4 +---
 fs/proc/inode.c                   |    3 +--
 fs/qnx4/inode.c                   |    3 +--
 fs/reiserfs/super.c               |    7 ++-----
 fs/romfs/super.c                  |    3 +--
 fs/squashfs/super.c               |    3 ++-
 fs/sysv/inode.c                   |    3 +--
 fs/ubifs/super.c                  |    2 +-
 fs/udf/super.c                    |    3 +--
 fs/ufs/super.c                    |    3 +--
 fs/xfs/linux-2.6/kmem.h           |    1 +
 fs/xfs/linux-2.6/xfs_super.c      |    4 ++--
 include/linux/slab.h              |    6 ++++++
 ipc/mqueue.c                      |    3 ++-
 mm/shmem.c                        |    2 +-
 net/socket.c                      |    9 +++------
 net/sunrpc/rpc_pipe.c             |    5 ++---
 52 files changed, 76 insertions(+), 102 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 2da4b44..b68438b 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -319,3 +319,8 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+[mandatory]
+	Inodes must be allocated via a slab cache created with the
+SLAB_INODE_CACHE flag set. This sets all the necessary slab cache flags for
+correct operation and control of the cache across the system.
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 959dbff..bcdf6cc 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -256,8 +256,7 @@ static int init_inodecache(void)
 {
 	adfs_inode_cachep = kmem_cache_create("adfs_inode_cache",
 					     sizeof(struct adfs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (adfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/affs/super.c b/fs/affs/super.c
index 0cf7f43..cb335c6 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -113,8 +113,7 @@ static int init_inodecache(void)
 {
 	affs_inode_cachep = kmem_cache_create("affs_inode_cache",
 					     sizeof(struct affs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (affs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 27201cf..8995f48 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -87,10 +87,9 @@ int __init afs_fs_init(void)
 
 	ret = -ENOMEM;
 	afs_inode_cachep = kmem_cache_create("afs_inode_cache",
-					     sizeof(struct afs_vnode),
-					     0,
-					     SLAB_HWCACHE_ALIGN,
-					     afs_i_init_once);
+					sizeof(struct afs_vnode), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+					afs_i_init_once);
 	if (!afs_inode_cachep) {
 		printk(KERN_NOTICE "kAFS: Failed to allocate inode cache\n");
 		return ret;
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index aa4e7c7..d95dab2 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -431,8 +431,7 @@ befs_init_inodecache(void)
 {
 	befs_inode_cachep = kmem_cache_create("befs_inode_cache",
 					      sizeof (struct befs_inode_info),
-					      0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					      0, SLAB_INODE_CACHE,
 					      init_once);
 	if (befs_inode_cachep == NULL) {
 		printk(KERN_ERR "befs_init_inodecache: "
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 76db6d7..c76a689 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -264,8 +264,7 @@ static int init_inodecache(void)
 {
 	bfs_inode_cachep = kmem_cache_create("bfs_inode_cache",
 					     sizeof(struct bfs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (bfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 64f99cf..80d0a64 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6369,7 +6369,7 @@ int btrfs_init_cachep(void)
 {
 	btrfs_inode_cachep = kmem_cache_create("btrfs_inode_cache",
 			sizeof(struct btrfs_inode), 0,
-			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, init_once);
+			SLAB_INODE_CACHE, init_once);
 	if (!btrfs_inode_cachep)
 		goto fail;
 
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 08b460a..72e40f0 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -521,8 +521,7 @@ static int __init init_caches(void)
 	ceph_inode_cachep = kmem_cache_create("ceph_inode_info",
 				      sizeof(struct ceph_inode_info),
 				      __alignof__(struct ceph_inode_info),
-				      (SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD),
-				      ceph_inode_init_once);
+				      SLAB_INODE_CACHE, ceph_inode_init_once);
 	if (ceph_inode_cachep == NULL)
 		return -ENOMEM;
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 8bd5c2c..34fc2b5 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -784,8 +784,7 @@ cifs_init_inodecache(void)
 {
 	cifs_inode_cachep = kmem_cache_create("cifs_inode_cache",
 					      sizeof(struct cifsInodeInfo),
-					      0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					      0, SLAB_INODE_CACHE,
 					      cifs_init_once);
 	if (cifs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 5ea57c8..fc0f4a1 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -72,8 +72,7 @@ int coda_init_inodecache(void)
 {
 	coda_inode_cachep = kmem_cache_create("coda_inode_cache",
 				sizeof(struct coda_inode_info),
-				0, SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
-				init_once);
+				0, SLAB_INODE_CACHE, init_once);
 	if (coda_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/efs/super.c b/fs/efs/super.c
index 5073a07..c6ee8eb 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -81,8 +81,7 @@ static int init_inodecache(void)
 {
 	efs_inode_cachep = kmem_cache_create("efs_inode_cache",
 				sizeof(struct efs_inode_info),
-				0, SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
-				init_once);
+				0, SLAB_INODE_CACHE, init_once);
 	if (efs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 79c3ae6..cd5e908 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -175,8 +175,7 @@ static int init_inodecache(void)
 {
 	exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
 				sizeof(struct exofs_i_info), 0,
-				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
-				exofs_init_once);
+				SLAB_INODE_CACHE, exofs_init_once);
 	if (exofs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d89e0b6..87bdcee 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -182,8 +182,7 @@ static int init_inodecache(void)
 {
 	ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
 					     sizeof(struct ext2_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ext2_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 2fedaf8..3cd2e00 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -509,8 +509,7 @@ static int init_inodecache(void)
 {
 	ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
 					     sizeof(struct ext3_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ext3_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 40131b7..ee94fbc3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -862,8 +862,7 @@ static int init_inodecache(void)
 {
 	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
 					     sizeof(struct ext4_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ext4_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index ad6998a..84d8cde 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -535,8 +535,7 @@ static int __init fat_init_inodecache(void)
 {
 	fat_inode_cachep = kmem_cache_create("fat_inode_cache",
 					     sizeof(struct msdos_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (fat_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/freevxfs/vxfs_super.c b/fs/freevxfs/vxfs_super.c
index 9d1c995..225f18a 100644
--- a/fs/freevxfs/vxfs_super.c
+++ b/fs/freevxfs/vxfs_super.c
@@ -267,7 +267,7 @@ vxfs_init(void)
 
 	vxfs_inode_cachep = kmem_cache_create("vxfs_inode",
 			sizeof(struct vxfs_inode_info), 0,
-			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+			SLAB_INODE_CACHE, NULL);
 	if (!vxfs_inode_cachep)
 		return -ENOMEM;
 	rv = register_filesystem(&vxfs_fs_type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index cfce3ad..df9c272 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1138,9 +1138,9 @@ static int __init fuse_fs_init(void)
 		goto out_unreg;
 
 	fuse_inode_cachep = kmem_cache_create("fuse_inode",
-					      sizeof(struct fuse_inode),
-					      0, SLAB_HWCACHE_ALIGN,
-					      fuse_inode_init_once);
+					sizeof(struct fuse_inode), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+					fuse_inode_init_once);
 	err = -ENOMEM;
 	if (!fuse_inode_cachep)
 		goto out_unreg2;
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index ebef7ab..3100429 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -108,8 +108,7 @@ static int __init init_gfs2_fs(void)
 
 	gfs2_inode_cachep = kmem_cache_create("gfs2_inode",
 					      sizeof(struct gfs2_inode),
-					      0,  SLAB_RECLAIM_ACCOUNT|
-					          SLAB_MEM_SPREAD,
+					      0, SLAB_INODE_CACHE,
 					      gfs2_init_inode_once);
 	if (!gfs2_inode_cachep)
 		goto fail;
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 4824c27..dc1a909 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -467,7 +467,8 @@ static int __init init_hfs_fs(void)
 	int err;
 
 	hfs_inode_cachep = kmem_cache_create("hfs_inode_cache",
-		sizeof(struct hfs_inode_info), 0, SLAB_HWCACHE_ALIGN,
+		sizeof(struct hfs_inode_info), 0,
+		SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
 		hfs_init_once);
 	if (!hfs_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 52cc746..ecce369 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -521,7 +521,8 @@ static int __init init_hfsplus_fs(void)
 	int err;
 
 	hfsplus_inode_cachep = kmem_cache_create("hfsplus_icache",
-		HFSPLUS_INODE_SIZE, 0, SLAB_HWCACHE_ALIGN,
+		HFSPLUS_INODE_SIZE, 0,
+		SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
 		hfsplus_init_once);
 	if (!hfsplus_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index bb69389..2fba9e0 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -195,8 +195,7 @@ static int init_inodecache(void)
 {
 	hpfs_inode_cachep = kmem_cache_create("hpfs_inode_cache",
 					     sizeof(struct hpfs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (hpfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d6cfac1..8170b8d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -996,7 +996,7 @@ static int __init init_hugetlbfs_fs(void)
 
 	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
 					sizeof(struct hugetlbfs_inode_info),
-					0, 0, init_once);
+					0, SLAB_INODE_CACHE, init_once);
 	if (hugetlbfs_inode_cachep == NULL)
 		goto out2;
 
diff --git a/fs/inode.c b/fs/inode.c
index fc3b0a5..5592d74 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1697,9 +1697,7 @@ void __init inode_init(void)
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
-					 0,
-					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
-					 SLAB_MEM_SPREAD),
+					 0, SLAB_INODE_CACHE | SLAB_PANIC,
 					 init_once);
 	register_shrinker(&icache_shrinker);
 	percpu_counter_init(&nr_inodes, 0);
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index bfdeb82..32e2adf 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -81,9 +81,7 @@ static int init_inodecache(void)
 {
 	isofs_inode_cachep = kmem_cache_create("isofs_inode_cache",
 					sizeof(struct iso_inode_info),
-					0, (SLAB_RECLAIM_ACCOUNT|
-					SLAB_MEM_SPREAD),
-					init_once);
+					0, SLAB_INODE_CACHE, init_once);
 	if (isofs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index c86041b..b056fd1 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -259,8 +259,7 @@ static int __init init_jffs2_fs(void)
 
 	jffs2_inode_cachep = kmem_cache_create("jffs2_i",
 					     sizeof(struct jffs2_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     jffs2_i_init_once);
 	if (!jffs2_inode_cachep) {
 		printk(KERN_ERR "JFFS2 error: Failed to initialise inode cache\n");
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 0669fc1..87d8ecc 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -795,8 +795,7 @@ static int __init init_jfs_fs(void)
 
 	jfs_inode_cachep =
 	    kmem_cache_create("jfs_ip", sizeof(struct jfs_inode_info), 0,
-			    SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
-			    init_once);
+					     SLAB_INODE_CACHE, init_once);
 	if (jfs_inode_cachep == NULL)
 		return -ENOMEM;
 
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index 38c559e..ab812c2 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -385,7 +385,7 @@ const struct super_operations logfs_super_operations = {
 int logfs_init_inode_cache(void)
 {
 	logfs_inode_cache = kmem_cache_create("logfs_inode_cache",
-			sizeof(struct logfs_inode), 0, SLAB_RECLAIM_ACCOUNT,
+			sizeof(struct logfs_inode), 0, SLAB_INODE_CACHE,
 			logfs_init_once);
 	if (!logfs_inode_cache)
 		return -ENOMEM;
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index fb20208..cb0a7f8 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -84,8 +84,7 @@ static int init_inodecache(void)
 {
 	minix_inode_cachep = kmem_cache_create("minix_inode_cache",
 					     sizeof(struct minix_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (minix_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index d290545..bf4394f 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -76,8 +76,7 @@ static int init_inodecache(void)
 {
 	ncp_inode_cachep = kmem_cache_create("ncp_inode_cache",
 					     sizeof(struct ncp_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ncp_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 314f571..5f3ca89 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1474,8 +1474,7 @@ static int __init nfs_init_inodecache(void)
 {
 	nfs_inode_cachep = kmem_cache_create("nfs_inode_cache",
 					     sizeof(struct nfs_inode),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (nfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index f804d41..19696af 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1287,8 +1287,8 @@ static void nilfs_destroy_cachep(void)
 static int __init nilfs_init_cachep(void)
 {
 	nilfs_inode_cachep = kmem_cache_create("nilfs2_inode_cache",
-			sizeof(struct nilfs_inode_info), 0,
-			SLAB_RECLAIM_ACCOUNT, nilfs_inode_init_once);
+			sizeof(struct nilfs_inode_info), 0, SLAB_INODE_CACHE,
+			nilfs_inode_init_once);
 	if (!nilfs_inode_cachep)
 		goto fail;
 
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index a30ecacc..1922450 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -3136,9 +3136,10 @@ static int __init init_ntfs_fs(void)
 		goto inode_err_out;
 	}
 
+	/* ntfs_big_inode_cache is the inode cache used for VFS level inodes */
 	ntfs_big_inode_cache = kmem_cache_create(ntfs_big_inode_cache_name,
 			sizeof(big_ntfs_inode), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
+			SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
 			ntfs_big_inode_init_once);
 	if (!ntfs_big_inode_cache) {
 		printk(KERN_CRIT "NTFS: Failed to create %s!\n",
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index b2df490..1eb031e 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -669,8 +669,7 @@ static int __init init_dlmfs_fs(void)
 
 	dlmfs_inode_cache = kmem_cache_create("dlmfs_inode_cache",
 				sizeof(struct dlmfs_inode_private),
-				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
-					SLAB_MEM_SPREAD),
+				0, (SLAB_HWCACHE_ALIGN|SLAB_INODE_CACHE),
 				dlmfs_init_once);
 	if (!dlmfs_inode_cache) {
 		status = -ENOMEM;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index f02c0ef..1a15357 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1791,9 +1791,7 @@ static int ocfs2_initialize_mem_caches(void)
 {
 	ocfs2_inode_cachep = kmem_cache_create("ocfs2_inode_cache",
 				       sizeof(struct ocfs2_inode_info),
-				       0,
-				       (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+				       0, (SLAB_HWCACHE_ALIGN|SLAB_INODE_CACHE),
 				       ocfs2_inode_init_once);
 	ocfs2_dquot_cachep = kmem_cache_create("ocfs2_dquot_cache",
 					sizeof(struct ocfs2_dquot),
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ddb1f41..87763f8 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -441,9 +441,7 @@ static int __init init_openprom_fs(void)
 
 	op_inode_cachep = kmem_cache_create("op_inode_cache",
 					    sizeof(struct op_inode_info),
-					    0,
-					    (SLAB_RECLAIM_ACCOUNT |
-					     SLAB_MEM_SPREAD),
+					    0, SLAB_INODE_CACHE,
 					    op_inode_init_once);
 	if (!op_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 9c2b5f4..ef1c2b3 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -82,8 +82,7 @@ void __init proc_init_inodecache(void)
 {
 	proc_inode_cachep = kmem_cache_create("proc_inode_cache",
 					     sizeof(struct proc_inode),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD|SLAB_PANIC),
+					     0, SLAB_INODE_CACHE | SLAB_PANIC,
 					     init_once);
 }
 
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index fcada42..692c74d3 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -441,8 +441,7 @@ static int init_inodecache(void)
 {
 	qnx4_inode_cachep = kmem_cache_create("qnx4_inode_cache",
 					     sizeof(struct qnx4_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (qnx4_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 3bf7a64..4fc388d 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -546,11 +546,8 @@ static void init_once(void *foo)
 static int init_inodecache(void)
 {
 	reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
-						  sizeof(struct
-							 reiserfs_inode_info),
-						  0, (SLAB_RECLAIM_ACCOUNT|
-							SLAB_MEM_SPREAD),
-						  init_once);
+					sizeof(struct reiserfs_inode_info),
+					0, SLAB_INODE_CACHE, init_once);
 	if (reiserfs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 6647f90..6ac84bc 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -618,8 +618,7 @@ static int __init init_romfs_fs(void)
 	romfs_inode_cachep =
 		kmem_cache_create("romfs_i",
 				  sizeof(struct romfs_inode_info), 0,
-				  SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
-				  romfs_i_init_once);
+				  SLAB_INODE_CACHE, romfs_i_init_once);
 
 	if (!romfs_inode_cachep) {
 		printk(KERN_ERR
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 24de30b..c7495fe 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -392,7 +392,8 @@ static int __init init_inodecache(void)
 {
 	squashfs_inode_cachep = kmem_cache_create("squashfs_inode_cache",
 		sizeof(struct squashfs_inode_info), 0,
-		SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, init_once);
+		SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+		init_once);
 
 	return squashfs_inode_cachep ? 0 : -ENOMEM;
 }
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index de44d06..8c96780 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -360,8 +360,7 @@ const struct super_operations sysv_sops = {
 int __init sysv_init_icache(void)
 {
 	sysv_inode_cachep = kmem_cache_create("sysv_inode_cache",
-			sizeof(struct sysv_inode_info), 0,
-			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
+			sizeof(struct sysv_inode_info), 0, SLAB_INODE_CACHE,
 			init_once);
 	if (!sysv_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 91fac54..38c8395 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2191,7 +2191,7 @@ static int __init ubifs_init(void)
 	err = -ENOMEM;
 	ubifs_inode_slab = kmem_cache_create("ubifs_inode_slab",
 				sizeof(struct ubifs_inode), 0,
-				SLAB_MEM_SPREAD | SLAB_RECLAIM_ACCOUNT,
+				SLAB_INODE_CACHE,
 				&inode_slab_ctor);
 	if (!ubifs_inode_slab)
 		goto out_reg;
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 4a5c7c6..0e1c5d7 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -156,8 +156,7 @@ static int init_inodecache(void)
 {
 	udf_inode_cachep = kmem_cache_create("udf_inode_cache",
 					     sizeof(struct udf_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT |
-						 SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (!udf_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 2c47dae..df8020b 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1428,8 +1428,7 @@ static int init_inodecache(void)
 {
 	ufs_inode_cachep = kmem_cache_create("ufs_inode_cache",
 					     sizeof(struct ufs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ufs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/xfs/linux-2.6/kmem.h b/fs/xfs/linux-2.6/kmem.h
index f7c8f7a..4e6b372 100644
--- a/fs/xfs/linux-2.6/kmem.h
+++ b/fs/xfs/linux-2.6/kmem.h
@@ -82,6 +82,7 @@ extern void *kmem_zalloc_greedy(size_t *, size_t, size_t);
 #define KM_ZONE_HWALIGN	SLAB_HWCACHE_ALIGN
 #define KM_ZONE_RECLAIM	SLAB_RECLAIM_ACCOUNT
 #define KM_ZONE_SPREAD	SLAB_MEM_SPREAD
+#define KM_ZONE_INODES	SLAB_INODE_CACHE
 
 #define kmem_zone	kmem_cache
 #define kmem_zone_t	struct kmem_cache
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 9f3a78f..53ab47f 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1718,8 +1718,8 @@ xfs_init_zones(void)
 
 	xfs_inode_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
-			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD,
-			xfs_fs_inode_init_once);
+					KM_ZONE_HWALIGN | KM_ZONE_INODES,
+					xfs_fs_inode_init_once);
 	if (!xfs_inode_zone)
 		goto out_destroy_efi_zone;
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 59260e2..dca23c5 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -93,6 +93,12 @@
 				(unsigned long)ZERO_SIZE_PTR)
 
 /*
+ * Set the default flags necessary for inode caches manipulated by the VFS.
+ */
+#define SLAB_INODE_CACHE	(SLAB_MEM_SPREAD | \
+				 SLAB_RECLAIM_ACCOUNT)
+
+/*
  * struct kmem_cache related prototypes
  */
 void __init kmem_cache_init(void);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 035f439..82dfa7c 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1268,7 +1268,8 @@ static int __init init_mqueue_fs(void)
 
 	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
 				sizeof(struct mqueue_inode_info), 0,
-				SLAB_HWCACHE_ALIGN, init_once);
+				SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+				init_once);
 	if (mqueue_inode_cachep == NULL)
 		return -ENOMEM;
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 47fdeeb..9a3bd83 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2435,7 +2435,7 @@ static int init_inodecache(void)
 {
 	shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
 				sizeof(struct shmem_inode_info),
-				0, SLAB_PANIC, init_once);
+				0, SLAB_INODE_CACHE|SLAB_PANIC, init_once);
 	return 0;
 }
 
diff --git a/net/socket.c b/net/socket.c
index 5247ae1..1f1563d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -288,12 +288,9 @@ static void init_once(void *foo)
 static int init_inodecache(void)
 {
 	sock_inode_cachep = kmem_cache_create("sock_inode_cache",
-					      sizeof(struct socket_alloc),
-					      0,
-					      (SLAB_HWCACHE_ALIGN |
-					       SLAB_RECLAIM_ACCOUNT |
-					       SLAB_MEM_SPREAD),
-					      init_once);
+			      sizeof(struct socket_alloc), 0,
+			      SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+			      init_once);
 	if (sock_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 10a17a3..68733fb 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1056,9 +1056,8 @@ int register_rpc_pipefs(void)
 	int err;
 
 	rpc_inode_cachep = kmem_cache_create("rpc_inode_cache",
-				sizeof(struct rpc_inode),
-				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+				sizeof(struct rpc_inode), 0,
+				SLAB_HWCACHE_ALIGN|SLAB_INODE_CACHE,
 				init_once);
 	if (!rpc_inode_cachep)
 		return -ENOMEM;
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/4] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU
  2010-11-08  7:28 fs: inode rcu freeing and hash walking V2 Dave Chinner
  2010-11-08  7:28 ` [PATCH 1/4] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
  2010-11-08  7:28 ` [PATCH 2/4] fs: Convert inode slab creation to use consistent parameters Dave Chinner
@ 2010-11-08  7:28 ` Dave Chinner
  2010-11-08  7:28 ` [PATCH 4/4] fs: rcu protect inode hash lookups Dave Chinner
  3 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2010-11-08  7:28 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, eric.dumazet, paulmck

From: Dave Chinner <dchinner@redhat.com>

Change the inode caches to use RCU freed inodes via the SLAB_DESTROY_BY_RCU
method. This is the first step in converting the inode hash lookups to use RCU
methods.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
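[Editorial sketch] The behaviour SLAB_DESTROY_BY_RCU implies for lookups can be modelled in user space: the memory a lockless walker is examining may be freed and reused as a *different* inode at any moment, so identity fields must be rechecked after the object is locked. The types and names below (fake_inode, lookup_valid) are hypothetical simplifications, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct inode: just the identity fields a
 * SLAB_DESTROY_BY_RCU lookup has to revalidate. */
struct fake_inode {
	unsigned long	i_ino;		/* cleared when the inode is destroyed */
	void		*i_sb;		/* may be rewritten on reallocation */
	int		hashed;		/* 0 once removed from the hash */
};

/* In the real code this check runs under inode->i_lock: the object is
 * guaranteed to be *an* inode, but not the inode we matched before
 * locking, so recheck that it is still hashed and still matches. */
static int lookup_valid(struct fake_inode *in, void *sb, unsigned long ino)
{
	if (!in->hashed)
		return 0;
	return in->i_sb == sb && in->i_ino == ino;
}
```

If the slab reuses the object between the unlocked match and the locked recheck, the identity test fails and the walker simply skips it.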
---
 include/linux/slab.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index dca23c5..1fea5f1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -96,6 +96,7 @@
  * Set the default flags necessary for inode caches manipulated by the VFS.
  */
 #define SLAB_INODE_CACHE	(SLAB_MEM_SPREAD | \
+				 SLAB_DESTROY_BY_RCU | \
 				 SLAB_RECLAIM_ACCOUNT)
 
 /*
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/4] fs: rcu protect inode hash lookups
  2010-11-08  7:28 fs: inode rcu freeing and hash walking V2 Dave Chinner
                   ` (2 preceding siblings ...)
  2010-11-08  7:28 ` [PATCH 3/4] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
@ 2010-11-08  7:28 ` Dave Chinner
  2010-11-08  8:53   ` Eric Dumazet
  2010-11-08 11:10   ` Christoph Hellwig
  3 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2010-11-08  7:28 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel, eric.dumazet, paulmck

From: Dave Chinner <dchinner@redhat.com>

Now that inodes are using RCU freeing, we can walk the hash lists
using RCU protection during lookups. Convert all the hash list
operations to use RCU-based operators and drop the inode_hash_lock
around pure lookup operations.

Because we are using SLAB_DESTROY_BY_RCU, we need to be careful
about the lockless lookups and the way we initialise the struct
inode during allocation. The inode->i_lock is the critical item we
need to avoid touching when allocating an inode, so move the
initialisation of this lock to the slab constructor so that we can
reliably use the lock even after the inode has been freed or is in
the process of being allocated or initialised.

We also need to change the order of checks during hash lookups to
validate we have an active inode before we skip it. That means we
need to revalidate the inode number and superblock after we have
checked it for being freed or unhashed under the inode->i_lock.

Finally, during traversals an inode can be freed and reallocated and
linked into a new hash chain, so an RCU safe traversal can jump to a
different chain without indication. On a cache miss we need to
validate that we ended the search on the same hash chain we started
on. Do this by converting the hash lists to the hlist_null list type
and checking the list null on walk termination.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
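[Editorial sketch] The chain-jump detection the last paragraph describes relies on encoding the chain number into each list's terminating "nulls" pointer. A minimal user-space model of the idea (make_nulls and walk_chain are hypothetical helper names; is_a_nulls and get_nulls_value mirror the kernel's list_nulls accessors, but the encoding here is a simplified assumption):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct nulls_node {
	struct nulls_node *next;
};

/* A "nulls" terminator has bit 0 set; the chain id lives in the
 * upper bits. Real pointers are at least 2-byte aligned, so bit 0
 * unambiguously distinguishes terminators from nodes. */
static struct nulls_node *make_nulls(unsigned long chain)
{
	return (struct nulls_node *)((chain << 1) | 1UL);
}

static int is_a_nulls(const struct nulls_node *p)
{
	return ((uintptr_t)p & 1UL) != 0;
}

static unsigned long get_nulls_value(const struct nulls_node *p)
{
	return (uintptr_t)p >> 1;
}

/* Walk a chain and return the nulls value it terminated on. A real
 * lockless lookup compares this against the chain it started on and
 * restarts the search if they differ, because a freed-and-reallocated
 * inode can silently lead the walker onto another chain. */
static unsigned long walk_chain(struct nulls_node *head)
{
	struct nulls_node *p = head;

	while (!is_a_nulls(p))
		p = p->next;
	return get_nulls_value(p);
}
```

On a cache miss, `get_nulls_value(node) != chain` is exactly this "did we end where we started?" test.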
---
 fs/hfs/inode.c             |    2 +-
 fs/hfsplus/inode.c         |    2 +-
 fs/inode.c                 |  234 +++++++++++++++++++++++++++++++-------------
 fs/jfs/jfs_imap.c          |    2 +-
 include/linux/fs.h         |    5 +-
 include/linux/list_nulls.h |    7 ++
 6 files changed, 178 insertions(+), 74 deletions(-)

diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index dffb4e9..4fcdf03 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -524,7 +524,7 @@ static struct dentry *hfs_file_lookup(struct inode *dir, struct dentry *dentry,
 	HFS_I(inode)->rsrc_inode = dir;
 	HFS_I(dir)->rsrc_inode = inode;
 	igrab(dir);
-	hlist_add_fake(&inode->i_hash);
+	hlist_nulls_add_fake(&inode->i_hash);
 	mark_inode_dirty(inode);
 out:
 	d_add(dentry, inode);
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 8afd7e8..0167d18 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -211,7 +211,7 @@ static struct dentry *hfsplus_file_lookup(struct inode *dir, struct dentry *dent
 	 * appear hashed, but do not put on any lists.  hlist_del()
 	 * will work fine and require no locking.
 	 */
-	hlist_add_fake(&inode->i_hash);
+	hlist_nulls_add_fake(&inode->i_hash);
 
 	mark_inode_dirty(inode);
 out:
diff --git a/fs/inode.c b/fs/inode.c
index 5592d74..cf5180c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -51,11 +51,12 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode_sb_list_lock
- *   inode->i_lock
+ *   rcu_read_lock
+ *     inode_sb_list_lock
+ *     inode->i_lock
  *
  * iunique_lock
- *   inode_hash_lock
+ *   rcu_read_lock
  */
 
 /*
@@ -91,7 +92,7 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
-static struct hlist_head *inode_hashtable __read_mostly;
+static struct hlist_nulls_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
 /*
@@ -173,6 +174,12 @@ int proc_nr_inodes(ctl_table *table, int write,
  *
  * These are initializations that need to be done on every inode
  * allocation as the fields are not initialised by slab allocation.
+ *
+ * Because we are using SLAB_DESTROY_BY_RCU, we must not initialise fields used
+ * by list traversals to detect freed inodes here. Checks are made on i_ino,
+ * i_sb, i_hash and i_state, and are protected by the i_lock field. We can
+ * safely overwrite the i_sb field here as we do not reset the i_ino or i_state
+ * fields until the inode is about to be inserted into the hash.
  */
 int inode_init_always(struct super_block *sb, struct inode *inode)
 {
@@ -206,7 +213,6 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 
 	if (security_inode_alloc(inode))
 		goto out;
-	spin_lock_init(&inode->i_lock);
 	lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
 
 	mutex_init(&inode->i_mutex);
@@ -286,6 +292,12 @@ void __destroy_inode(struct inode *inode)
 	if (inode->i_default_acl && inode->i_default_acl != ACL_NOT_CACHED)
 		posix_acl_release(inode->i_default_acl);
 #endif
+	/*
+	 * reset the inode number so during RCU traversals we do not match this
+	 * inode in any lookups until it is fully re-initialised again during
+	 * allocation.
+	 */
+	inode->i_ino = 0;
 	percpu_counter_dec(&nr_inodes);
 }
 EXPORT_SYMBOL(__destroy_inode);
@@ -308,7 +320,6 @@ static void destroy_inode(struct inode *inode)
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_dentry);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_wb_list);
@@ -321,6 +332,7 @@ void inode_init_once(struct inode *inode)
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
 	INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
 	i_size_ordered_init(inode);
+	spin_lock_init(&inode->i_lock);
 #ifdef CONFIG_FSNOTIFY
 	INIT_HLIST_HEAD(&inode->i_fsnotify_marks);
 #endif
@@ -410,11 +422,11 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_nulls_head *b = inode_hashtable + hash(inode->i_sb, hashval);
 
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_nulls_add_head_rcu(&inode->i_hash, b);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -430,7 +442,7 @@ void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_del_init(&inode->i_hash);
+	hlist_nulls_del_init_rcu(&inode->i_hash);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -736,32 +748,82 @@ static struct shrinker icache_shrinker = {
 };
 
 static void __wait_on_freeing_inode(struct inode *inode);
+
 /*
- * Called with the inode lock held.
+ * find_inode: look up an inode in the hash
+ *
+ * We can be called without any locks held (@locked == false) and so in this
+ * case we are relying on RCU to protect the chain traversal. As we are using
+ * SLAB_DESTROY_BY_RCU, there are several things we have to do to ensure that
+ * the inode is valid.
+ *
+ * Firstly, we do an unlocked check on the superblock and @test() so we avoid
+ * obvious mismatches. If we get a match, we then need to lock the inode and
+ * check the inode has not been freed. We do this by checking that it is still
+ * hashed and that it is not in the freeing state under the inode->i_lock. By
+ * holding the i_lock we ensure that the inode cannot change state (i.e. be
+ * removed or added to the hash) so we can safely determine what to do with the
+ * inode.
+ *
+ * Once we have validated that it is an active inode, we need to recheck the
+ * inode matches what we are searching for, this time under the inode->i_lock.
+ * If it is a match, then we can take a reference to it, drop all locks and
+ * return it.
+ *
+ * The final wrinkle is that during traversal we can jump hash chains by
+ * following an inode that has been freed and reallocated. Hence on a cache
+ * miss on an unlocked lookup, we need to check that the last inode we checked
+ * is on the same chain as the one we started on. If we jumped chains, restart
+ * the search all over again.
  */
 static struct inode *find_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_nulls_head *head, int chain,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
-	struct hlist_node *node;
+	struct hlist_nulls_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_nulls_for_each_entry_rcu(inode, node, head, i_hash) {
+		/* toss obvious mismatches */
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
+
+		/* ensure the inode is active */
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
+
+		/* recheck that it is the inode we are looking for */
+		if (inode->i_sb != sb || !test(inode, data)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
+
+	/* cache miss - check if we jumped a chain */
+	if (get_nulls_value(node) != chain)
+		goto repeat;
 	return NULL;
 }
 
@@ -770,26 +832,48 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_nulls_head *head, int chain,
+				unsigned long ino, bool locked)
 {
-	struct hlist_node *node;
+	struct hlist_nulls_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_nulls_for_each_entry_rcu(inode, node, head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
+		/* recheck that it is the inode we are looking for */
+		if (inode->i_sb != sb || inode->i_ino != ino) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
+
+	/* cache miss - check if we jumped a chain */
+	if (get_nulls_value(node) != chain)
+		goto repeat;
 	return NULL;
 }
 
@@ -901,7 +985,7 @@ EXPORT_SYMBOL(unlock_new_inode);
  *	-- rmk@arm.uk.linux.org
  */
 static struct inode *get_new_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_nulls_head *head, int chain,
 				int (*test)(struct inode *, void *),
 				int (*set)(struct inode *, void *),
 				void *data)
@@ -914,14 +998,14 @@ static struct inode *get_new_inode(struct super_block *sb,
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = find_inode(sb, head, chain, test, data, true);
 		if (!old) {
 			if (set(inode, data))
 				goto set_failed;
 
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_nulls_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
 			spin_unlock(&inode_hash_lock);
@@ -955,7 +1039,8 @@ set_failed:
  * comment at iget_locked for details.
  */
 static struct inode *get_new_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_nulls_head *head, int chain,
+				unsigned long ino)
 {
 	struct inode *inode;
 
@@ -965,12 +1050,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, head, chain, ino, true);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_nulls_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
 			spin_unlock(&inode_hash_lock);
@@ -1003,19 +1088,32 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *b = inode_hashtable + hash(sb, ino);
-	struct hlist_node *node;
+	int chain = hash(sb, ino);
+	struct hlist_nulls_head *b = inode_hashtable + chain;
+	struct hlist_nulls_node *node;
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, node, b, i_hash) {
-		if (inode->i_ino == ino && inode->i_sb == sb) {
-			spin_unlock(&inode_hash_lock);
-			return 0;
+repeat:
+	rcu_read_lock();
+	hlist_nulls_for_each_entry_rcu(inode, node, b, i_hash) {
+		if (inode->i_ino != ino)
+			continue;
+		if (inode->i_sb != sb)
+			continue;
+		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
 		}
+		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
+		return 0;
 	}
-	spin_unlock(&inode_hash_lock);
+	rcu_read_unlock();
 
+	/* cache miss - check if we jumped a chain */
+	if (get_nulls_value(node) != chain)
+		goto repeat;
 	return 1;
 }
 
@@ -1095,20 +1193,18 @@ EXPORT_SYMBOL(igrab);
  * Note, @test is called with the inode_lock held, so can't sleep.
  */
 static struct inode *ifind(struct super_block *sb,
-		struct hlist_head *head, int (*test)(struct inode *, void *),
+		struct hlist_nulls_head *head, int chain,
+		int (*test)(struct inode *, void *),
 		void *data, const int wait)
 {
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
+	inode = find_inode(sb, head, chain, test, data, false);
 	if (inode) {
-		spin_unlock(&inode_hash_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_hash_lock);
 	return NULL;
 }
 
@@ -1128,18 +1224,15 @@ static struct inode *ifind(struct super_block *sb,
  * Otherwise NULL is returned.
  */
 static struct inode *ifind_fast(struct super_block *sb,
-		struct hlist_head *head, unsigned long ino)
+		struct hlist_nulls_head *head, int chain, unsigned long ino)
 {
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
+	inode = find_inode_fast(sb, head, chain, ino, false);
 	if (inode) {
-		spin_unlock(&inode_hash_lock);
 		wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_hash_lock);
 	return NULL;
 }
 
@@ -1167,9 +1260,10 @@ static struct inode *ifind_fast(struct super_block *sb,
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	int chain = hash(sb, hashval);
+	struct hlist_nulls_head *head = inode_hashtable + chain;
 
-	return ifind(sb, head, test, data, 0);
+	return ifind(sb, head, chain, test, data, 0);
 }
 EXPORT_SYMBOL(ilookup5_nowait);
 
@@ -1195,9 +1289,10 @@ EXPORT_SYMBOL(ilookup5_nowait);
 struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	int chain = hash(sb, hashval);
+	struct hlist_nulls_head *head = inode_hashtable + chain;
 
-	return ifind(sb, head, test, data, 1);
+	return ifind(sb, head, chain, test, data, 1);
 }
 EXPORT_SYMBOL(ilookup5);
 
@@ -1217,9 +1312,10 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	int chain = hash(sb, ino);
+	struct hlist_nulls_head *head = inode_hashtable + chain;
 
-	return ifind_fast(sb, head, ino);
+	return ifind_fast(sb, head, chain, ino);
 }
 EXPORT_SYMBOL(ilookup);
 
@@ -1247,17 +1343,18 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *),
 		int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	int chain = hash(sb, hashval);
+	struct hlist_nulls_head *head = inode_hashtable + chain;
 	struct inode *inode;
 
-	inode = ifind(sb, head, test, data, 1);
+	inode = ifind(sb, head, chain, test, data, 1);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode(sb, head, test, set, data);
+	return get_new_inode(sb, head, chain, test, set, data);
 }
 EXPORT_SYMBOL(iget5_locked);
 
@@ -1278,17 +1375,18 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	int chain = hash(sb, ino);
+	struct hlist_nulls_head *head = inode_hashtable + chain;
 	struct inode *inode;
 
-	inode = ifind_fast(sb, head, ino);
+	inode = ifind_fast(sb, head, chain, ino);
 	if (inode)
 		return inode;
 	/*
 	 * get_new_inode_fast() will do the right thing, re-trying the search
 	 * in case it had to block at any point.
 	 */
-	return get_new_inode_fast(sb, head, ino);
+	return get_new_inode_fast(sb, head, chain, ino);
 }
 EXPORT_SYMBOL(iget_locked);
 
@@ -1296,13 +1394,13 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_nulls_head *head = inode_hashtable + hash(sb, ino);
 
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_nulls_node *node;
 		struct inode *old = NULL;
 		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_nulls_for_each_entry_rcu(old, node, head, i_hash) {
 			if (old->i_ino != ino)
 				continue;
 			if (old->i_sb != sb)
@@ -1314,10 +1412,10 @@ int insert_inode_locked(struct inode *inode)
 			}
 			break;
 		}
-		if (likely(!node)) {
+		if (likely(is_a_nulls(node))) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_nulls_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
@@ -1339,14 +1437,14 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
 	struct super_block *sb = inode->i_sb;
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_nulls_head *head = inode_hashtable + hash(sb, hashval);
 
 	while (1) {
-		struct hlist_node *node;
+		struct hlist_nulls_node *node;
 		struct inode *old = NULL;
 
 		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_nulls_for_each_entry_rcu(old, node, head, i_hash) {
 			if (old->i_sb != sb)
 				continue;
 			if (!test(old, data))
@@ -1358,10 +1456,10 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 			}
 			break;
 		}
-		if (likely(!node)) {
+		if (likely(is_a_nulls(node))) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_nulls_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
@@ -1647,10 +1745,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_hash_lock);
 }
 
 static __initdata unsigned long ihash_entries;
@@ -1678,7 +1774,7 @@ void __init inode_init_early(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_nulls_head),
 					ihash_entries,
 					14,
 					HASH_EARLY,
@@ -1687,7 +1783,7 @@ void __init inode_init_early(void)
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
+		INIT_HLIST_NULLS_HEAD(&inode_hashtable[loop], loop);
 }
 
 void __init inode_init(void)
@@ -1709,7 +1805,7 @@ void __init inode_init(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_nulls_head),
 					ihash_entries,
 					14,
 					0,
@@ -1718,7 +1814,7 @@ void __init inode_init(void)
 					0);
 
 	for (loop = 0; loop < (1 << i_hash_shift); loop++)
-		INIT_HLIST_HEAD(&inode_hashtable[loop]);
+		INIT_HLIST_NULLS_HEAD(&inode_hashtable[loop], loop);
 }
 
 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
diff --git a/fs/jfs/jfs_imap.c b/fs/jfs/jfs_imap.c
index 3a09423..5586fc1 100644
--- a/fs/jfs/jfs_imap.c
+++ b/fs/jfs/jfs_imap.c
@@ -497,7 +497,7 @@ struct inode *diReadSpecial(struct super_block *sb, ino_t inum, int secondary)
 	 * appear hashed, but do not put on any lists.  hlist_del()
 	 * will work fine and require no locking.
 	 */
-	hlist_add_fake(&ip->i_hash);
+	hlist_nulls_add_fake(&ip->i_hash);
 
 	return (ip);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0327e1f..d99fc969 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -384,6 +384,7 @@ struct inodes_stat_t {
 #include <linux/cache.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
+#include <linux/rculist_nulls.h>
 #include <linux/radix-tree.h>
 #include <linux/prio_tree.h>
 #include <linux/init.h>
@@ -732,7 +733,7 @@ struct posix_acl;
 #define ACL_NOT_CACHED ((void *)(-1))
 
 struct inode {
-	struct hlist_node	i_hash;
+	struct hlist_nulls_node	i_hash;
 	struct list_head	i_wb_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
@@ -803,7 +804,7 @@ struct inode {
 
 static inline int inode_unhashed(struct inode *inode)
 {
-	return hlist_unhashed(&inode->i_hash);
+	return hlist_nulls_unhashed(&inode->i_hash);
 }
 
 /*
diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
index 5d10ae36..29762d1 100644
--- a/include/linux/list_nulls.h
+++ b/include/linux/list_nulls.h
@@ -83,6 +83,13 @@ static inline void hlist_nulls_del(struct hlist_nulls_node *n)
 	n->pprev = LIST_POISON2;
 }
 
+/* after that we'll appear to be on some hlist and hlist_nulls_del will work */
+static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
+{
+	n->next = (struct hlist_nulls_node *)1UL;
+	n->pprev = &n->next;
+}
+
 /**
  * hlist_nulls_for_each_entry	- iterate over list of given type
  * @tpos:	the type * to use as a loop cursor.
-- 
1.7.2.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 4/4] fs: rcu protect inode hash lookups
  2010-11-08  7:28 ` [PATCH 4/4] fs: rcu protect inode hash lookups Dave Chinner
@ 2010-11-08  8:53   ` Eric Dumazet
  2010-11-08 11:10   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2010-11-08  8:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel, paulmck

On Monday 08 November 2010 at 18:28 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that inodes are using RCU freeing, we can walk the hash lists
> using RCU protection during lookups. Convert all the hash list
> operations to use RCU-based operators and drop the inode_hash_lock
> around pure lookup operations.
> 
> Because we are using SLAB_DESTROY_BY_RCU, we need to be careful
> about the lockless lookups and the way we initialise the struct
> inode during allocation. The inode->i_lock is the critical item we
> need to avoid touching when allocating an inode, so move the
> initialisation of this lock to the slab constructor so that we can
> reliably use the lock even after the inode has been freed or is in
> the process of being allocated or initialised.
> 
> We also need to change the order of checks during hash lookups to
> validate we have an active inode before we skip it. That means we
> need to revalidate the inode number and superblock after we have
> checked it for being freed or unhashed under the inode->i_lock.
> 
> Finally, during traversals an inode can be freed and reallocated and
> linked into a new hash chain, so an RCU safe traversal can jump to a
> different chain without indication. On a cache miss we need to
> validate that we ended the search on the same hash chain we started
> on. Do this by converting the hash lists to the hlist_nulls list type
> and checking the null value on walk termination.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Very nice stuff, Dave!

I have one question

> diff --git a/include/linux/list_nulls.h b/include/linux/list_nulls.h
> index 5d10ae36..29762d1 100644
> --- a/include/linux/list_nulls.h
> +++ b/include/linux/list_nulls.h
> @@ -83,6 +83,13 @@ static inline void hlist_nulls_del(struct hlist_nulls_node *n)
>  	n->pprev = LIST_POISON2;
>  }
>  
> +/* after that we'll appear to be on some hlist and hlist_nulls_del will work */
> +static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
> +{
> +	n->next = (struct hlist_nulls_node *)1UL;
> +	n->pprev = &n->next;
> +}
> +


I am not sure why you need to change n->next here?

This might restart lookups (making them do a 'goto repeat'), and
possibly abort a lookup on slot 0 of the hash table, since this is the
'nulls' value for slot 0.

So at least I suggest adding a comment, and possibly using a different
slot number (outside the slot range). If so, change the documentation to
restrict hash tables to 2^(BITS_PER_LONG-2) possible slots.

n->next = (struct hlist_nulls_node *)(1UL | (1UL << (BITS_PER_LONG - 1)));
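[Editorial sketch] Eric's concern can be demonstrated numerically: under the nulls encoding (bit 0 marks a terminator, value = pointer >> 1), a fake next pointer of 1UL decodes to nulls value 0, which is exactly the terminator of hash chain 0, while a marker with a high bit set cannot collide with any in-range chain number. The helper names below are simplified stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Terminator test and value extraction, mirroring the nulls encoding:
 * bit 0 set means "this is a terminator", upper bits hold the value. */
static int is_a_nulls_val(uintptr_t p)
{
	return (p & 1UL) != 0;
}

static uintptr_t nulls_value(uintptr_t p)
{
	return p >> 1;
}

/* What hlist_nulls_add_fake() stores in n->next ... */
static const uintptr_t fake_next = 1UL;

/* ... versus a marker with the top bit set, which decodes to a value
 * outside any realistic hash-table slot range. */
static const uintptr_t out_of_range_marker =
	1UL | (1UL << (sizeof(uintptr_t) * 8 - 1));
```

Both values look like terminators, but only the first one is indistinguishable from chain 0's null, which is why a walker on slot 0 could mistake a fake-hashed inode's next pointer for a legitimate end-of-chain.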




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 4/4] fs: rcu protect inode hash lookups
  2010-11-08  7:28 ` [PATCH 4/4] fs: rcu protect inode hash lookups Dave Chinner
  2010-11-08  8:53   ` Eric Dumazet
@ 2010-11-08 11:10   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2010-11-08 11:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel, eric.dumazet, paulmck

> -	hlist_add_fake(&inode->i_hash);
> +	hlist_nulls_add_fake(&inode->i_hash);

Please add a preparatory inode_fake_hash/inode_mark_hashed or similar
helper to isolate filesystems from the implementation details of the
hash list.

> +	/*
> +	 * reset the inode number so during RCU traversals we do not match this
> +	 * inode in any lookups until it is fully re-initialised again during
> +	 * allocation.
> +	 */
> +	inode->i_ino = 0;

There is no hard rule that 0 is an invalid inode number.  It can
happen quite easily for inodes using the generic last_ino allocator,
and I would not be surprised if there are some filesystems using it as
part of the on-disk layout either.

> +			rcu_read_unlock();
> +			if (locked)
> +				spin_unlock(&inode_hash_lock);
>  			__wait_on_freeing_inode(inode);
> +			if (locked)
> +				spin_lock(&inode_hash_lock);

I can't say I like the locked argument, but I don't see an easy way
around it.  Can you at least keep the unlocking/relocking inside
__wait_on_freeing_inode so that it's centralized in a single place for
both find_inode paths?

While at it, moving __wait_on_freeing_inode above ifind would make
changes in this area a lot easier to read, so maybe you can throw in a
patch for that, too?

>  static struct inode *ifind(struct super_block *sb,
> -		struct hlist_head *head, int (*test)(struct inode *, void *),
> +		struct hlist_nulls_head *head, int chain,
> +		int (*test)(struct inode *, void *),
>  		void *data, const int wait)
>  {
>  	struct inode *inode;
>  
> -	spin_lock(&inode_hash_lock);
> -	inode = find_inode(sb, head, test, data);
> +	inode = find_inode(sb, head, chain, test, data, false);
>  	if (inode) {
> -		spin_unlock(&inode_hash_lock);
>  		if (likely(wait))
>  			wait_on_inode(inode);
>  		return inode;
>  	}
> -	spin_unlock(&inode_hash_lock);
>  	return NULL;
>  }

This is starting to get a rather pointless helper.  I'd suggest just
killing ifind/ifind_fast and opencoding them in the caller, possibly
as a preparatory patch.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/4] fs: Convert inode slab creation to use consistent parameters
  2010-11-08  7:28 ` [PATCH 2/4] fs: Convert inode slab creation to use consistent parameters Dave Chinner
@ 2011-04-25 17:46   ` Christoph Hellwig
  0 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2011-04-25 17:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel, eric.dumazet, paulmck

On Mon, Nov 08, 2010 at 06:28:23PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode slab creation functions in different filesystems don't all set up the
> slab caches correctly. Add a new flag SLAB_INODE_CACHE that sets all the
> necessary flags for an inode slab cache and convert all the callers to use it.
> This makes it easy to change the behaviour of all the inode slab caches in one
> go.

Any chance you could rediff this one ontop of current mainline and
resend it?


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2011-04-25 17:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-08  7:28 fs: inode rcu freeing and hash walking V2 Dave Chinner
2010-11-08  7:28 ` [PATCH 1/4] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
2010-11-08  7:28 ` [PATCH 2/4] fs: Convert inode slab creation to use consistent parameters Dave Chinner
2011-04-25 17:46   ` Christoph Hellwig
2010-11-08  7:28 ` [PATCH 3/4] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
2010-11-08  7:28 ` [PATCH 4/4] fs: rcu protect inode hash lookups Dave Chinner
2010-11-08  8:53   ` Eric Dumazet
2010-11-08 11:10   ` Christoph Hellwig
