* fs: inode freeing and hash lookup via RCU
@ 2010-11-01  5:33 Dave Chinner
  2010-11-01  5:33 ` [PATCH 1/3] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Dave Chinner @ 2010-11-01  5:33 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel

Hi Al,

The following three patches implement the extra functionality you
wanted on top of the inode lock breakup. The first patch lifts the
i_lock up out of writeback_single_inode(), the second implements RCU
freeing of inodes via SLAB_DESTROY_BY_RCU, and the third converts
inode hash lookup operations to use RCU list walks. Comments are
welcome.

Cheers,

Dave.


* [PATCH 1/3] fs: pull inode->i_lock up out of writeback_single_inode
  2010-11-01  5:33 fs: inode freeing and hash lookup via RCU Dave Chinner
@ 2010-11-01  5:33 ` Dave Chinner
  2010-11-01  5:33 ` [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
  2010-11-01  5:33 ` [PATCH 3/3] fs: rcu protect inode hash lookups Dave Chinner
  2 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2010-11-01  5:33 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

The first thing we do in writeback_single_inode() is take the i_lock,
and the last thing we do is drop it. A caller already holds the i_lock,
so pull the i_lock up out of writeback_single_inode() to reduce the
number of round trips on this lock during inode writeback.
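
In effect the locking at the call sites goes from:

	spin_lock(&inode_wb_list_lock);
	writeback_single_inode(inode, wbc);	/* takes/drops i_lock itself */
	spin_unlock(&inode_wb_list_lock);

to the pattern below (a simplified sketch of the write_inode_now() hunk):

	spin_lock(&inode_wb_list_lock);
	spin_lock(&inode->i_lock);
	writeback_single_inode(inode, wbc);
	spin_unlock(&inode->i_lock);
	spin_unlock(&inode_wb_list_lock);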

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/fs-writeback.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f53486f..bd199f4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -318,9 +318,9 @@ static void inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Write out an inode's dirty pages.  Called under inode_wb_list_lock.  Either
- * the caller has an active reference on the inode or the inode has I_WILL_FREE
- * set.
+ * Write out an inode's dirty pages.  Called under inode_wb_list_lock and
+ * inode->i_lock.  Either the caller has an active reference on the inode or
+ * the inode has I_WILL_FREE set.
  *
  * If `wait' is set, wait on the writeout.
  *
@@ -335,7 +335,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	unsigned dirty;
 	int ret;
 
-	spin_lock(&inode->i_lock);
 	if (!atomic_read(&inode->i_count))
 		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
 	else
@@ -351,7 +350,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		 * completed a full scan of b_io.
 		 */
 		if (wbc->sync_mode != WB_SYNC_ALL) {
-			spin_unlock(&inode->i_lock);
 			requeue_io(inode);
 			return 0;
 		}
@@ -442,7 +440,6 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 		}
 	}
 	inode_sync_complete(inode);
-	spin_unlock(&inode->i_lock);
 	return ret;
 }
 
@@ -530,7 +527,6 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 		}
 
 		__iget(inode);
-		spin_unlock(&inode->i_lock);
 
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
@@ -541,6 +537,7 @@ static int writeback_sb_inodes(struct super_block *sb, struct bdi_writeback *wb,
 			 */
 			redirty_tail(inode);
 		}
+		spin_unlock(&inode->i_lock);
 		spin_unlock(&inode_wb_list_lock);
 		iput(inode);
 		cond_resched();
@@ -1208,7 +1205,9 @@ int write_inode_now(struct inode *inode, int sync)
 
 	might_sleep();
 	spin_lock(&inode_wb_list_lock);
+	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, &wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_wb_list_lock);
 	if (sync)
 		inode_sync_wait(inode);
@@ -1232,7 +1231,9 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
 	int ret;
 
 	spin_lock(&inode_wb_list_lock);
+	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, wbc);
+	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_wb_list_lock);
 	return ret;
 }
-- 
1.7.1



* [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU
  2010-11-01  5:33 fs: inode freeing and hash lookup via RCU Dave Chinner
  2010-11-01  5:33 ` [PATCH 1/3] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
@ 2010-11-01  5:33 ` Dave Chinner
  2010-11-01 15:31   ` Christoph Hellwig
  2010-11-01  5:33 ` [PATCH 3/3] fs: rcu protect inode hash lookups Dave Chinner
  2 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2010-11-01  5:33 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Change the inode caches to use RCU-freed inodes via the SLAB_DESTROY_BY_RCU
method. While touching the slab creation functions for the inode caches, fix
all of the slabs to use consistent flags by defining SLAB_INODE_CACHE as the
set of common inode cache flags. This means no one in future should forget to
set the SLAB_DESTROY_BY_RCU flag on an inode cache.
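
The resulting flag set and a typical conversion look like this ("foo" is
a placeholder filesystem; the real hunks are below):

	#define SLAB_INODE_CACHE	(SLAB_MEM_SPREAD | \
					 SLAB_DESTROY_BY_RCU | \
					 SLAB_RECLAIM_ACCOUNT)

	foo_inode_cachep = kmem_cache_create("foo_inode_cache",
					     sizeof(struct foo_inode_info),
					     0, SLAB_INODE_CACHE,
					     init_once);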

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 Documentation/filesystems/porting |    5 +++++
 fs/adfs/super.c                   |    3 +--
 fs/affs/super.c                   |    3 +--
 fs/afs/super.c                    |    7 +++----
 fs/befs/linuxvfs.c                |    3 +--
 fs/bfs/inode.c                    |    3 +--
 fs/btrfs/inode.c                  |    2 +-
 fs/ceph/super.c                   |    3 +--
 fs/cifs/cifsfs.c                  |    3 +--
 fs/coda/inode.c                   |    3 +--
 fs/efs/super.c                    |    3 +--
 fs/exofs/super.c                  |    3 +--
 fs/ext2/super.c                   |    3 +--
 fs/ext3/super.c                   |    3 +--
 fs/ext4/super.c                   |    3 +--
 fs/fat/inode.c                    |    3 +--
 fs/freevxfs/vxfs_super.c          |    2 +-
 fs/fuse/inode.c                   |    6 +++---
 fs/gfs2/main.c                    |    3 +--
 fs/hfs/super.c                    |    3 ++-
 fs/hfsplus/super.c                |    3 ++-
 fs/hpfs/super.c                   |    3 +--
 fs/hugetlbfs/inode.c              |    2 +-
 fs/inode.c                        |    4 +---
 fs/isofs/inode.c                  |    4 +---
 fs/jffs2/super.c                  |    3 +--
 fs/jfs/super.c                    |    3 +--
 fs/logfs/inode.c                  |    2 +-
 fs/minix/inode.c                  |    3 +--
 fs/ncpfs/inode.c                  |    3 +--
 fs/nfs/inode.c                    |    3 +--
 fs/nilfs2/super.c                 |    4 ++--
 fs/ntfs/super.c                   |    3 ++-
 fs/ocfs2/dlmfs/dlmfs.c            |    3 +--
 fs/ocfs2/super.c                  |    4 +---
 fs/openpromfs/inode.c             |    4 +---
 fs/proc/inode.c                   |    3 +--
 fs/qnx4/inode.c                   |    3 +--
 fs/reiserfs/super.c               |    7 ++-----
 fs/romfs/super.c                  |    3 +--
 fs/smbfs/inode.c                  |    3 +--
 fs/squashfs/super.c               |    3 ++-
 fs/sysv/inode.c                   |    3 +--
 fs/ubifs/super.c                  |    2 +-
 fs/udf/super.c                    |    3 +--
 fs/ufs/super.c                    |    3 +--
 fs/xfs/linux-2.6/kmem.h           |    1 +
 fs/xfs/linux-2.6/xfs_super.c      |    4 ++--
 include/linux/slab.h              |    7 +++++++
 ipc/mqueue.c                      |    3 ++-
 mm/shmem.c                        |    2 +-
 net/socket.c                      |    9 +++------
 net/sunrpc/rpc_pipe.c             |    5 ++---
 53 files changed, 78 insertions(+), 104 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 2da4b44..b68438b 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -319,3 +319,8 @@ if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput(
 may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly
 free the on-disk inode, you may end up doing that while ->write_inode() is writing
 to it.
+
+[mandatory]
+	Inodes must be allocated via a slab cache created with the
+SLAB_INODE_CACHE flag set. This sets all the necessary slab cache flags for
+correct operation and control of the cache across the system.
diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index d9803f7..f2e72f7 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -256,8 +256,7 @@ static int init_inodecache(void)
 {
 	adfs_inode_cachep = kmem_cache_create("adfs_inode_cache",
 					     sizeof(struct adfs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (adfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/affs/super.c b/fs/affs/super.c
index fa4fbe1..59d5bff 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -113,8 +113,7 @@ static int init_inodecache(void)
 {
 	affs_inode_cachep = kmem_cache_create("affs_inode_cache",
 					     sizeof(struct affs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (affs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/afs/super.c b/fs/afs/super.c
index eacf76d..ad19614 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -88,10 +88,9 @@ int __init afs_fs_init(void)
 
 	ret = -ENOMEM;
 	afs_inode_cachep = kmem_cache_create("afs_inode_cache",
-					     sizeof(struct afs_vnode),
-					     0,
-					     SLAB_HWCACHE_ALIGN,
-					     afs_i_init_once);
+					sizeof(struct afs_vnode), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+					afs_i_init_once);
 	if (!afs_inode_cachep) {
 		printk(KERN_NOTICE "kAFS: Failed to allocate inode cache\n");
 		return ret;
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index dc39d28..a9086aa 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -431,8 +431,7 @@ befs_init_inodecache(void)
 {
 	befs_inode_cachep = kmem_cache_create("befs_inode_cache",
 					      sizeof (struct befs_inode_info),
-					      0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					      0, SLAB_INODE_CACHE,
 					      init_once);
 	if (befs_inode_cachep == NULL) {
 		printk(KERN_ERR "befs_init_inodecache: "
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 883e77a..8ade512 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -264,8 +264,7 @@ static int init_inodecache(void)
 {
 	bfs_inode_cachep = kmem_cache_create("bfs_inode_cache",
 					     sizeof(struct bfs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (bfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 64f99cf..80d0a64 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6369,7 +6369,7 @@ int btrfs_init_cachep(void)
 {
 	btrfs_inode_cachep = kmem_cache_create("btrfs_inode_cache",
 			sizeof(struct btrfs_inode), 0,
-			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, init_once);
+			SLAB_INODE_CACHE, init_once);
 	if (!btrfs_inode_cachep)
 		goto fail;
 
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index d6e0e04..d88da41 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -521,8 +521,7 @@ static int __init init_caches(void)
 	ceph_inode_cachep = kmem_cache_create("ceph_inode_info",
 				      sizeof(struct ceph_inode_info),
 				      __alignof__(struct ceph_inode_info),
-				      (SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD),
-				      ceph_inode_init_once);
+				      SLAB_INODE_CACHE, ceph_inode_init_once);
 	if (ceph_inode_cachep == NULL)
 		return -ENOMEM;
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 3437163..91e2648 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -785,8 +785,7 @@ cifs_init_inodecache(void)
 {
 	cifs_inode_cachep = kmem_cache_create("cifs_inode_cache",
 					      sizeof(struct cifsInodeInfo),
-					      0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					      0, SLAB_INODE_CACHE,
 					      cifs_init_once);
 	if (cifs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 7993b96..957b93f 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -72,8 +72,7 @@ int coda_init_inodecache(void)
 {
 	coda_inode_cachep = kmem_cache_create("coda_inode_cache",
 				sizeof(struct coda_inode_info),
-				0, SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
-				init_once);
+				0, SLAB_INODE_CACHE, init_once);
 	if (coda_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/efs/super.c b/fs/efs/super.c
index f049428..97de966 100644
--- a/fs/efs/super.c
+++ b/fs/efs/super.c
@@ -81,8 +81,7 @@ static int init_inodecache(void)
 {
 	efs_inode_cachep = kmem_cache_create("efs_inode_cache",
 				sizeof(struct efs_inode_info),
-				0, SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
-				init_once);
+				0, SLAB_INODE_CACHE, init_once);
 	if (efs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/exofs/super.c b/fs/exofs/super.c
index 047e92f..3f3c2db 100644
--- a/fs/exofs/super.c
+++ b/fs/exofs/super.c
@@ -175,8 +175,7 @@ static int init_inodecache(void)
 {
 	exofs_inode_cachep = kmem_cache_create("exofs_inode_cache",
 				sizeof(struct exofs_i_info), 0,
-				SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
-				exofs_init_once);
+				SLAB_INODE_CACHE, exofs_init_once);
 	if (exofs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0901320..39c8705 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -182,8 +182,7 @@ static int init_inodecache(void)
 {
 	ext2_inode_cachep = kmem_cache_create("ext2_inode_cache",
 					     sizeof(struct ext2_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ext2_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 3777680..d4bc896 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -509,8 +509,7 @@ static int init_inodecache(void)
 {
 	ext3_inode_cachep = kmem_cache_create("ext3_inode_cache",
 					     sizeof(struct ext3_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ext3_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8ecc1e5..1f18577 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -852,8 +852,7 @@ static int init_inodecache(void)
 {
 	ext4_inode_cachep = kmem_cache_create("ext4_inode_cache",
 					     sizeof(struct ext4_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ext4_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index ad6998a..84d8cde 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -535,8 +535,7 @@ static int __init fat_init_inodecache(void)
 {
 	fat_inode_cachep = kmem_cache_create("fat_inode_cache",
 					     sizeof(struct msdos_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (fat_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/freevxfs/vxfs_super.c b/fs/freevxfs/vxfs_super.c
index 71b0148..ac65637 100644
--- a/fs/freevxfs/vxfs_super.c
+++ b/fs/freevxfs/vxfs_super.c
@@ -268,7 +268,7 @@ vxfs_init(void)
 
 	vxfs_inode_cachep = kmem_cache_create("vxfs_inode",
 			sizeof(struct vxfs_inode_info), 0,
-			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD, NULL);
+			SLAB_INODE_CACHE, NULL);
 	if (!vxfs_inode_cachep)
 		return -ENOMEM;
 	rv = register_filesystem(&vxfs_fs_type);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index da9e6e1..9f62e16 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1139,9 +1139,9 @@ static int __init fuse_fs_init(void)
 		goto out_unreg;
 
 	fuse_inode_cachep = kmem_cache_create("fuse_inode",
-					      sizeof(struct fuse_inode),
-					      0, SLAB_HWCACHE_ALIGN,
-					      fuse_inode_init_once);
+					sizeof(struct fuse_inode), 0,
+					SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+					fuse_inode_init_once);
 	err = -ENOMEM;
 	if (!fuse_inode_cachep)
 		goto out_unreg2;
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index ebef7ab..3100429 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -108,8 +108,7 @@ static int __init init_gfs2_fs(void)
 
 	gfs2_inode_cachep = kmem_cache_create("gfs2_inode",
 					      sizeof(struct gfs2_inode),
-					      0,  SLAB_RECLAIM_ACCOUNT|
-					          SLAB_MEM_SPREAD,
+					      0, SLAB_INODE_CACHE,
 					      gfs2_init_inode_once);
 	if (!gfs2_inode_cachep)
 		goto fail;
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 6ee1586..d5decb8 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -468,7 +468,8 @@ static int __init init_hfs_fs(void)
 	int err;
 
 	hfs_inode_cachep = kmem_cache_create("hfs_inode_cache",
-		sizeof(struct hfs_inode_info), 0, SLAB_HWCACHE_ALIGN,
+		sizeof(struct hfs_inode_info), 0,
+		SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
 		hfs_init_once);
 	if (!hfs_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 9a88d75..6df834c 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -523,7 +523,8 @@ static int __init init_hfsplus_fs(void)
 	int err;
 
 	hfsplus_inode_cachep = kmem_cache_create("hfsplus_icache",
-		HFSPLUS_INODE_SIZE, 0, SLAB_HWCACHE_ALIGN,
+		HFSPLUS_INODE_SIZE, 0,
+		SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
 		hfsplus_init_once);
 	if (!hfsplus_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/hpfs/super.c b/fs/hpfs/super.c
index c969a1a..8867e7b 100644
--- a/fs/hpfs/super.c
+++ b/fs/hpfs/super.c
@@ -195,8 +195,7 @@ static int init_inodecache(void)
 {
 	hpfs_inode_cachep = kmem_cache_create("hpfs_inode_cache",
 					     sizeof(struct hpfs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (hpfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8d0607b..cf93718 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -981,7 +981,7 @@ static int __init init_hugetlbfs_fs(void)
 
 	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
 					sizeof(struct hugetlbfs_inode_info),
-					0, 0, init_once);
+					0, SLAB_INODE_CACHE, init_once);
 	if (hugetlbfs_inode_cachep == NULL)
 		goto out2;
 
diff --git a/fs/inode.c b/fs/inode.c
index c5baf18..106ec7a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1696,9 +1696,7 @@ void __init inode_init(void)
 	/* inode slab cache */
 	inode_cachep = kmem_cache_create("inode_cache",
 					 sizeof(struct inode),
-					 0,
-					 (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
-					 SLAB_MEM_SPREAD),
+					 0, SLAB_INODE_CACHE | SLAB_PANIC,
 					 init_once);
 	register_shrinker(&icache_shrinker);
 	percpu_counter_init(&nr_inodes, 0);
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 60c2b94..df61577 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -81,9 +81,7 @@ static int init_inodecache(void)
 {
 	isofs_inode_cachep = kmem_cache_create("isofs_inode_cache",
 					sizeof(struct iso_inode_info),
-					0, (SLAB_RECLAIM_ACCOUNT|
-					SLAB_MEM_SPREAD),
-					init_once);
+					0, SLAB_INODE_CACHE, init_once);
 	if (isofs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index d1ae5df..7bfa0eb 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -260,8 +260,7 @@ static int __init init_jffs2_fs(void)
 
 	jffs2_inode_cachep = kmem_cache_create("jffs2_i",
 					     sizeof(struct jffs2_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     jffs2_i_init_once);
 	if (!jffs2_inode_cachep) {
 		printk(KERN_ERR "JFFS2 error: Failed to initialise inode cache\n");
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 68eee2b..e27fda8 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -796,8 +796,7 @@ static int __init init_jfs_fs(void)
 
 	jfs_inode_cachep =
 	    kmem_cache_create("jfs_ip", sizeof(struct jfs_inode_info), 0,
-			    SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
-			    init_once);
+					     SLAB_INODE_CACHE, init_once);
 	if (jfs_inode_cachep == NULL)
 		return -ENOMEM;
 
diff --git a/fs/logfs/inode.c b/fs/logfs/inode.c
index 38c559e..ab812c2 100644
--- a/fs/logfs/inode.c
+++ b/fs/logfs/inode.c
@@ -385,7 +385,7 @@ const struct super_operations logfs_super_operations = {
 int logfs_init_inode_cache(void)
 {
 	logfs_inode_cache = kmem_cache_create("logfs_inode_cache",
-			sizeof(struct logfs_inode), 0, SLAB_RECLAIM_ACCOUNT,
+			sizeof(struct logfs_inode), 0, SLAB_INODE_CACHE,
 			logfs_init_once);
 	if (!logfs_inode_cache)
 		return -ENOMEM;
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index e39d6bf..614ddb6 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -84,8 +84,7 @@ static int init_inodecache(void)
 {
 	minix_inode_cachep = kmem_cache_create("minix_inode_cache",
 					     sizeof(struct minix_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (minix_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index 985fabb..af0ec48 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -76,8 +76,7 @@ static int init_inodecache(void)
 {
 	ncp_inode_cachep = kmem_cache_create("ncp_inode_cache",
 					     sizeof(struct ncp_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ncp_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 7d2d6c7..79f4327 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -1470,8 +1470,7 @@ static int __init nfs_init_inodecache(void)
 {
 	nfs_inode_cachep = kmem_cache_create("nfs_inode_cache",
 					     sizeof(struct nfs_inode),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (nfs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index 35ae03c..2ba8b9d 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -1289,8 +1289,8 @@ static void nilfs_destroy_cachep(void)
 static int __init nilfs_init_cachep(void)
 {
 	nilfs_inode_cachep = kmem_cache_create("nilfs2_inode_cache",
-			sizeof(struct nilfs_inode_info), 0,
-			SLAB_RECLAIM_ACCOUNT, nilfs_inode_init_once);
+			sizeof(struct nilfs_inode_info), 0, SLAB_INODE_CACHE,
+			nilfs_inode_init_once);
 	if (!nilfs_inode_cachep)
 		goto fail;
 
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index d3fbe57..dd835f5 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -3137,9 +3137,10 @@ static int __init init_ntfs_fs(void)
 		goto inode_err_out;
 	}
 
+	/* ntfs_big_inode_cache is the inode cache used for VFS level inodes */
 	ntfs_big_inode_cache = kmem_cache_create(ntfs_big_inode_cache_name,
 			sizeof(big_ntfs_inode), 0,
-			SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
+			SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
 			ntfs_big_inode_init_once);
 	if (!ntfs_big_inode_cache) {
 		printk(KERN_CRIT "NTFS: Failed to create %s!\n",
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 75e115f..3c33bff 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -669,8 +669,7 @@ static int __init init_dlmfs_fs(void)
 
 	dlmfs_inode_cache = kmem_cache_create("dlmfs_inode_cache",
 				sizeof(struct dlmfs_inode_private),
-				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
-					SLAB_MEM_SPREAD),
+				0, (SLAB_HWCACHE_ALIGN|SLAB_INODE_CACHE),
 				dlmfs_init_once);
 	if (!dlmfs_inode_cache) {
 		status = -ENOMEM;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 56f0cb3..ca5e3a6 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -1794,9 +1794,7 @@ static int ocfs2_initialize_mem_caches(void)
 {
 	ocfs2_inode_cachep = kmem_cache_create("ocfs2_inode_cache",
 				       sizeof(struct ocfs2_inode_info),
-				       0,
-				       (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+				       0, (SLAB_HWCACHE_ALIGN|SLAB_INODE_CACHE),
 				       ocfs2_inode_init_once);
 	ocfs2_dquot_cachep = kmem_cache_create("ocfs2_dquot_cache",
 					sizeof(struct ocfs2_dquot),
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index ffcd04f..681c765 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -441,9 +441,7 @@ static int __init init_openprom_fs(void)
 
 	op_inode_cachep = kmem_cache_create("op_inode_cache",
 					    sizeof(struct op_inode_info),
-					    0,
-					    (SLAB_RECLAIM_ACCOUNT |
-					     SLAB_MEM_SPREAD),
+					    0, SLAB_INODE_CACHE,
 					    op_inode_init_once);
 	if (!op_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 9c2b5f4..ef1c2b3 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -82,8 +82,7 @@ void __init proc_init_inodecache(void)
 {
 	proc_inode_cachep = kmem_cache_create("proc_inode_cache",
 					     sizeof(struct proc_inode),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD|SLAB_PANIC),
+					     0, SLAB_INODE_CACHE | SLAB_PANIC,
 					     init_once);
 }
 
diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c
index 01bad30..48754c2 100644
--- a/fs/qnx4/inode.c
+++ b/fs/qnx4/inode.c
@@ -441,8 +441,7 @@ static int init_inodecache(void)
 {
 	qnx4_inode_cachep = kmem_cache_create("qnx4_inode_cache",
 					     sizeof(struct qnx4_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (qnx4_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index e15ff61..8514cfb 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -546,11 +546,8 @@ static void init_once(void *foo)
 static int init_inodecache(void)
 {
 	reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache",
-						  sizeof(struct
-							 reiserfs_inode_info),
-						  0, (SLAB_RECLAIM_ACCOUNT|
-							SLAB_MEM_SPREAD),
-						  init_once);
+					sizeof(struct reiserfs_inode_info),
+					0, SLAB_INODE_CACHE, init_once);
 	if (reiserfs_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/fs/romfs/super.c b/fs/romfs/super.c
index 2685805..26f66cd 100644
--- a/fs/romfs/super.c
+++ b/fs/romfs/super.c
@@ -619,8 +619,7 @@ static int __init init_romfs_fs(void)
 	romfs_inode_cachep =
 		kmem_cache_create("romfs_i",
 				  sizeof(struct romfs_inode_info), 0,
-				  SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
-				  romfs_i_init_once);
+				  SLAB_INODE_CACHE, romfs_i_init_once);
 
 	if (!romfs_inode_cachep) {
 		printk(KERN_ERR
diff --git a/fs/smbfs/inode.c b/fs/smbfs/inode.c
index f6e9ee5..03c1fda 100644
--- a/fs/smbfs/inode.c
+++ b/fs/smbfs/inode.c
@@ -78,8 +78,7 @@ static int init_inodecache(void)
 {
 	smb_inode_cachep = kmem_cache_create("smb_inode_cache",
 					     sizeof(struct smb_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (smb_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
index 07a4f11..3644be6 100644
--- a/fs/squashfs/super.c
+++ b/fs/squashfs/super.c
@@ -394,7 +394,8 @@ static int __init init_inodecache(void)
 {
 	squashfs_inode_cachep = kmem_cache_create("squashfs_inode_cache",
 		sizeof(struct squashfs_inode_info), 0,
-		SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, init_once);
+		SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+		init_once);
 
 	return squashfs_inode_cachep ? 0 : -ENOMEM;
 }
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index de44d06..8c96780 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -360,8 +360,7 @@ const struct super_operations sysv_sops = {
 int __init sysv_init_icache(void)
 {
 	sysv_inode_cachep = kmem_cache_create("sysv_inode_cache",
-			sizeof(struct sysv_inode_info), 0,
-			SLAB_RECLAIM_ACCOUNT|SLAB_MEM_SPREAD,
+			sizeof(struct sysv_inode_info), 0, SLAB_INODE_CACHE,
 			init_once);
 	if (!sysv_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 9a47c9f..623c71b 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2192,7 +2192,7 @@ static int __init ubifs_init(void)
 	err = -ENOMEM;
 	ubifs_inode_slab = kmem_cache_create("ubifs_inode_slab",
 				sizeof(struct ubifs_inode), 0,
-				SLAB_MEM_SPREAD | SLAB_RECLAIM_ACCOUNT,
+				SLAB_INODE_CACHE,
 				&inode_slab_ctor);
 	if (!ubifs_inode_slab)
 		goto out_reg;
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 76f3d6d..d286457 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -157,8 +157,7 @@ static int init_inodecache(void)
 {
 	udf_inode_cachep = kmem_cache_create("udf_inode_cache",
 					     sizeof(struct udf_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT |
-						 SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (!udf_inode_cachep)
 		return -ENOMEM;
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index 6b9be90..7babaa0 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -1428,8 +1428,7 @@ static int init_inodecache(void)
 {
 	ufs_inode_cachep = kmem_cache_create("ufs_inode_cache",
 					     sizeof(struct ufs_inode_info),
-					     0, (SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+					     0, SLAB_INODE_CACHE,
 					     init_once);
 	if (ufs_inode_cachep == NULL)
 		return -ENOMEM;
diff --git a/fs/xfs/linux-2.6/kmem.h b/fs/xfs/linux-2.6/kmem.h
index f7c8f7a..4e6b372 100644
--- a/fs/xfs/linux-2.6/kmem.h
+++ b/fs/xfs/linux-2.6/kmem.h
@@ -82,6 +82,7 @@ extern void *kmem_zalloc_greedy(size_t *, size_t, size_t);
 #define KM_ZONE_HWALIGN	SLAB_HWCACHE_ALIGN
 #define KM_ZONE_RECLAIM	SLAB_RECLAIM_ACCOUNT
 #define KM_ZONE_SPREAD	SLAB_MEM_SPREAD
+#define KM_ZONE_INODES	SLAB_INODE_CACHE
 
 #define kmem_zone	kmem_cache
 #define kmem_zone_t	struct kmem_cache
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index cf80878..2c731d4 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1720,8 +1720,8 @@ xfs_init_zones(void)
 
 	xfs_inode_zone =
 		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
-			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD,
-			xfs_fs_inode_init_once);
+					KM_ZONE_HWALIGN | KM_ZONE_INODES,
+					xfs_fs_inode_init_once);
 	if (!xfs_inode_zone)
 		goto out_destroy_efi_zone;
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 59260e2..1fea5f1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -93,6 +93,13 @@
 				(unsigned long)ZERO_SIZE_PTR)
 
 /*
+ * Set the default flags necessary for inode caches manipulated by the VFS.
+ */
+#define SLAB_INODE_CACHE	(SLAB_MEM_SPREAD | \
+				 SLAB_DESTROY_BY_RCU | \
+				 SLAB_RECLAIM_ACCOUNT)
+
+/*
  * struct kmem_cache related prototypes
  */
 void __init kmem_cache_init(void);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 3a61ffe..f5025b1 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1268,7 +1268,8 @@ static int __init init_mqueue_fs(void)
 
 	mqueue_inode_cachep = kmem_cache_create("mqueue_inode_cache",
 				sizeof(struct mqueue_inode_info), 0,
-				SLAB_HWCACHE_ALIGN, init_once);
+				SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+				init_once);
 	if (mqueue_inode_cachep == NULL)
 		return -ENOMEM;
 
diff --git a/mm/shmem.c b/mm/shmem.c
index f6d350e..c9794ea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2435,7 +2435,7 @@ static int init_inodecache(void)
 {
 	shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
 				sizeof(struct shmem_inode_info),
-				0, SLAB_PANIC, init_once);
+				0, SLAB_INODE_CACHE|SLAB_PANIC, init_once);
 	return 0;
 }
 
diff --git a/net/socket.c b/net/socket.c
index 5cac1c7..ee71f1e 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -288,12 +288,9 @@ static void init_once(void *foo)
 static int init_inodecache(void)
 {
 	sock_inode_cachep = kmem_cache_create("sock_inode_cache",
-					      sizeof(struct socket_alloc),
-					      0,
-					      (SLAB_HWCACHE_ALIGN |
-					       SLAB_RECLAIM_ACCOUNT |
-					       SLAB_MEM_SPREAD),
-					      init_once);
+			      sizeof(struct socket_alloc), 0,
+			      SLAB_HWCACHE_ALIGN | SLAB_INODE_CACHE,
+			      init_once);
 	if (sock_inode_cachep == NULL)
 		return -ENOMEM;
 	return 0;
diff --git a/net/sunrpc/rpc_pipe.c b/net/sunrpc/rpc_pipe.c
index 7df92d2..abf0236 100644
--- a/net/sunrpc/rpc_pipe.c
+++ b/net/sunrpc/rpc_pipe.c
@@ -1056,9 +1056,8 @@ int register_rpc_pipefs(void)
 	int err;
 
 	rpc_inode_cachep = kmem_cache_create("rpc_inode_cache",
-				sizeof(struct rpc_inode),
-				0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
-						SLAB_MEM_SPREAD),
+				sizeof(struct rpc_inode), 0,
+				SLAB_HWCACHE_ALIGN|SLAB_INODE_CACHE,
 				init_once);
 	if (!rpc_inode_cachep)
 		return -ENOMEM;
-- 
1.7.1



* [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-01  5:33 fs: inode freeing and hash lookup via RCU Dave Chinner
  2010-11-01  5:33 ` [PATCH 1/3] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
  2010-11-01  5:33 ` [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
@ 2010-11-01  5:33 ` Dave Chinner
  2010-11-01  9:38     ` Eric Dumazet
  2010-11-16 23:56   ` Paul E. McKenney
  2 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2010-11-01  5:33 UTC (permalink / raw)
  To: viro; +Cc: linux-fsdevel, linux-kernel

From: Dave Chinner <dchinner@redhat.com>

Now that inodes are using RCU freeing, we can walk the hash lists
using RCU protection during lookups. Convert all the hash list
operations to use RCU-based operators and drop the inode_hash_lock
around pure lookup operations.
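
The lookup side then becomes the usual RCU pattern, with i_lock taken to
stabilise a candidate inode before it is used (a simplified sketch; see
the diff for the full set of checks):

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);
		/* ... revalidate i_state, __iget(), ... */
		spin_unlock(&inode->i_lock);
	}
	rcu_read_unlock();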

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/inode.c |   89 ++++++++++++++++++++++++++++++++++++++---------------------
 1 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 106ec7a..6bead3d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -50,11 +50,12 @@
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode_sb_list_lock
- *   inode->i_lock
+ *   rcu_read_lock
+ *     inode_sb_list_lock
+ *     inode->i_lock
  *
  * iunique_lock
- *   inode_hash_lock
+ *   rcu_read_lock
  */
 
 /*
@@ -413,7 +414,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_add_head_rcu(&inode->i_hash, b);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -429,7 +430,7 @@ void remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_del_init(&inode->i_hash);
+	hlist_del_init_rcu(&inode->i_hash);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -741,26 +742,38 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
 				struct hlist_head *head,
 				int (*test)(struct inode *, void *),
-				void *data)
+				void *data, bool locked)
 {
 	struct hlist_node *node;
 	struct inode *inode = NULL;
 
 repeat:
+	rcu_read_lock();
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -769,26 +782,39 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_head *head, unsigned long ino,
+				bool locked)
 {
 	struct hlist_node *node;
 	struct inode *inode = NULL;
 
 repeat:
+	rcu_read_lock();
 	hlist_for_each_entry(inode, node, head, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
 		return inode;
 	}
+	rcu_read_unlock();
 	return NULL;
 }
 
@@ -913,14 +939,14 @@ static struct inode *get_new_inode(struct super_block *sb,
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode(sb, head, test, data);
+		old = find_inode(sb, head, test, data, true);
 		if (!old) {
 			if (set(inode, data))
 				goto set_failed;
 
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
 			spin_unlock(&inode_hash_lock);
@@ -964,12 +990,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
 
 		spin_lock(&inode_hash_lock);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, head, ino, true);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
 			spin_unlock(&inode_hash_lock);
@@ -1006,15 +1032,22 @@ static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 	struct hlist_node *node;
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, node, b, i_hash) {
-		if (inode->i_ino == ino && inode->i_sb == sb) {
-			spin_unlock(&inode_hash_lock);
-			return 0;
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, node, b, i_hash) {
+		if (inode->i_ino != ino)
+			continue;
+		if (inode->i_sb != sb)
+			continue;
+		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
 		}
+		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
+		return 0;
 	}
-	spin_unlock(&inode_hash_lock);
-
+	rcu_read_unlock();
 	return 1;
 }
 
@@ -1099,15 +1132,12 @@ static struct inode *ifind(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
+	inode = find_inode(sb, head, test, data, false);
 	if (inode) {
-		spin_unlock(&inode_hash_lock);
 		if (likely(wait))
 			wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_hash_lock);
 	return NULL;
 }
 
@@ -1131,14 +1161,11 @@ static struct inode *ifind_fast(struct super_block *sb,
 {
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
+	inode = find_inode_fast(sb, head, ino, false);
 	if (inode) {
-		spin_unlock(&inode_hash_lock);
 		wait_on_inode(inode);
 		return inode;
 	}
-	spin_unlock(&inode_hash_lock);
 	return NULL;
 }
 
@@ -1301,7 +1328,7 @@ int insert_inode_locked(struct inode *inode)
 		struct hlist_node *node;
 		struct inode *old = NULL;
 		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_for_each_entry_rcu(old, node, head, i_hash) {
 			if (old->i_ino != ino)
 				continue;
 			if (old->i_sb != sb)
@@ -1316,7 +1343,7 @@ int insert_inode_locked(struct inode *inode)
 		if (likely(!node)) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
@@ -1345,7 +1372,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		struct inode *old = NULL;
 
 		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, node, head, i_hash) {
+		hlist_for_each_entry_rcu(old, node, head, i_hash) {
 			if (old->i_sb != sb)
 				continue;
 			if (!test(old, data))
@@ -1360,7 +1387,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		if (likely(!node)) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW;
-			hlist_add_head(&inode->i_hash, head);
+			hlist_add_head_rcu(&inode->i_hash, head);
 			spin_unlock(&inode->i_lock);
 			spin_unlock(&inode_hash_lock);
 			return 0;
@@ -1646,10 +1673,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
 	schedule();
 	finish_wait(wq, &wait.wait);
-	spin_lock(&inode_hash_lock);
 }
 
 static __initdata unsigned long ihash_entries;
-- 
1.7.1



* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-01  5:33 ` [PATCH 3/3] fs: rcu protect inode hash lookups Dave Chinner
@ 2010-11-01  9:38     ` Eric Dumazet
  2010-11-16 23:56   ` Paul E. McKenney
  1 sibling, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2010-11-01  9:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel, Paul E. McKenney

On Monday 01 November 2010 at 16:33 +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that inodes are using RCU freeing, we can walk the hash lists
> using RCU protection during lookups. Convert all the hash list
> operations to use RCU-based operators and drop the inode_hash_lock
> around pure lookup operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

You probably should copy Paul on this stuff, I added him in Cc, because
SLAB_DESTROY_BY_RCU is really tricky, and Paul review is a must.

> ---
>  fs/inode.c |   89 ++++++++++++++++++++++++++++++++++++++---------------------
>  1 files changed, 57 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 106ec7a..6bead3d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -50,11 +50,12 @@
>   *   inode->i_lock
>   *
>   * inode_hash_lock
> - *   inode_sb_list_lock
> - *   inode->i_lock
> + *   rcu_read_lock
> + *     inode_sb_list_lock
> + *     inode->i_lock
>   *
>   * iunique_lock
> - *   inode_hash_lock
> + *   rcu_read_lock
>   */
>  
>  /*
> @@ -413,7 +414,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
>  
>  	spin_lock(&inode_hash_lock);
>  	spin_lock(&inode->i_lock);
> -	hlist_add_head(&inode->i_hash, b);
> +	hlist_add_head_rcu(&inode->i_hash, b);
>  	spin_unlock(&inode->i_lock);
>  	spin_unlock(&inode_hash_lock);
>  }
> @@ -429,7 +430,7 @@ void remove_inode_hash(struct inode *inode)
>  {
>  	spin_lock(&inode_hash_lock);
>  	spin_lock(&inode->i_lock);
> -	hlist_del_init(&inode->i_hash);
> +	hlist_del_init_rcu(&inode->i_hash);
>  	spin_unlock(&inode->i_lock);
>  	spin_unlock(&inode_hash_lock);
>  }
> @@ -741,26 +742,38 @@ static void __wait_on_freeing_inode(struct inode *inode);
>  static struct inode *find_inode(struct super_block *sb,
>  				struct hlist_head *head,
>  				int (*test)(struct inode *, void *),
> -				void *data)
> +				void *data, bool locked)
>  {
>  	struct hlist_node *node;
>  	struct inode *inode = NULL;
>  
>  repeat:
> +	rcu_read_lock();
>  	hlist_for_each_entry(inode, node, head, i_hash) {
>  		if (inode->i_sb != sb)
>  			continue;
>  		if (!test(inode, data))
>  			continue;
>  		spin_lock(&inode->i_lock);

The problem with SLAB_DESTROY_BY_RCU is that the inode can be freed and
reused immediately (no grace period) by another cpu.

So you need to recheck test(inode, data) _after_ getting a stable
reference on the inode (spin_lock() in this case), to make sure you
indeed found the inode you are looking for, not another one.
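
Something like this (an untested sketch, find_inode() shown; the fast
variant is analogous):

	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		if (!test(inode, data))
			continue;
		spin_lock(&inode->i_lock);
		/*
		 * The object may have been freed and reused since the
		 * unlocked checks above; i_lock now pins it, so recheck
		 * its identity before trusting it.
		 */
		if (inode->i_sb != sb || !test(inode, data)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		...
	}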

The test on inode->i_sb != sb can be omitted _if_ each sb has its own
kmem_cache (but I am not sure; please check whether this is the case).

Also, you should make sure inode allocation is careful not to overwrite
some fields (the i_lock in particular), since doing so could break a
concurrent lookup. This is really tricky: you cannot use
spin_lock_init(&inode->i_lock) in inode_init_always() anymore.

You can read Documentation/RCU/rculist_nulls.txt for some documentation
I wrote when adding SLAB_DESTROY_BY_RCU to UDP/TCP sockets. The sockets'
stable reference is not a spinlock but a refcount, so it was easier to
initialise. With a spinlock, I believe you might need to use a SLAB
constructor, to initialise the spinlock only on fresh objects, not on
reused ones.
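
That would look roughly like this (hypothetical name, untested):

	static void inode_ctor(void *obj)
	{
		struct inode *inode = obj;

		/*
		 * A slab constructor runs only when the backing page is
		 * first populated, not on every allocation, so the lock
		 * is never reinitialised under a concurrent RCU lookup
		 * that still holds a pointer to a recycled object.
		 */
		spin_lock_init(&inode->i_lock);
	}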

> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> +			rcu_read_unlock();
> +			if (locked)
> +				spin_unlock(&inode_hash_lock);
>  			__wait_on_freeing_inode(inode);
> +			if (locked)
> +				spin_lock(&inode_hash_lock);
>  			goto repeat;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
>  		return inode;
>  	}
> +	rcu_read_unlock();
>  	return NULL;
>  }
>  
> @@ -769,26 +782,39 @@ repeat:
>   * iget_locked for details.
>   */
>  static struct inode *find_inode_fast(struct super_block *sb,
> -				struct hlist_head *head, unsigned long ino)
> +				struct hlist_head *head, unsigned long ino,
> +				bool locked)
>  {
>  	struct hlist_node *node;
>  	struct inode *inode = NULL;
>  
>  repeat:
> +	rcu_read_lock();
>  	hlist_for_each_entry(inode, node, head, i_hash) {
>  		if (inode->i_ino != ino)
>  			continue;
>  		if (inode->i_sb != sb)
>  			continue;
>  		spin_lock(&inode->i_lock);

same here, you must recheck if (inode->i_ino != ino)

> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> +			rcu_read_unlock();
> +			if (locked)
> +				spin_unlock(&inode_hash_lock);
>  			__wait_on_freeing_inode(inode);
> +			if (locked)
> +				spin_lock(&inode_hash_lock);
>  			goto repeat;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
>  		return inode;
>  	}
> +	rcu_read_unlock();
>  	return NULL;
>  }
>  
> @@ -913,14 +939,14 @@ static struct inode *get_new_inode(struct super_block *sb,
>  
>  		spin_lock(&inode_hash_lock);
>  		/* We released the lock, so.. */
> -		old = find_inode(sb, head, test, data);
> +		old = find_inode(sb, head, test, data, true);
>  		if (!old) {
>  			if (set(inode, data))
>  				goto set_failed;
>  
>  			spin_lock(&inode->i_lock);
>  			inode->i_state = I_NEW;
> -			hlist_add_head(&inode->i_hash, head);
> +			hlist_add_head_rcu(&inode->i_hash, head);
>  			spin_unlock(&inode->i_lock);
>  			inode_sb_list_add(inode);
>  			spin_unlock(&inode_hash_lock);
> @@ -964,12 +990,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
>  
>  		spin_lock(&inode_hash_lock);
>  		/* We released the lock, so.. */
> -		old = find_inode_fast(sb, head, ino);
> +		old = find_inode_fast(sb, head, ino, true);
>  		if (!old) {
>  			inode->i_ino = ino;
>  			spin_lock(&inode->i_lock);
>  			inode->i_state = I_NEW;
> -			hlist_add_head(&inode->i_hash, head);
> +			hlist_add_head_rcu(&inode->i_hash, head);
>  			spin_unlock(&inode->i_lock);
>  			inode_sb_list_add(inode);
>  			spin_unlock(&inode_hash_lock);
> @@ -1006,15 +1032,22 @@ static int test_inode_iunique(struct super_block *sb, unsigned long ino)
>  	struct hlist_node *node;
>  	struct inode *inode;
>  
> -	spin_lock(&inode_hash_lock);
> -	hlist_for_each_entry(inode, node, b, i_hash) {
> -		if (inode->i_ino == ino && inode->i_sb == sb) {
> -			spin_unlock(&inode_hash_lock);
> -			return 0;
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(inode, node, b, i_hash) {
> +		if (inode->i_ino != ino)
> +			continue;
> +		if (inode->i_sb != sb)
> +			continue;
> +		spin_lock(&inode->i_lock);

same here

> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
>  		}
> +		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
> +		return 0;
>  	}
> -	spin_unlock(&inode_hash_lock);
> -
> +	rcu_read_unlock();
>  	return 1;
>  }




* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-01  9:38     ` Eric Dumazet
@ 2010-11-01 13:44       ` Dave Chinner
  -1 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2010-11-01 13:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: viro, linux-fsdevel, linux-kernel, Paul E. McKenney

On Mon, Nov 01, 2010 at 10:38:07AM +0100, Eric Dumazet wrote:
> On Monday 01 November 2010 at 16:33 +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Now that inodes are using RCU freeing, we can walk the hash lists
> > using RCU protection during lookups. Convert all the hash list
> > operations to use RCU-based operators and drop the inode_hash_lock
> > around pure lookup operations.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> You probably should copy Paul on this stuff, I added him in Cc, because
> SLAB_DESTROY_BY_RCU is really tricky, and Paul review is a must.
>
> >  repeat:
> > +	rcu_read_lock();
> >  	hlist_for_each_entry(inode, node, head, i_hash) {
> >  		if (inode->i_sb != sb)
> >  			continue;
> >  		if (!test(inode, data))
> >  			continue;
> >  		spin_lock(&inode->i_lock);
> 
> Problem with SLAB_DESTROY_BY_RCU is the inode can be freed, and reused
> immediately (no grace period) by another cpu.
> 
> So you need to recheck test(inode, data) _after_ getting a stable
> reference on the inode (spin_lock() in this case), to make sure you
> indeed found the inode you are looking for, not another one.

Possibly. The test callback is a private callback to determine
whether it is indeed the inode the caller is looking for. I need to
take a deeper look at what ordering is required for this callback.

> The test on inode->i_sb != sb can be omitted, _if_ each sb has its own
> kmem_cache (but I am not sure, please check if this is the case)

There's a slab cache per filesystem type, not per filesystem, so the
check is necessary.

> Also, you should make sure the allocation of inode is careful of not
> overwriting some fields (the i_lock in particular), since you could
> break a concurrent lookup. This is really tricky, you cannot use
> spin_lock_init(&inode->i_lock) anymore in inode_init_always().

Yes, I missed that one. Good catch.  I'm used to the XFS code where
most locks are initialised only once in the slab constructor....

The other fields of note:

	i_sb: overwritten in inode_init_always(). Should be safe
	simply by rechecking after validating that the inode is not
	in the freed state, as you suggest.
	i_ino: overwritten just before the inode is re-inserted into
	the hash. Redo the check as for i_sb.
	i_state: initialised atomically with the hash insert via i_lock.
	i_hash: inserted into the hash list under i_lock.

My intent is that the i_state/i_hash atomicity acts as the real
guard against reusing a freed inode, but you are right that the
other fields need to be rechecked for validity after establishing
that it is not a freed inode.
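
To make that concrete, the lookup would end up looking something
like this - a sketch only, with the I_FREEING/I_WILL_FREE wait path
omitted and the full identity recheck redone under i_lock as you
suggest:

static struct inode *find_inode(struct super_block *sb,
				struct hlist_head *head,
				int (*test)(struct inode *, void *),
				void *data)
{
	struct hlist_node *node;
	struct inode *inode;

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		if (!test(inode, data))
			continue;
		spin_lock(&inode->i_lock);
		/*
		 * i_lock is held, so i_state and i_hash are stable.
		 * Recheck everything that was read racily above, in
		 * case the slab recycled this object under us.
		 */
		if (inode_unhashed(inode) ||
		    inode->i_sb != sb || !test(inode, data)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		rcu_read_unlock();
		return inode;
	}
	rcu_read_unlock();
	return NULL;
}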

> You can read Documentation/RCU/rculist_nulls.txt for some doc I wrote
> when adding SLAB_DESTROY_BY_RCU to UDP/TCP sockets. Sockets stable

Perhaps you should rename that file "slab_destroy_by_rcu-tips.txt",
because the current name seems unrelated to the contents. :/

> reference is not a spinlock, but a refcount, so it was easier to init
> this refcount. With a spinlock, I believe you might need to use SLAB
> constructor, to initialize the spinlock only on fresh objects, not on
> reused ones.

Yeah, that is what I intended.
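
Concretely, something like this - a sketch only, with the one-time
initialisation done in a slab constructor so that i_lock survives
object reuse (the names here are illustrative, not the existing
fs/inode.c ones):

static void inode_ctor(void *obj)
{
	struct inode *inode = obj;

	/*
	 * Runs once per object when its slab page is populated,
	 * never on object reuse, so a concurrent RCU lookup spinning
	 * on the i_lock of a recycled inode stays safe.
	 */
	spin_lock_init(&inode->i_lock);
	INIT_HLIST_NODE(&inode->i_hash);
}

inode_cachep = kmem_cache_create("inode_cache", sizeof(struct inode),
				 0, SLAB_DESTROY_BY_RCU | SLAB_PANIC,
				 inode_ctor);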

Thanks for the comments, Eric.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-01 13:44       ` Dave Chinner
@ 2010-11-01 15:29         ` Eric Dumazet
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2010-11-01 15:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel, Paul E. McKenney

Le mardi 02 novembre 2010 à 00:44 +1100, Dave Chinner a écrit :

> Perhaps you should rename that file "slab_destroy_by_rcu-tips.txt",
> because the current name seems unrelated to the contents. :/
> 

Hmm, I don't know - this doc really is about the nulls thing.

This stuff also addressed one problem I forgot to tell you about:
during a lookup, you might find an item that has been moved to
another chain by another cpu, so your lookup is redirected to that
other chain, and you can miss your target.

You must find a way to detect such a move so you can restart the
lookup at the beginning (of the right chain).

Either you stick the chain number in a new inode field (which costs
extra memory), or you use the 'nulls' value, which lets you know
whether you ended your lookup in the right chain.
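
Roughly, following the pattern in that file - a sketch that assumes
the inode hash is converted to hlist_nulls, with a hypothetical
i_hash_nulls field and each chain's nulls value set to its slot
number:

	struct inode *inode;
	struct hlist_nulls_node *node;

	rcu_read_lock();
begin:
	hlist_nulls_for_each_entry_rcu(inode, node,
				       &inode_hashtable[slot],
				       i_hash_nulls) {
		if (inode->i_ino == ino && inode->i_sb == sb) {
			/* lock, revalidate identity, take a reference */
		}
	}
	/*
	 * The nulls value we terminated on encodes the chain we
	 * ended up in; if it is not our slot, the walk was redirected
	 * by a concurrent move and must be restarted.
	 */
	if (get_nulls_value(node) != slot)
		goto begin;
	rcu_read_unlock();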




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU
  2010-11-01  5:33 ` [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
@ 2010-11-01 15:31   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2010-11-01 15:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel

On Mon, Nov 01, 2010 at 04:33:43PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Change the inode caches to use RCU freed inodes via the SLAB_DESTROY_BY_RCU
> method. While touching the slab creation functions for the inode caches, fix
> all of the slabs to use consistent flags by defining SLAB_INODES to be the set
> of common inode cache flags

I think this should be done as a separate preparatory patch.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-01 15:29         ` Eric Dumazet
@ 2010-11-02  0:01           ` Dave Chinner
  -1 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2010-11-02  0:01 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: viro, linux-fsdevel, linux-kernel, Paul E. McKenney

On Mon, Nov 01, 2010 at 04:29:22PM +0100, Eric Dumazet wrote:
> Le mardi 02 novembre 2010 à 00:44 +1100, Dave Chinner a écrit :
> 
> > Perhaps you should rename that file "slab_destroy_by_rcu-tips.txt",
> > because the current name seems unrelated to the contents. :/
> > 
> 
> Hmm, I dont know, this doc really is about the nulls thing.

OK, now I understand - there's a new list type that I didn't know
about called hlist_nulls. Not surprising - it's not documented
anywhere.

Maybe explicitly describing what the hlist_nulls pattern is, or a
pointer to include/linux/list_nulls.h, might be appropriate, because
I managed to read that documentation without realising that it was
referring to a specific type of list that was already implemented
rather than a simple marker technique.

> This stuff also addressed one problem I forgot to tell you about:
> during a lookup, you might find an item that has been moved to
> another chain by another cpu, so your lookup is redirected to that
> other chain, and you can miss your target.

<groan>

So, to go to per-chain locks as per the proposed
bit-lock-on-the-low-bit-of-the-head-pointer infrastructure, we'll
have to cross that with the hlist_nulls code that plays low-bit
pointer games for detecting the end of the chain.

That's just messy - another hash-chain-specific scalability hackup.
My dislike of using hash tables for unbounded caches is not
improved by this....

> You must find a way to detect such a move so you can restart the
> lookup at the beginning (of the right chain).  Either you stick the
> chain number in a new inode field (which costs extra memory), or
> you use the 'nulls' value, which lets you know whether you ended
> your lookup in the right chain.

The chain is determined by hashing the inode number. Perhaps a
simple enough test is to hash the last inode number on a cache miss,
and if that doesn't match the hash of the lookup key, redo the
search. That seems to me like it would avoid needing to play games
with termination markers - does that sound reasonable?
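
In code, something like this - a sketch, where hash() stands for the
existing chain-selection function in fs/inode.c and i_ino is
snapshotted into a local so nothing is refetched through a pointer
whose target may itself have moved:

	struct hlist_node *node;
	struct inode *inode;
	unsigned long last_ino;

	rcu_read_lock();
again:
	last_ino = 0;
	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		last_ino = inode->i_ino;	/* snapshot, never refetch */
		if (last_ino != ino)
			continue;
		if (inode->i_sb != sb)
			continue;
		/* candidate match: take i_lock and revalidate as usual */
	}
	/*
	 * Cache miss. If the last entry seen hashes to a different
	 * chain, we were walked onto the wrong chain by a concurrent
	 * re-insert; retry from the correct chain head.
	 */
	if (last_ino && hash(sb, last_ino) != head)
		goto again;
	rcu_read_unlock();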

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-02  0:01           ` Dave Chinner
@ 2010-11-02  4:46             ` Eric Dumazet
  -1 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2010-11-02  4:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel, Paul E. McKenney

Le mardi 02 novembre 2010 à 11:01 +1100, Dave Chinner a écrit :

> So, to go to per-chain locks as per the proposed
> bit-lock-on-the-low-bit-of-the-head-pointer infrastructure, we'll
> have to cross that with the hlist_nulls code that plays low-bit
> pointer games for detecting the end of the chain.
> 

So what? Just use another bit for the lock; that still leaves 30
bits for the nulls value (62 bits on 64-bit arches). Anyway, the bit
lock is rejected by the RT guys, so why do you still care?


> The chain is determined by hashing the inode number. Perhaps a
> simple enough test is to hash the last inode number on a cache miss,
> and if that doesn't match the hash of the lookup key, redo the
> search. That seems to me like it would avoid needing to play games
> with termination markers - does that sound reasonable?
> 

Sure - if the inode number is the only key, it can be fetched
atomically, so it should be good even without taking a stable
reference on the last item. It could become a problem if i_ino ever
grows into a larger field...

Note: you'll have to remember the last item's i_ino in a variable,
not refetch it through a 'last item' pointer (that last item could
also change chains...).

For TCP, the hash key is a complex hash function of many fields that
requires taking a reference before computing it, so this was not an
option.




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-02  0:01           ` Dave Chinner
@ 2010-11-02 12:11             ` Paul E. McKenney
  -1 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2010-11-02 12:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Dumazet, viro, linux-fsdevel, linux-kernel

On Tue, Nov 02, 2010 at 11:01:52AM +1100, Dave Chinner wrote:
> On Mon, Nov 01, 2010 at 04:29:22PM +0100, Eric Dumazet wrote:
> > Le mardi 02 novembre 2010 à 00:44 +1100, Dave Chinner a écrit :
> > 
> > > Perhaps you should rename that file "slab_destroy_by_rcu-tips.txt",
> > > because the current name seems unrelated to the contents. :/
> > > 
> > 
> > Hmm, I don't know - this doc really is about the nulls thing.
> 
> OK, now I understand - there's a new list type that I didn't know
> about called hlist_nulls. Not surprising - it's not documented
> anywhere.
> 
> Maybe explicitly describing what the hlist_nulls pattern is, or a
> pointer to include/linux/list_nulls.h, might be appropriate, because
> I managed to read that documentation without realising that it was
> referring to a specific type of list that was already implemented
> rather than a simple marker technique.
> 
> > This stuff also addressed one problem I forgot to tell you about:
> > during a lookup, you might find an item that has been moved to
> > another chain by another cpu, so your lookup is redirected to that
> > other chain, and you can miss your target.
> 
> <groan>
> 
> So, to go to per-chain locks as per the proposed
> bit-lock-on-the-low-bit-of-the-head-pointer infrastructure, we'll
> have to cross that with the hlist_nulls code that plays low-bit
> pointer games for detecting the end of the chain.
> 
> That's just messy - another hash-chain-specific scalability hackup.
> My dislike of using hash tables for unbounded caches is not
> improved by this....

Indeed, using call_rcu() or synchronize_rcu() guarantees that the identity
of a given RCU-protected structure will not change while you remain in a
given RCU read-side critical section.  In contrast, SLAB_DESTROY_BY_RCU
only guarantees that the type of the object will remain the same.  So this
is the usual complexity/speed tradeoff.  Use of call_rcu() gives the
freed memory more time to grow cache-cold, and synchronize_rcu()
further blocks the caller for several milliseconds, but both allow
much simpler identity checks, often permitting you to dispense with
them entirely.

In contrast, SLAB_DESTROY_BY_RCU avoids cache-coldness, but requires
much more careful identity checks.
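
For comparison, the call_rcu() variant is structurally trivial - a
sketch, assuming a struct rcu_head i_rcu is added to struct inode:

static void inode_free_rcu(struct rcu_head *head)
{
	struct inode *inode = container_of(head, struct inode, i_rcu);

	kmem_cache_free(inode_cachep, inode);
}

/* in the freeing path, replacing the direct kmem_cache_free(): */
call_rcu(&inode->i_rcu, inode_free_rcu);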

> > You must find a way to detect such a move so you can restart the
> > lookup at the beginning (of the right chain).  Either you stick
> > the chain number in a new inode field (which costs extra memory),
> > or you use the 'nulls' value, which lets you know whether you
> > ended your lookup in the right chain.
> 
> The chain is determined by hashing the inode number. Perhaps a
> simple enough test is to hash the last inode number on a cache miss,
> and if that doesn't match the hash of the lookup key, redo the
> search. That seems to me like it would avoid needing to play games
> with termination markers - does that sound reasonable?

As long as you do the test under a lock that prevents the inode's identity
from changing.  Not sure what you should do about hash collisions, though.
Perhaps recheck the incoming pointer?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/3] fs: rcu protect inode hash lookups
  2010-11-01  5:33 ` [PATCH 3/3] fs: rcu protect inode hash lookups Dave Chinner
  2010-11-01  9:38     ` Eric Dumazet
@ 2010-11-16 23:56   ` Paul E. McKenney
  1 sibling, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2010-11-16 23:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: viro, linux-fsdevel, linux-kernel

On Mon, Nov 01, 2010 at 04:33:44PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that inodes are using RCU freeing, we can walk the hash lists
> using RCU protection during lookups. Convert all the hash list
> operations to use RCU-based operators and drop the inode_hash_lock
> around pure lookup operations.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/inode.c |   89 ++++++++++++++++++++++++++++++++++++++---------------------
>  1 files changed, 57 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 106ec7a..6bead3d 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -50,11 +50,12 @@
>   *   inode->i_lock
>   *
>   * inode_hash_lock
> - *   inode_sb_list_lock
> - *   inode->i_lock
> + *   rcu_read_lock
> + *     inode_sb_list_lock
> + *     inode->i_lock
>   *
>   * iunique_lock
> - *   inode_hash_lock
> + *   rcu_read_lock
>   */
> 
>  /*
> @@ -413,7 +414,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
> 
>  	spin_lock(&inode_hash_lock);
>  	spin_lock(&inode->i_lock);
> -	hlist_add_head(&inode->i_hash, b);
> +	hlist_add_head_rcu(&inode->i_hash, b);
>  	spin_unlock(&inode->i_lock);
>  	spin_unlock(&inode_hash_lock);
>  }
> @@ -429,7 +430,7 @@ void remove_inode_hash(struct inode *inode)
>  {
>  	spin_lock(&inode_hash_lock);
>  	spin_lock(&inode->i_lock);
> -	hlist_del_init(&inode->i_hash);
> +	hlist_del_init_rcu(&inode->i_hash);
>  	spin_unlock(&inode->i_lock);
>  	spin_unlock(&inode_hash_lock);
>  }
> @@ -741,26 +742,38 @@ static void __wait_on_freeing_inode(struct inode *inode);
>  static struct inode *find_inode(struct super_block *sb,
>  				struct hlist_head *head,
>  				int (*test)(struct inode *, void *),
> -				void *data)
> +				void *data, bool locked)
>  {
>  	struct hlist_node *node;
>  	struct inode *inode = NULL;
> 
>  repeat:
> +	rcu_read_lock();
>  	hlist_for_each_entry(inode, node, head, i_hash) {

The above needs to be hlist_for_each_entry_rcu(), correct?

This is needed even in the SLAB_DESTROY_BY_RCU case, because you
are still inserting elements into this list concurrently with
readers traversing it.

That said, this seems to be replaced by hlist_bl_for_each_entry()
in your git tree with the caller doing hlist_bl_lock(), though it is
entirely possible that I am looking at the wrong branch.  I feel
slow and late to the party...  ;-)

							Thanx, Paul

>  		if (inode->i_sb != sb)
>  			continue;
>  		if (!test(inode, data))
>  			continue;
>  		spin_lock(&inode->i_lock);
> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> +			rcu_read_unlock();
> +			if (locked)
> +				spin_unlock(&inode_hash_lock);
>  			__wait_on_freeing_inode(inode);
> +			if (locked)
> +				spin_lock(&inode_hash_lock);
>  			goto repeat;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
>  		return inode;
>  	}
> +	rcu_read_unlock();
>  	return NULL;
>  }
> 
> @@ -769,26 +782,39 @@ repeat:
>   * iget_locked for details.
>   */
>  static struct inode *find_inode_fast(struct super_block *sb,
> -				struct hlist_head *head, unsigned long ino)
> +				struct hlist_head *head, unsigned long ino,
> +				bool locked)
>  {
>  	struct hlist_node *node;
>  	struct inode *inode = NULL;
> 
>  repeat:
> +	rcu_read_lock();
>  	hlist_for_each_entry(inode, node, head, i_hash) {
>  		if (inode->i_ino != ino)
>  			continue;
>  		if (inode->i_sb != sb)
>  			continue;
>  		spin_lock(&inode->i_lock);
> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
>  		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
> +			rcu_read_unlock();
> +			if (locked)
> +				spin_unlock(&inode_hash_lock);
>  			__wait_on_freeing_inode(inode);
> +			if (locked)
> +				spin_lock(&inode_hash_lock);
>  			goto repeat;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
>  		return inode;
>  	}
> +	rcu_read_unlock();
>  	return NULL;
>  }
> 
> @@ -913,14 +939,14 @@ static struct inode *get_new_inode(struct super_block *sb,
> 
>  		spin_lock(&inode_hash_lock);
>  		/* We released the lock, so.. */
> -		old = find_inode(sb, head, test, data);
> +		old = find_inode(sb, head, test, data, true);
>  		if (!old) {
>  			if (set(inode, data))
>  				goto set_failed;
> 
>  			spin_lock(&inode->i_lock);
>  			inode->i_state = I_NEW;
> -			hlist_add_head(&inode->i_hash, head);
> +			hlist_add_head_rcu(&inode->i_hash, head);
>  			spin_unlock(&inode->i_lock);
>  			inode_sb_list_add(inode);
>  			spin_unlock(&inode_hash_lock);
> @@ -964,12 +990,12 @@ static struct inode *get_new_inode_fast(struct super_block *sb,
> 
>  		spin_lock(&inode_hash_lock);
>  		/* We released the lock, so.. */
> -		old = find_inode_fast(sb, head, ino);
> +		old = find_inode_fast(sb, head, ino, true);
>  		if (!old) {
>  			inode->i_ino = ino;
>  			spin_lock(&inode->i_lock);
>  			inode->i_state = I_NEW;
> -			hlist_add_head(&inode->i_hash, head);
> +			hlist_add_head_rcu(&inode->i_hash, head);
>  			spin_unlock(&inode->i_lock);
>  			inode_sb_list_add(inode);
>  			spin_unlock(&inode_hash_lock);
> @@ -1006,15 +1032,22 @@ static int test_inode_iunique(struct super_block *sb, unsigned long ino)
>  	struct hlist_node *node;
>  	struct inode *inode;
> 
> -	spin_lock(&inode_hash_lock);
> -	hlist_for_each_entry(inode, node, b, i_hash) {
> -		if (inode->i_ino == ino && inode->i_sb == sb) {
> -			spin_unlock(&inode_hash_lock);
> -			return 0;
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(inode, node, b, i_hash) {
> +		if (inode->i_ino != ino)
> +			continue;
> +		if (inode->i_sb != sb)
> +			continue;
> +		spin_lock(&inode->i_lock);
> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
>  		}
> +		spin_unlock(&inode->i_lock);
> +		rcu_read_unlock();
> +		return 0;
>  	}
> -	spin_unlock(&inode_hash_lock);
> -
> +	rcu_read_unlock();
>  	return 1;
>  }
> 
> @@ -1099,15 +1132,12 @@ static struct inode *ifind(struct super_block *sb,
>  {
>  	struct inode *inode;
> 
> -	spin_lock(&inode_hash_lock);
> -	inode = find_inode(sb, head, test, data);
> +	inode = find_inode(sb, head, test, data, false);
>  	if (inode) {
> -		spin_unlock(&inode_hash_lock);
>  		if (likely(wait))
>  			wait_on_inode(inode);
>  		return inode;
>  	}
> -	spin_unlock(&inode_hash_lock);
>  	return NULL;
>  }
> 
> @@ -1131,14 +1161,11 @@ static struct inode *ifind_fast(struct super_block *sb,
>  {
>  	struct inode *inode;
> 
> -	spin_lock(&inode_hash_lock);
> -	inode = find_inode_fast(sb, head, ino);
> +	inode = find_inode_fast(sb, head, ino, false);
>  	if (inode) {
> -		spin_unlock(&inode_hash_lock);
>  		wait_on_inode(inode);
>  		return inode;
>  	}
> -	spin_unlock(&inode_hash_lock);
>  	return NULL;
>  }
> 
> @@ -1301,7 +1328,7 @@ int insert_inode_locked(struct inode *inode)
>  		struct hlist_node *node;
>  		struct inode *old = NULL;
>  		spin_lock(&inode_hash_lock);
> -		hlist_for_each_entry(old, node, head, i_hash) {
> +		hlist_for_each_entry_rcu(old, node, head, i_hash) {
>  			if (old->i_ino != ino)
>  				continue;
>  			if (old->i_sb != sb)
> @@ -1316,7 +1343,7 @@ int insert_inode_locked(struct inode *inode)
>  		if (likely(!node)) {
>  			spin_lock(&inode->i_lock);
>  			inode->i_state |= I_NEW;
> -			hlist_add_head(&inode->i_hash, head);
> +			hlist_add_head_rcu(&inode->i_hash, head);
>  			spin_unlock(&inode->i_lock);
>  			spin_unlock(&inode_hash_lock);
>  			return 0;
> @@ -1345,7 +1372,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
>  		struct inode *old = NULL;
> 
>  		spin_lock(&inode_hash_lock);
> -		hlist_for_each_entry(old, node, head, i_hash) {
> +		hlist_for_each_entry_rcu(old, node, head, i_hash) {
>  			if (old->i_sb != sb)
>  				continue;
>  			if (!test(old, data))
> @@ -1360,7 +1387,7 @@ int insert_inode_locked4(struct inode *inode, unsigned long hashval,
>  		if (likely(!node)) {
>  			spin_lock(&inode->i_lock);
>  			inode->i_state |= I_NEW;
> -			hlist_add_head(&inode->i_hash, head);
> +			hlist_add_head_rcu(&inode->i_hash, head);
>  			spin_unlock(&inode->i_lock);
>  			spin_unlock(&inode_hash_lock);
>  			return 0;
> @@ -1646,10 +1673,8 @@ static void __wait_on_freeing_inode(struct inode *inode)
>  	wq = bit_waitqueue(&inode->i_state, __I_NEW);
>  	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
>  	spin_unlock(&inode->i_lock);
> -	spin_unlock(&inode_hash_lock);
>  	schedule();
>  	finish_wait(wq, &wait.wait);
> -	spin_lock(&inode_hash_lock);
>  }
> 
>  static __initdata unsigned long ihash_entries;
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-11-16 23:56 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-01  5:33 fs: inode freeing and hash lookup via RCU Dave Chinner
2010-11-01  5:33 ` [PATCH 1/3] fs: pull inode->i_lock up out of writeback_single_inode Dave Chinner
2010-11-01  5:33 ` [PATCH 2/3] fs: Use RCU freeing of inodes via SLAB_DESTROY_BY_RCU Dave Chinner
2010-11-01 15:31   ` Christoph Hellwig
2010-11-01  5:33 ` [PATCH 3/3] fs: rcu protect inode hash lookups Dave Chinner
2010-11-01  9:38   ` Eric Dumazet
2010-11-01 13:44     ` Dave Chinner
2010-11-01 15:29       ` Eric Dumazet
2010-11-02  0:01         ` Dave Chinner
2010-11-02  4:46           ` Eric Dumazet
2010-11-02 12:11           ` Paul E. McKenney
2010-11-16 23:56   ` Paul E. McKenney
