* [PATCHSET][RFC][CFT] parallel lookups
From: Al Viro @ 2016-04-16  0:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

	The thing appears to be working.  It's in vfs.git#work.lookups; the
last 5 commits are the infrastructure (fs/namei.c and fs/dcache.c; no changes
in fs/*/*) plus the actual switch to rwsem.

	The missing bits: down_write_killable() (a series introducing just
that has been posted; for now I've replaced the mutex_lock_killable()
calls with plain inode_lock() - they are not critical for any testing,
and as soon as down_write_killable() gets merged I'll switch those back),
the lockdep bits might need corrections, and right now it's lookups only.
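
For reference, once down_write_killable() lands, those call sites would
presumably regain killability via something like the helper below (a
sketch; inode_lock_killable() is a made-up name here, and it assumes the
i_mutex -> i_rwsem conversion from 15/15):

static inline int inode_lock_killable(struct inode *inode)
{
	/* 0 on success, -EINTR if interrupted by a fatal signal */
	return down_write_killable(&inode->i_rwsem);
}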

	I'm going to add readdir to the mix; the primitive added in this
series (d_alloc_parallel()) will need to be used in the dcache pre-seeding
paths, the ncpfs use of dentry_update_name_case() will need to be changed
to something less hacky, and the syscalls calling iterate_dir() will need
to switch to fdget_pos() (with FMODE_ATOMIC_POS set for directories as
well as regular files).  That last bit is needed for exclusion at the
struct file level - there's a bunch of cases where we maintain data
structures hanging off file->private_data, and those really need to be
serialized.  Besides, serializing the ->f_pos updates is needed for sane
semantics; right now we tend to use ->i_mutex for that, but it would be
easier to go for the same mechanism as for regular files.  With any luck
we'll have working parallel readdir in addition to parallel lookups in
this cycle as well.
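
A sketch of the iterate_dir() caller side (fdget_pos()/fdput_pos()
already exist; the new bit would be directories getting FMODE_ATOMIC_POS,
making fdget_pos() actually serialize for them):

	struct fd f = fdget_pos(fd);	/* locks ->f_pos for shared files */
	if (!f.file)
		return -EBADF;
	res = iterate_dir(f.file, &buf.ctx);	/* buf as in fs/readdir.c */
	fdput_pos(f);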

	The patchset is on top of switching getxattr to passing dentry and
inode separately; that part will see further changes (in particular, the
stuff agruen has posted lately), but the lookups queue proper cares only
about being able to move security_d_instantiate() to the point before the
dentry is attached to its inode.

1/15: security_d_instantiate(): move to the point prior to attaching dentry
to inode.  Depends on the getxattr changes; allows doing the "attach to
inode" and "add to dentry hash" parts without dropping ->d_lock in between.
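
In other words, by 10/15 __d_add() ends up doing, condensed (a sketch of
the end result - inode attach and rehash under a single ->d_lock hold;
see 10/15 for the full thing):

	spin_lock(&dentry->d_lock);
	if (inode) {
		unsigned add_flags = d_flags_for_inode(inode);
		__d_set_inode_and_type(dentry, inode, add_flags);
	}
	_d_rehash(dentry);	/* hash without dropping ->d_lock */
	spin_unlock(&dentry->d_lock);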

2/15 -- 8/15: preparations - stuff similar to what went in during the last
cycle; several places switched to lookup_one_len_unlocked(), a bunch of
direct manipulations of ->i_mutex replaced with the inode_lock() etc.
helpers.

kernfs: use lookup_one_len_unlocked().
configfs_detach_prep(): make sure that wait_mutex won't go away
ocfs2: don't open-code inode_lock/inode_unlock
orangefs: don't open-code inode_lock/inode_unlock
reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
reconnect_one(): use lookup_one_len_unlocked()
ovl_lookup_real(): use lookup_one_len_unlocked()

9/15: lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
Open-codes the real_lookup() call in lookup_slow() and moves the IS_DEADDIR
check upwards.

10/15: __d_add(): don't drop/regain ->d_lock
That's what 1/15 had been for; it might make sense to reorder this one
closer to it.

11/15 -- 14/15: the actual machinery for parallel lookups.  This stuff
could've been a single commit, along with the actual switch to rwsem and
the shared lock in lookup_slow(), but it's easier to review carved up like
that.  From the testing POV it's one chunk - it is bisect-safe, but the
added code really comes into play only after we go for the shared lock,
which happens in 15/15.  That's the core of the series.

beginning of transition to parallel lookups - marking in-lookup dentries
parallel lookups machinery, part 2
parallel lookups machinery, part 3
parallel lookups machinery, part 4 (and last)

15/15: parallel lookups: actual switch to rwsem

Note that filesystems are free to switch some of their own uses of
inode_lock() to grabbing it shared - it's really up to them.  This series
changes the locking only for directory lookups, but the field becomes an
rwsem for all inodes.  XFS folks in particular might be interested in
using it...
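
E.g. something like this becomes possible for read-only work (a sketch;
as posted, the inode_lock() helpers are write-lock only, so this would
open-code the rwsem calls):

	down_read(&inode->i_rwsem);	/* shared - excludes writers only */
	/* read-only work that used to take inode_lock() */
	up_read(&inode->i_rwsem);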

I'll post the individual patches in followups.  Again, this is also available
in vfs.git #work.lookups (head at e2d622a right now).  The thing survives
LTP and xfstests without regressions, but more testing would certainly be
appreciated.  So would review, of course.

* [PATCH 01/15] security_d_instantiate(): move to the point prior to attaching dentry to inode
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 32ceae3..e9de4d9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1772,11 +1772,11 @@ void d_instantiate(struct dentry *entry, struct inode * inode)
 {
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 	if (inode) {
+		security_d_instantiate(entry, inode);
 		spin_lock(&inode->i_lock);
 		__d_instantiate(entry, inode);
 		spin_unlock(&inode->i_lock);
 	}
-	security_d_instantiate(entry, inode);
 }
 EXPORT_SYMBOL(d_instantiate);
 
@@ -1793,6 +1793,7 @@ int d_instantiate_no_diralias(struct dentry *entry, struct inode *inode)
 {
 	BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
 
+	security_d_instantiate(entry, inode);
 	spin_lock(&inode->i_lock);
 	if (S_ISDIR(inode->i_mode) && !hlist_empty(&inode->i_dentry)) {
 		spin_unlock(&inode->i_lock);
@@ -1801,7 +1802,6 @@ int d_instantiate_no_diralias(struct dentry *entry, struct inode *inode)
 	}
 	__d_instantiate(entry, inode);
 	spin_unlock(&inode->i_lock);
-	security_d_instantiate(entry, inode);
 
 	return 0;
 }
@@ -1875,6 +1875,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
 		goto out_iput;
 	}
 
+	security_d_instantiate(tmp, inode);
 	spin_lock(&inode->i_lock);
 	res = __d_find_any_alias(inode);
 	if (res) {
@@ -1897,13 +1898,10 @@ static struct dentry *__d_obtain_alias(struct inode *inode, int disconnected)
 	hlist_bl_unlock(&tmp->d_sb->s_anon);
 	spin_unlock(&tmp->d_lock);
 	spin_unlock(&inode->i_lock);
-	security_d_instantiate(tmp, inode);
 
 	return tmp;
 
  out_iput:
-	if (res && !IS_ERR(res))
-		security_d_instantiate(res, inode);
 	iput(inode);
 	return res;
 }
@@ -2369,7 +2367,6 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 		__d_instantiate(dentry, inode);
 		spin_unlock(&inode->i_lock);
 	}
-	security_d_instantiate(dentry, inode);
 	d_rehash(dentry);
 }
 
@@ -2384,8 +2381,10 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 
 void d_add(struct dentry *entry, struct inode *inode)
 {
-	if (inode)
+	if (inode) {
+		security_d_instantiate(entry, inode);
 		spin_lock(&inode->i_lock);
+	}
 	__d_add(entry, inode);
 }
 EXPORT_SYMBOL(d_add);
@@ -2779,6 +2778,7 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 	if (!inode)
 		goto out;
 
+	security_d_instantiate(dentry, inode);
 	spin_lock(&inode->i_lock);
 	if (S_ISDIR(inode->i_mode)) {
 		struct dentry *new = __d_find_any_alias(inode);
@@ -2806,7 +2806,6 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 			} else {
 				__d_move(new, dentry, false);
 				write_sequnlock(&rename_lock);
-				security_d_instantiate(new, inode);
 			}
 			iput(inode);
 			return new;
-- 
2.8.0.rc3

* [PATCH 02/15] kernfs: use lookup_one_len_unlocked()
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/kernfs/mount.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index b67dbcc..e006d30 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -120,9 +120,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
 		kntmp = find_next_ancestor(kn, knparent);
 		if (WARN_ON(!kntmp))
 			return ERR_PTR(-EINVAL);
-		mutex_lock(&d_inode(dentry)->i_mutex);
-		dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
-		mutex_unlock(&d_inode(dentry)->i_mutex);
+		dtmp = lookup_one_len_unlocked(kntmp->name, dentry,
+					       strlen(kntmp->name));
 		dput(dentry);
 		if (IS_ERR(dtmp))
 			return dtmp;
-- 
2.8.0.rc3

* [PATCH 03/15] configfs_detach_prep(): make sure that wait_mutex won't go away
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Grab a reference to the dentry we got the sucker from, and return
that dentry via *wait, rather than just returning the address of
->i_mutex.  That keeps the dentry (and thus its inode) pinned while
configfs_rmdir() waits, and stops relying on there being a mutex to
point at - which there won't be once ->i_mutex becomes part of an
rwsem.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/configfs/dir.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index ea59c89..48929c4 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -494,7 +494,7 @@ out:
  * If there is an error, the caller will reset the flags via
  * configfs_detach_rollback().
  */
-static int configfs_detach_prep(struct dentry *dentry, struct mutex **wait_mutex)
+static int configfs_detach_prep(struct dentry *dentry, struct dentry **wait)
 {
 	struct configfs_dirent *parent_sd = dentry->d_fsdata;
 	struct configfs_dirent *sd;
@@ -515,8 +515,8 @@ static int configfs_detach_prep(struct dentry *dentry, struct mutex **wait_mutex
 		if (sd->s_type & CONFIGFS_USET_DEFAULT) {
 			/* Abort if racing with mkdir() */
 			if (sd->s_type & CONFIGFS_USET_IN_MKDIR) {
-				if (wait_mutex)
-					*wait_mutex = &d_inode(sd->s_dentry)->i_mutex;
+				if (wait)
+					*wait = dget(sd->s_dentry);
 				return -EAGAIN;
 			}
 
@@ -524,7 +524,7 @@ static int configfs_detach_prep(struct dentry *dentry, struct mutex **wait_mutex
 			 * Yup, recursive.  If there's a problem, blame
 			 * deep nesting of default_groups
 			 */
-			ret = configfs_detach_prep(sd->s_dentry, wait_mutex);
+			ret = configfs_detach_prep(sd->s_dentry, wait);
 			if (!ret)
 				continue;
 		} else
@@ -1458,7 +1458,7 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
 	 * the new link is temporarily attached
 	 */
 	do {
-		struct mutex *wait_mutex;
+		struct dentry *wait;
 
 		mutex_lock(&configfs_symlink_mutex);
 		spin_lock(&configfs_dirent_lock);
@@ -1469,7 +1469,7 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
 		 */
 		ret = sd->s_dependent_count ? -EBUSY : 0;
 		if (!ret) {
-			ret = configfs_detach_prep(dentry, &wait_mutex);
+			ret = configfs_detach_prep(dentry, &wait);
 			if (ret)
 				configfs_detach_rollback(dentry);
 		}
@@ -1483,8 +1483,9 @@ static int configfs_rmdir(struct inode *dir, struct dentry *dentry)
 			}
 
 			/* Wait until the racing operation terminates */
-			mutex_lock(wait_mutex);
-			mutex_unlock(wait_mutex);
+			inode_lock(d_inode(wait));
+			inode_unlock(d_inode(wait));
+			dput(wait);
 		}
 	} while (ret == -EAGAIN);
 
-- 
2.8.0.rc3

* [PATCH 04/15] ocfs2: don't open-code inode_lock/inode_unlock
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/ocfs2/aops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1581240..f048a33 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2311,7 +2311,7 @@ static void ocfs2_dio_end_io_write(struct inode *inode,
 	/* ocfs2_file_write_iter will get i_mutex, so we need not lock if we
 	 * are in that context. */
 	if (dwc->dw_writer_pid != task_pid_nr(current)) {
-		mutex_lock(&inode->i_mutex);
+		inode_lock(inode);
 		locked = 1;
 	}
 
@@ -2390,7 +2390,7 @@ out:
 		ocfs2_free_alloc_context(meta_ac);
 	ocfs2_run_deallocs(osb, &dealloc);
 	if (locked)
-		mutex_unlock(&inode->i_mutex);
+		inode_unlock(inode);
 	ocfs2_dio_free_write_ctx(inode, dwc);
 }
 
-- 
2.8.0.rc3

* [PATCH 05/15] orangefs: don't open-code inode_lock/inode_unlock
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/orangefs/file.c            | 4 ++--
 fs/orangefs/orangefs-kernel.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index ae92795..491e82c 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -445,7 +445,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 
 	gossip_debug(GOSSIP_FILE_DEBUG, "orangefs_file_write_iter\n");
 
-	mutex_lock(&file->f_mapping->host->i_mutex);
+	inode_lock(file->f_mapping->host);
 
 	/* Make sure generic_write_checks sees an up to date inode size. */
 	if (file->f_flags & O_APPEND) {
@@ -492,7 +492,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 
 out:
 
-	mutex_unlock(&file->f_mapping->host->i_mutex);
+	inode_unlock(file->f_mapping->host);
 	return rc;
 }
 
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index a9925e2..2281882 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -612,11 +612,11 @@ do {									\
 static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
-	mutex_lock(&inode->i_mutex);
+	inode_lock(inode);
 #endif
 	i_size_write(inode, i_size);
 #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
-	mutex_unlock(&inode->i_mutex);
+	inode_unlock(inode);
 #endif
 }
 
-- 
2.8.0.rc3

* [PATCH 06/15] reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

... and have it use inode_lock()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/reiserfs/ioctl.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c
index 036a1fc..f49afe7 100644
--- a/fs/reiserfs/ioctl.c
+++ b/fs/reiserfs/ioctl.c
@@ -187,7 +187,11 @@ int reiserfs_unpack(struct inode *inode, struct file *filp)
 	}
 
 	/* we need to make sure nobody is changing the file size beneath us */
-	reiserfs_mutex_lock_safe(&inode->i_mutex, inode->i_sb);
+{
+	int depth = reiserfs_write_unlock_nested(inode->i_sb);
+	inode_lock(inode);
+	reiserfs_write_lock_nested(inode->i_sb, depth);
+}
 
 	reiserfs_write_lock(inode->i_sb);
 
-- 
2.8.0.rc3

* [PATCH 07/15] reconnect_one(): use lookup_one_len_unlocked()
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

... and explain the non-obvious logic in the case where the lookup
yields a different dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/exportfs/expfs.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index c46f1a1..402c5ca 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -143,14 +143,18 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
 	if (err)
 		goto out_err;
 	dprintk("%s: found name: %s\n", __func__, nbuf);
-	inode_lock(parent->d_inode);
-	tmp = lookup_one_len(nbuf, parent, strlen(nbuf));
-	inode_unlock(parent->d_inode);
+	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
 	if (IS_ERR(tmp)) {
 		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
 		goto out_err;
 	}
 	if (tmp != dentry) {
+		/*
+		 * Somebody has renamed it since exportfs_get_name();
+		 * great, since it could've only been renamed if it
+		 * got looked up and thus connected, and it would
+		 * remain connected afterwards.  We are done.
+		 */
 		dput(tmp);
 		goto out_reconnected;
 	}
-- 
2.8.0.rc3

* [PATCH 08/15] ovl_lookup_real(): use lookup_one_len_unlocked()
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/overlayfs/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 14cab38..4c26225 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -378,9 +378,7 @@ static inline struct dentry *ovl_lookup_real(struct dentry *dir,
 {
 	struct dentry *dentry;
 
-	inode_lock(dir->d_inode);
-	dentry = lookup_one_len(name->name, dir, name->len);
-	inode_unlock(dir->d_inode);
+	dentry = lookup_one_len_unlocked(name->name, dir, name->len);
 
 	if (IS_ERR(dentry)) {
 		if (PTR_ERR(dentry) == -ENOENT)
-- 
2.8.0.rc3

* [PATCH 09/15] lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c0d551f..6fb33a7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1603,8 +1603,15 @@ static struct dentry *lookup_slow(const struct qstr *name,
 				  struct dentry *dir,
 				  unsigned int flags)
 {
-	struct dentry *dentry;
-	inode_lock(dir->d_inode);
+	struct dentry *dentry, *old;
+	struct inode *inode = dir->d_inode;
+
+	inode_lock(inode);
+	/* Don't go there if it's already dead */
+	if (unlikely(IS_DEADDIR(inode))) {
+		inode_unlock(inode);
+		return ERR_PTR(-ENOENT);
+	}
 	dentry = d_lookup(dir, name);
 	if (unlikely(dentry)) {
 		if ((dentry->d_flags & DCACHE_OP_REVALIDATE) &&
@@ -1618,17 +1625,21 @@ static struct dentry *lookup_slow(const struct qstr *name,
 			}
 		}
 		if (dentry) {
-			inode_unlock(dir->d_inode);
+			inode_unlock(inode);
 			return dentry;
 		}
 	}
 	dentry = d_alloc(dir, name);
 	if (unlikely(!dentry)) {
-		inode_unlock(dir->d_inode);
+		inode_unlock(inode);
 		return ERR_PTR(-ENOMEM);
 	}
-	dentry = lookup_real(dir->d_inode, dentry, flags);
-	inode_unlock(dir->d_inode);
+	old = inode->i_op->lookup(inode, dentry, flags);
+	if (unlikely(old)) {
+		dput(dentry);
+		dentry = old;
+	}
+	inode_unlock(inode);
 	return dentry;
 }
 
-- 
2.8.0.rc3

* [PATCH 10/15] __d_add(): don't drop/regain ->d_lock
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e9de4d9..33cad8a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2363,11 +2363,19 @@ EXPORT_SYMBOL(d_rehash);
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
+	spin_lock(&dentry->d_lock);
 	if (inode) {
-		__d_instantiate(dentry, inode);
+		unsigned add_flags = d_flags_for_inode(inode);
+		hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+		raw_write_seqcount_begin(&dentry->d_seq);
+		__d_set_inode_and_type(dentry, inode, add_flags);
+		raw_write_seqcount_end(&dentry->d_seq);
+		__fsnotify_d_instantiate(dentry);
+	}
+	_d_rehash(dentry);
+	spin_unlock(&dentry->d_lock);
+	if (inode)
 		spin_unlock(&inode->i_lock);
-	}
-	d_rehash(dentry);
 }
 
 /**
-- 
2.8.0.rc3

* [PATCH 11/15] beginning of transition to parallel lookups - marking in-lookup dentries
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

Dentries get marked as such when a (would-be) parallel lookup is about
to pass them to the actual ->lookup(); they get unmarked when
	* __d_add() is about to make the dentry hashed, positive or not.
	* __d_move() (from d_splice_alias(), directly or via
__d_unalias()) puts a preexisting dentry in its place
	* the caller of ->lookup() gets to it, if it has escaped all of
the above.  We complain (WARN_ON(), actually) if a dentry reaches the
final dput() or d_instantiate() while still marked.

As a result, we are guaranteed that for as long as the flag is
set, the dentry will
	* remain negative and unhashed, with a positive refcount
	* never have its ->d_alias looked at
	* never have its ->d_lru looked at
	* never have its ->d_parent and ->d_name changed

Right now we have at most one such dentry for any given parent
directory.  With parallel lookups that restriction will weaken to
	* such dentries existing only while the parent is locked shared
	* at most one of them with a given (parent,name) pair (names
compared according to ->d_compare())
	* none of them existing while there's a hashed dentry with the
same (parent,name)

The transition will take the next several commits; unfortunately, we'll
only be able to switch to the rwsem at the end of this series.  The
reason for not making it a single patch is to simplify review.
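
The lifecycle this commit wires up (see the namei.c hunk below):

	spin_lock(&dentry->d_lock);
	dentry->d_flags |= DCACHE_PAR_LOOKUP;	/* mark as in-lookup */
	spin_unlock(&dentry->d_lock);
	old = inode->i_op->lookup(inode, dentry, flags);
	d_not_in_lookup(dentry);	/* no-op if something has already
					   unmarked it */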

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c            | 12 ++++++++++++
 fs/namei.c             |  4 ++++
 include/linux/dcache.h | 13 +++++++++++++
 3 files changed, 29 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 33cad8a..5cea3cb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -761,6 +761,8 @@ repeat:
 	/* Slow case: now with the dentry lock held */
 	rcu_read_unlock();
 
+	WARN_ON(dentry->d_flags & DCACHE_PAR_LOOKUP);
+
 	/* Unreachable? Get rid of it */
 	if (unlikely(d_unhashed(dentry)))
 		goto kill_it;
@@ -1743,6 +1745,7 @@ type_determined:
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
 	unsigned add_flags = d_flags_for_inode(inode);
+	WARN_ON(dentry->d_flags & DCACHE_PAR_LOOKUP);
 
 	spin_lock(&dentry->d_lock);
 	hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
@@ -2358,12 +2361,19 @@ void d_rehash(struct dentry * entry)
 }
 EXPORT_SYMBOL(d_rehash);
 
+void __d_not_in_lookup(struct dentry *dentry)
+{
+	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
+	/* more stuff will land here */
+}
 
 /* inode->i_lock held if inode is non-NULL */
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
 	spin_lock(&dentry->d_lock);
+	if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP))
+		__d_not_in_lookup(dentry);
 	if (inode) {
 		unsigned add_flags = d_flags_for_inode(inode);
 		hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
@@ -2609,6 +2619,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	BUG_ON(d_ancestor(target, dentry));
 
 	dentry_lock_for_move(dentry, target);
+	if (unlikely(target->d_flags & DCACHE_PAR_LOOKUP))
+		__d_not_in_lookup(target);
 
 	write_seqcount_begin(&dentry->d_seq);
 	write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);
diff --git a/fs/namei.c b/fs/namei.c
index 6fb33a7..0ee8b9d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1634,7 +1634,11 @@ static struct dentry *lookup_slow(const struct qstr *name,
 		inode_unlock(inode);
 		return ERR_PTR(-ENOMEM);
 	}
+	spin_lock(&dentry->d_lock);
+	dentry->d_flags |= DCACHE_PAR_LOOKUP;
+	spin_unlock(&dentry->d_lock);
 	old = inode->i_op->lookup(inode, dentry, flags);
+	d_not_in_lookup(dentry);
 	if (unlikely(old)) {
 		dput(dentry);
 		dentry = old;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7cb043d..cfc1240 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -230,6 +230,8 @@ struct dentry_operations {
 
 #define DCACHE_ENCRYPTED_WITH_KEY	0x04000000 /* dir is encrypted with a valid key */
 
+#define DCACHE_PAR_LOOKUP		0x08000000 /* being looked up (with parent locked shared) */
+
 extern seqlock_t rename_lock;
 
 /*
@@ -365,6 +367,17 @@ static inline void dont_mount(struct dentry *dentry)
 	spin_unlock(&dentry->d_lock);
 }
 
+extern void __d_not_in_lookup(struct dentry *);
+
+static inline void d_not_in_lookup(struct dentry *dentry)
+{
+	if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP)) {
+		spin_lock(&dentry->d_lock);
+		__d_not_in_lookup(dentry);
+		spin_unlock(&dentry->d_lock);
+	}
+}
+
 extern void dput(struct dentry *);
 
 static inline bool d_managed(const struct dentry *dentry)
-- 
2.8.0.rc3

* [PATCH 12/15] parallel lookups machinery, part 2
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

We'll need to verify that there's neither a hashed nor an in-lookup
dentry with the desired parent/name before adding to the in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
given time, the dcache is not.  And holding the parent's ->d_lock
through something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we have to
watch out for the following scenario:
	* we verify that there's no hashed match
	* an existing in-lookup match gets hashed by another process
	* we verify that there are no in-lookup matches and decide
that everything's fine.

Solution: a per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of an in-lookup dentry.  Then the above would turn
into
	* read the counter
	* do the dcache lookup
	* if no match is found, check for in-lookup matches
	* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in the inode.  There is a spare word (shared with
i_bdev/i_cdev/i_pipe), so the counter part is not a problem, but a
spinlock is a different story.

We could use the parent's ->d_lock, and that would be less painful in
terms of contention, but for __d_add() it would be rather inconvenient
to grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() to go from odd back to
even.

This commit adds the counter and the updating logic; the readers will
be added in the next commit.
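
For reference, the reader side (landing in the next commit) will look
roughly like this:

retry:
	seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
	dentry = d_lookup(parent, name);	/* primary hash first */
	...
	hlist_bl_lock(b);
	smp_rmb();
	if (parent->d_inode->i_dir_seq != seq) {	/* counter moved */
		hlist_bl_unlock(b);
		goto retry;
	}
	/* now walk the in-lookup chain under the bitlock */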

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c        | 34 ++++++++++++++++++++++++++++++++--
 fs/inode.c         |  1 +
 include/linux/fs.h |  1 +
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 5cea3cb..3959f18 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2361,6 +2361,22 @@ void d_rehash(struct dentry * entry)
 }
 EXPORT_SYMBOL(d_rehash);
 
+static inline unsigned start_dir_add(struct inode *dir)
+{
+
+	for (;;) {
+		unsigned n = dir->i_dir_seq;
+		if (!(n & 1) && cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
+			return n;
+		cpu_relax();
+	}
+}
+
+static inline void end_dir_add(struct inode *dir, unsigned n)
+{
+	smp_store_release(&dir->i_dir_seq, n + 2);
+}
+
 void __d_not_in_lookup(struct dentry *dentry)
 {
 	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
@@ -2371,9 +2387,14 @@ void __d_not_in_lookup(struct dentry *dentry)
 
 static inline void __d_add(struct dentry *dentry, struct inode *inode)
 {
+	struct inode *dir = NULL;
+	unsigned n;
 	spin_lock(&dentry->d_lock);
-	if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP))
+	if (unlikely(dentry->d_flags & DCACHE_PAR_LOOKUP)) {
+		dir = dentry->d_parent->d_inode;
+		n = start_dir_add(dir);
 		__d_not_in_lookup(dentry);
+	}
 	if (inode) {
 		unsigned add_flags = d_flags_for_inode(inode);
 		hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
@@ -2383,6 +2404,8 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 		__fsnotify_d_instantiate(dentry);
 	}
 	_d_rehash(dentry);
+	if (dir)
+		end_dir_add(dir, n);
 	spin_unlock(&dentry->d_lock);
 	if (inode)
 		spin_unlock(&inode->i_lock);
@@ -2612,6 +2635,8 @@ static void dentry_unlock_for_move(struct dentry *dentry, struct dentry *target)
 static void __d_move(struct dentry *dentry, struct dentry *target,
 		     bool exchange)
 {
+	struct inode *dir = NULL;
+	unsigned n;
 	if (!dentry->d_inode)
 		printk(KERN_WARNING "VFS: moving negative dcache entry\n");
 
@@ -2619,8 +2644,11 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	BUG_ON(d_ancestor(target, dentry));
 
 	dentry_lock_for_move(dentry, target);
-	if (unlikely(target->d_flags & DCACHE_PAR_LOOKUP))
+	if (unlikely(target->d_flags & DCACHE_PAR_LOOKUP)) {
+		dir = target->d_parent->d_inode;
+		n = start_dir_add(dir);
 		__d_not_in_lookup(target);
+	}
 
 	write_seqcount_begin(&dentry->d_seq);
 	write_seqcount_begin_nested(&target->d_seq, DENTRY_D_LOCK_NESTED);
@@ -2670,6 +2698,8 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	write_seqcount_end(&target->d_seq);
 	write_seqcount_end(&dentry->d_seq);
 
+	if (dir)
+		end_dir_add(dir, n);
 	dentry_unlock_for_move(dentry, target);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 4202aac..4b884f7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -151,6 +151,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	inode->i_bdev = NULL;
 	inode->i_cdev = NULL;
 	inode->i_link = NULL;
+	inode->i_dir_seq = 0;
 	inode->i_rdev = 0;
 	inode->dirtied_when = 0;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1b5fcae..0a32045 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -684,6 +684,7 @@ struct inode {
 		struct block_device	*i_bdev;
 		struct cdev		*i_cdev;
 		char			*i_link;
+		unsigned		i_dir_seq;
 	};
 
 	__u32			i_generation;
-- 
2.8.0.rc3

* [PATCH 13/15] parallel lookups machinery, part 3
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

We will need to be able to check if there is an in-lookup
dentry with a matching parent/name.  Right now that's impossible,
but as soon as we start locking directories shared, such beasts
will appear.

Add a secondary hash for locating those.  Hash chains go through the
same space where d_alias will be once the dentry is not in-lookup
anymore.  Search is done under the same bitlock we use for
modifications - with the primary hash we can rely on d_rehash() into
the wrong chain being the worst that could happen, but here the
pointers are buggered once the dentry is removed from the chain.  On
the other hand, the chains are not going to be long, and normally
we'll end up adding to the chain anyway.  That allows us to avoid
bothering with ->d_lock when doing the comparisons - everything is
stable until removed from the chain.

New helper: d_alloc_parallel().  Right now it allocates, verifies
that no hashed or in-lookup match exists, and adds the new dentry
to the in-lookup hash.

Returns ERR_PTR() on error, the hashed match (in the unlikely case
one is found) or the new dentry.  In-lookup matches trigger BUG()
for now; that will change in the next commit, when we introduce
waiting for the ongoing lookup to finish.  Note that in-lookup
matches won't be possible until we actually go for shared locking.

lookup_slow() switched to use of d_alloc_parallel().
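
The caller pattern, i.e. what the lookup_slow() hunk below boils down to:

	dentry = d_alloc_parallel(dir, name);
	if (IS_ERR(dentry))
		return dentry;
	if (!(dentry->d_flags & DCACHE_PAR_LOOKUP))
		return dentry;		/* got a hashed match instead */
	/* new in-lookup dentry - feed it to ->lookup() */
	old = inode->i_op->lookup(inode, dentry, flags);
	d_not_in_lookup(dentry);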

Again, these commits are separated only to make review easier.
All this machinery will start doing something useful only when we
go for shared locking; it's just that the combination is too large
for my taste.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c            | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/namei.c             | 44 +++++++++++--------------
 include/linux/dcache.h |  2 ++
 3 files changed, 108 insertions(+), 25 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3959f18..0552002 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -111,6 +111,17 @@ static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
 	return dentry_hashtable + hash_32(hash, d_hash_shift);
 }
 
+#define IN_LOOKUP_SHIFT 10
+static struct hlist_bl_head in_lookup_hashtable[1 << IN_LOOKUP_SHIFT];
+
+static inline struct hlist_bl_head *in_lookup_hash(const struct dentry *parent,
+					unsigned int hash)
+{
+	hash += (unsigned long) parent / L1_CACHE_BYTES;
+	return in_lookup_hashtable + hash_32(hash, IN_LOOKUP_SHIFT);
+}
+
+
 /* Statistics gathering. */
 struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
@@ -2377,9 +2388,85 @@ static inline void end_dir_add(struct inode *dir, unsigned n)
 	smp_store_release(&dir->i_dir_seq, n + 2);
 }
 
+struct dentry *d_alloc_parallel(struct dentry *parent,
+				const struct qstr *name)
+{
+	unsigned int len = name->len;
+	unsigned int hash = name->hash;
+	const unsigned char *str = name->name;
+	struct hlist_bl_head *b = in_lookup_hash(parent, hash);
+	struct hlist_bl_node *node;
+	struct dentry *new = d_alloc(parent, name);
+	struct dentry *dentry;
+	unsigned seq;
+
+	if (unlikely(!new))
+		return ERR_PTR(-ENOMEM);
+
+retry:
+	seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
+	dentry = d_lookup(parent, name);
+	if (unlikely(dentry)) {
+		dput(new);
+		return dentry;
+	}
+
+	hlist_bl_lock(b);
+	smp_rmb();
+	if (unlikely(parent->d_inode->i_dir_seq != seq)) {
+		hlist_bl_unlock(b);
+		goto retry;
+	}
+	/*
+	 * No changes for the parent since the beginning of d_lookup().
+	 * Since all removals from the chain happen with hlist_bl_lock(),
+	 * any potential in-lookup matches are going to stay here until
+	 * we unlock the chain.  All fields are stable in everything
+	 * we encounter.
+	 */
+	hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
+		if (dentry->d_name.hash != hash)
+			continue;
+		if (dentry->d_parent != parent)
+			continue;
+		if (d_unhashed(dentry))
+			continue;
+		if (parent->d_flags & DCACHE_OP_COMPARE) {
+			int tlen = dentry->d_name.len;
+			const char *tname = dentry->d_name.name;
+			if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
+				continue;
+		} else {
+			if (dentry->d_name.len != len)
+				continue;
+			if (dentry_cmp(dentry, str, len))
+				continue;
+		}
+		dget(dentry);
+		hlist_bl_unlock(b);
+		/* impossible until we actually enable parallel lookups */
+		BUG();
+		/* and this will be "wait for it to stop being in-lookup" */
+		/* this one will be handled in the next commit */
+		dput(new);
+		return dentry;
+	}
+	/* we can't take ->d_lock here; it's OK, though. */
+	new->d_flags |= DCACHE_PAR_LOOKUP;
+	hlist_bl_add_head_rcu(&new->d_u.d_in_lookup_hash, b);
+	hlist_bl_unlock(b);
+	return new;
+}
+
 void __d_not_in_lookup(struct dentry *dentry)
 {
+	struct hlist_bl_head *b = in_lookup_hash(dentry->d_parent,
+						 dentry->d_name.hash);
+	hlist_bl_lock(b);
 	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
+	__hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
+	hlist_bl_unlock(b);
+	INIT_HLIST_NODE(&dentry->d_u.d_alias);
 	/* more stuff will land here */
 }
 
diff --git a/fs/namei.c b/fs/namei.c
index 0ee8b9d..fbce016 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1603,46 +1603,40 @@ static struct dentry *lookup_slow(const struct qstr *name,
 				  struct dentry *dir,
 				  unsigned int flags)
 {
-	struct dentry *dentry, *old;
+	struct dentry *dentry = ERR_PTR(-ENOENT), *old;
 	struct inode *inode = dir->d_inode;
 
 	inode_lock(inode);
 	/* Don't go there if it's already dead */
-	if (unlikely(IS_DEADDIR(inode))) {
-		inode_unlock(inode);
-		return ERR_PTR(-ENOENT);
-	}
-	dentry = d_lookup(dir, name);
-	if (unlikely(dentry)) {
+	if (unlikely(IS_DEADDIR(inode)))
+		goto out;
+again:
+	dentry = d_alloc_parallel(dir, name);
+	if (IS_ERR(dentry))
+		goto out;
+	if (unlikely(!(dentry->d_flags & DCACHE_PAR_LOOKUP))) {
 		if ((dentry->d_flags & DCACHE_OP_REVALIDATE) &&
 		    !(flags & LOOKUP_NO_REVAL)) {
 			int error = d_revalidate(dentry, flags);
 			if (unlikely(error <= 0)) {
-				if (!error)
+				if (!error) {
 					d_invalidate(dentry);
+					dput(dentry);
+					goto again;
+				}
 				dput(dentry);
 				dentry = ERR_PTR(error);
 			}
 		}
-		if (dentry) {
-			inode_unlock(inode);
-			return dentry;
+	} else {
+		old = inode->i_op->lookup(inode, dentry, flags);
+		d_not_in_lookup(dentry);
+		if (unlikely(old)) {
+			dput(dentry);
+			dentry = old;
 		}
 	}
-	dentry = d_alloc(dir, name);
-	if (unlikely(!dentry)) {
-		inode_unlock(inode);
-		return ERR_PTR(-ENOMEM);
-	}
-	spin_lock(&dentry->d_lock);
-	dentry->d_flags |= DCACHE_PAR_LOOKUP;
-	spin_unlock(&dentry->d_lock);
-	old = inode->i_op->lookup(inode, dentry, flags);
-	d_not_in_lookup(dentry);
-	if (unlikely(old)) {
-		dput(dentry);
-		dentry = old;
-	}
+out:
 	inode_unlock(inode);
 	return dentry;
 }
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index cfc1240..3ab5ce4 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -131,6 +131,7 @@ struct dentry {
 	 */
 	union {
 		struct hlist_node d_alias;	/* inode alias list */
+		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
 	 	struct rcu_head d_rcu;
 	} d_u;
 };
@@ -248,6 +249,7 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
 /* allocate/de-allocate */
 extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
+extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
 extern struct dentry * d_exact_alias(struct dentry *, struct inode *);
-- 
2.8.0.rc3

* [PATCH 14/15] parallel lookups machinery, part 4 (and last)
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

If we *do* run into an in-lookup match, we need to wait for it to
cease being in-lookup.  Fortunately, we do have unused space in
in-lookup dentries - d_lru is never looked at until the dentry stops
being in-lookup.

So we can stash a pointer to a wait_queue_head living in the stack
frame of the caller of ->lookup().  Some precautions are needed while
waiting, but it's not that hard - we do hold a reference to the dentry
we are waiting for, so it can't go away.  If it's found to be
in-lookup, the wait_queue_head is still alive and will remain so at
least while ->d_lock is held.  Moreover, the condition we are waiting
for becomes true at the same point where everything on that wq gets
woken up, so we can just add ourselves to the queue once.

d_alloc_parallel() gets a pointer to the wait_queue_head_t from its
caller; lookup_slow() is adjusted, and d_add_ci() is taught to use
d_alloc_parallel() if the dentry passed to it happens to be an
in-lookup one (i.e. if it's been called from a parallel lookup).

That's pretty much it - all that remains is to switch ->i_mutex
to rwsem and have lookup_slow() take it shared.
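
Caller side, for reference (this is what lookup_slow() below does):

	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

	inode_lock(inode);
	...
	dentry = d_alloc_parallel(dir, name, &wq);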

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c            | 94 +++++++++++++++++++++++++++++++++++++++-----------
 fs/namei.c             |  3 +-
 include/linux/dcache.h |  8 +++--
 3 files changed, 82 insertions(+), 23 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0552002..5965588 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1984,28 +1984,36 @@ EXPORT_SYMBOL(d_obtain_root);
 struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 			struct qstr *name)
 {
-	struct dentry *found;
-	struct dentry *new;
+	struct dentry *found, *res;
 
 	/*
 	 * First check if a dentry matching the name already exists,
 	 * if not go ahead and create it now.
 	 */
 	found = d_hash_and_lookup(dentry->d_parent, name);
-	if (!found) {
-		new = d_alloc(dentry->d_parent, name);
-		if (!new) {
-			found = ERR_PTR(-ENOMEM);
-		} else {
-			found = d_splice_alias(inode, new);
-			if (found) {
-				dput(new);
-				return found;
-			}
-			return new;
+	if (found) {
+		iput(inode);
+		return found;
+	}
+	if (dentry->d_flags & DCACHE_PAR_LOOKUP) {
+		found = d_alloc_parallel(dentry->d_parent, name,
+					dentry->d_wait);
+		if (IS_ERR(found) || !(found->d_flags & DCACHE_PAR_LOOKUP)) {
+			iput(inode);
+			return found;
 		}
+	} else {
+		found = d_alloc(dentry->d_parent, name);
+		if (!found) {
+			iput(inode);
+			return ERR_PTR(-ENOMEM);
+		} 
+	}
+	res = d_splice_alias(inode, found);
+	if (res) {
+		dput(found);
+		return res;
 	}
-	iput(inode);
 	return found;
 }
 EXPORT_SYMBOL(d_add_ci);
@@ -2388,8 +2396,23 @@ static inline void end_dir_add(struct inode *dir, unsigned n)
 	smp_store_release(&dir->i_dir_seq, n + 2);
 }
 
+static void d_wait_lookup(struct dentry *dentry)
+{
+	if (dentry->d_flags & DCACHE_PAR_LOOKUP) {
+		DECLARE_WAITQUEUE(wait, current);
+		add_wait_queue(dentry->d_wait, &wait);
+		do {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			spin_unlock(&dentry->d_lock);
+			schedule();
+			spin_lock(&dentry->d_lock);
+		} while (dentry->d_flags & DCACHE_PAR_LOOKUP);
+	}
+}
+
 struct dentry *d_alloc_parallel(struct dentry *parent,
-				const struct qstr *name)
+				const struct qstr *name,
+				wait_queue_head_t *wq)
 {
 	unsigned int len = name->len;
 	unsigned int hash = name->hash;
@@ -2444,18 +2467,47 @@ retry:
 		}
 		dget(dentry);
 		hlist_bl_unlock(b);
-		/* impossible until we actually enable parallel lookups */
-		BUG();
-		/* and this will be "wait for it to stop being in-lookup" */
-		/* this one will be handled in the next commit */
+		/* somebody is doing lookup for it right now; wait for it */
+		spin_lock(&dentry->d_lock);
+		d_wait_lookup(dentry);
+		/*
+		 * it's not in-lookup anymore; in principle we should repeat
+		 * everything from dcache lookup, but it's likely to be what
+		 * d_lookup() would've found anyway.  If it is, just return it;
+		 * otherwise we really have to repeat the whole thing.
+		 */
+		if (unlikely(dentry->d_name.hash != hash))
+			goto mismatch;
+		if (unlikely(dentry->d_parent != parent))
+			goto mismatch;
+		if (unlikely(d_unhashed(dentry)))
+			goto mismatch;
+		if (parent->d_flags & DCACHE_OP_COMPARE) {
+			int tlen = dentry->d_name.len;
+			const char *tname = dentry->d_name.name;
+			if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
+				goto mismatch;
+		} else {
+			if (unlikely(dentry->d_name.len != len))
+				goto mismatch;
+			if (unlikely(dentry_cmp(dentry, str, len)))
+				goto mismatch;
+		}
+		/* OK, it *is* a hashed match; return it */
+		spin_unlock(&dentry->d_lock);
 		dput(new);
 		return dentry;
 	}
 	/* we can't take ->d_lock here; it's OK, though. */
 	new->d_flags |= DCACHE_PAR_LOOKUP;
+	new->d_wait = wq;
 	hlist_bl_add_head_rcu(&new->d_u.d_in_lookup_hash, b);
 	hlist_bl_unlock(b);
 	return new;
+mismatch:
+	spin_unlock(&dentry->d_lock);
+	dput(dentry);
+	goto retry;
 }
 
 void __d_not_in_lookup(struct dentry *dentry)
@@ -2465,9 +2517,11 @@ void __d_not_in_lookup(struct dentry *dentry)
 	hlist_bl_lock(b);
 	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
 	__hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
+	wake_up_all(dentry->d_wait);
+	dentry->d_wait = NULL;
 	hlist_bl_unlock(b);
 	INIT_HLIST_NODE(&dentry->d_u.d_alias);
-	/* more stuff will land here */
+	INIT_LIST_HEAD(&dentry->d_lru);
 }
 
 /* inode->i_lock held if inode is non-NULL */
diff --git a/fs/namei.c b/fs/namei.c
index fbce016..eb879d6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1605,13 +1605,14 @@ static struct dentry *lookup_slow(const struct qstr *name,
 {
 	struct dentry *dentry = ERR_PTR(-ENOENT), *old;
 	struct inode *inode = dir->d_inode;
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	inode_lock(inode);
 	/* Don't go there if it's already dead */
 	if (unlikely(IS_DEADDIR(inode)))
 		goto out;
 again:
-	dentry = d_alloc_parallel(dir, name);
+	dentry = d_alloc_parallel(dir, name, &wq);
 	if (IS_ERR(dentry))
 		goto out;
 	if (unlikely(!(dentry->d_flags & DCACHE_PAR_LOOKUP))) {
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3ab5ce4..bb4b44f 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -123,7 +123,10 @@ struct dentry {
 	unsigned long d_time;		/* used by d_revalidate */
 	void *d_fsdata;			/* fs-specific data */
 
-	struct list_head d_lru;		/* LRU list */
+	union {
+		struct list_head d_lru;		/* LRU list */
+		wait_queue_head_t *d_wait;	/* in-lookup ones only */
+	};
 	struct list_head d_child;	/* child of parent list */
 	struct list_head d_subdirs;	/* our children */
 	/*
@@ -249,7 +252,8 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
 /* allocate/de-allocate */
 extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
-extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
+extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *,
+					wait_queue_head_t *);
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
 extern struct dentry * d_exact_alias(struct dentry *, struct inode *);
-- 
2.8.0.rc3

* [PATCH 15/15] parallel lookups: actual switch to rwsem
From: Al Viro @ 2016-04-16  0:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

From: Al Viro <viro@zeniv.linux.org.uk>

ta-da!

The main issue is the lack of down_write_killable(), so the places
like readdir.c have switched to plain inode_lock(); once killable
variants of the rwsem primitives appear, that will be dealt with.

The lockdep side also might need more work.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/btrfs/ioctl.c       | 16 ++++++++++------
 fs/configfs/inode.c    |  2 +-
 fs/dcache.c            |  9 +++++----
 fs/gfs2/ops_fstype.c   |  2 +-
 fs/inode.c             | 12 ++++++------
 fs/namei.c             |  4 ++--
 fs/ocfs2/inode.c       |  2 +-
 fs/overlayfs/readdir.c |  4 +++-
 fs/readdir.c           |  7 ++++---
 include/linux/fs.h     | 12 ++++++------
 10 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 053e677..db1e830 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -837,9 +837,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	struct dentry *dentry;
 	int error;
 
-	error = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
-	if (error == -EINTR)
-		return error;
+	inode_lock_nested(dir, I_MUTEX_PARENT);
+	// XXX: should've been
+	// mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
+	// if (error == -EINTR)
+	//	return error;
 
 	dentry = lookup_one_len(name, parent->dentry, namelen);
 	error = PTR_ERR(dentry);
@@ -2366,9 +2368,11 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 		goto out;
 
 
-	err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
-	if (err == -EINTR)
-		goto out_drop_write;
+	inode_lock_nested(dir, I_MUTEX_PARENT);
+	// XXX: should've been
+	// err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
+	// if (err == -EINTR)
+	//	goto out_drop_write;
 	dentry = lookup_one_len(vol_args->name, parent, namelen);
 	if (IS_ERR(dentry)) {
 		err = PTR_ERR(dentry);
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 03d124a..0387968 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -156,7 +156,7 @@ static void configfs_set_inode_lock_class(struct configfs_dirent *sd,
 
 	if (depth > 0) {
 		if (depth <= ARRAY_SIZE(default_group_class)) {
-			lockdep_set_class(&inode->i_mutex,
+			lockdep_set_class(&inode->i_rwsem,
 					  &default_group_class[depth - 1]);
 		} else {
 			/*
diff --git a/fs/dcache.c b/fs/dcache.c
index 5965588..d110040 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2911,7 +2911,8 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
 static int __d_unalias(struct inode *inode,
 		struct dentry *dentry, struct dentry *alias)
 {
-	struct mutex *m1 = NULL, *m2 = NULL;
+	struct mutex *m1 = NULL;
+	struct rw_semaphore *m2 = NULL;
 	int ret = -ESTALE;
 
 	/* If alias and dentry share a parent, then no extra locks required */
@@ -2922,15 +2923,15 @@ static int __d_unalias(struct inode *inode,
 	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
 		goto out_err;
 	m1 = &dentry->d_sb->s_vfs_rename_mutex;
-	if (!inode_trylock(alias->d_parent->d_inode))
+	if (!down_read_trylock(&alias->d_parent->d_inode->i_rwsem))
 		goto out_err;
-	m2 = &alias->d_parent->d_inode->i_mutex;
+	m2 = &alias->d_parent->d_inode->i_rwsem;
 out_unalias:
 	__d_move(alias, dentry, false);
 	ret = 0;
 out_err:
 	if (m2)
-		mutex_unlock(m2);
+		up_read(m2);
 	if (m1)
 		mutex_unlock(m1);
 	return ret;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index c09c63d..4546360 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -824,7 +824,7 @@ static int init_inodes(struct gfs2_sbd *sdp, int undo)
 	 * i_mutex on quota files is special. Since this inode is hidden system
 	 * file, we are safe to define locking ourselves.
 	 */
-	lockdep_set_class(&sdp->sd_quota_inode->i_mutex,
+	lockdep_set_class(&sdp->sd_quota_inode->i_rwsem,
 			  &gfs2_quota_imutex_key);
 
 	error = gfs2_rindex_update(sdp);
diff --git a/fs/inode.c b/fs/inode.c
index 4b884f7..4ccbc21 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -166,8 +166,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 	spin_lock_init(&inode->i_lock);
 	lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
 
-	mutex_init(&inode->i_mutex);
-	lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
+	init_rwsem(&inode->i_rwsem);
+	lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
 
 	atomic_set(&inode->i_dio_count, 0);
 
@@ -925,13 +925,13 @@ void lockdep_annotate_inode_mutex_key(struct inode *inode)
 		struct file_system_type *type = inode->i_sb->s_type;
 
 		/* Set new key only if filesystem hasn't already changed it */
-		if (lockdep_match_class(&inode->i_mutex, &type->i_mutex_key)) {
+		if (lockdep_match_class(&inode->i_rwsem, &type->i_mutex_key)) {
 			/*
 			 * ensure nobody is actually holding i_mutex
 			 */
-			mutex_destroy(&inode->i_mutex);
-			mutex_init(&inode->i_mutex);
-			lockdep_set_class(&inode->i_mutex,
+			// mutex_destroy(&inode->i_mutex);
+			init_rwsem(&inode->i_rwsem);
+			lockdep_set_class(&inode->i_rwsem,
 					  &type->i_mutex_dir_key);
 		}
 	}
diff --git a/fs/namei.c b/fs/namei.c
index eb879d6..877e9ef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1607,7 +1607,7 @@ static struct dentry *lookup_slow(const struct qstr *name,
 	struct inode *inode = dir->d_inode;
 	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
-	inode_lock(inode);
+	down_read(&inode->i_rwsem);
 	/* Don't go there if it's already dead */
 	if (unlikely(IS_DEADDIR(inode)))
 		goto out;
@@ -1638,7 +1638,7 @@ again:
 		}
 	}
 out:
-	inode_unlock(inode);
+	up_read(&inode->i_rwsem);
 	return dentry;
 }
 
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 12f4a9e..0748777 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -262,7 +262,7 @@ static int ocfs2_init_locked_inode(struct inode *inode, void *opaque)
 	inode->i_ino = args->fi_ino;
 	OCFS2_I(inode)->ip_blkno = args->fi_blkno;
 	if (args->fi_sysfile_type != 0)
-		lockdep_set_class(&inode->i_mutex,
+		lockdep_set_class(&inode->i_rwsem,
 			&ocfs2_sysfile_lock_key[args->fi_sysfile_type]);
 	if (args->fi_sysfile_type == USER_QUOTA_SYSTEM_INODE ||
 	    args->fi_sysfile_type == GROUP_QUOTA_SYSTEM_INODE ||
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 6ec1e43..da186ee 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -218,7 +218,9 @@ static int ovl_check_whiteouts(struct dentry *dir, struct ovl_readdir_data *rdd)
 	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
 	old_cred = override_creds(override_cred);
 
-	err = mutex_lock_killable(&dir->d_inode->i_mutex);
+	inode_lock(dir->d_inode);
+	err = 0;
+	// XXX: err = mutex_lock_killable(&dir->d_inode->i_mutex);
 	if (!err) {
 		while (rdd->first_maybe_whiteout) {
 			p = rdd->first_maybe_whiteout;
diff --git a/fs/readdir.c b/fs/readdir.c
index e69ef3b..bf583e8 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -32,9 +32,10 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
 	if (res)
 		goto out;
 
-	res = mutex_lock_killable(&inode->i_mutex);
-	if (res)
-		goto out;
+	inode_lock(inode);
+	// res = mutex_lock_killable(&inode->i_mutex);
+	// if (res)
+	//	goto out;
 
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0a32045..313ad28 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -647,7 +647,7 @@ struct inode {
 
 	/* Misc */
 	unsigned long		i_state;
-	struct mutex		i_mutex;
+	struct rw_semaphore	i_rwsem;
 
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 	unsigned long		dirtied_time_when;
@@ -734,27 +734,27 @@ enum inode_i_mutex_lock_class
 
 static inline void inode_lock(struct inode *inode)
 {
-	mutex_lock(&inode->i_mutex);
+	down_write(&inode->i_rwsem);
 }
 
 static inline void inode_unlock(struct inode *inode)
 {
-	mutex_unlock(&inode->i_mutex);
+	up_write(&inode->i_rwsem);
 }
 
 static inline int inode_trylock(struct inode *inode)
 {
-	return mutex_trylock(&inode->i_mutex);
+	return down_write_trylock(&inode->i_rwsem);
 }
 
 static inline int inode_is_locked(struct inode *inode)
 {
-	return mutex_is_locked(&inode->i_mutex);
+	return rwsem_is_locked(&inode->i_rwsem);
 }
 
 static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
 {
-	mutex_lock_nested(&inode->i_mutex, subclass);
+	down_write_nested(&inode->i_rwsem, subclass);
 }
 
 void lock_two_nondirectories(struct inode *, struct inode*);
-- 
2.8.0.rc3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCHSET][RFC][CFT] parallel lookups
  2016-04-16  0:52 [PATCHSET][RFC][CFT] parallel lookups Al Viro
                   ` (14 preceding siblings ...)
  2016-04-16  0:55 ` [PATCH 15/15] parallel lookups: actual switch to rwsem Al Viro
@ 2016-04-16  3:02 ` Andreas Dilger
  2016-04-16  3:27   ` Al Viro
  15 siblings, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2016-04-16  3:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 6609 bytes --]

On Apr 15, 2016, at 6:52 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> 	The thing appears to be working.  It's in vfs.git#work.lookups; the
> last 5 commits are the infrastructure (fs/namei.c and fs/dcache.c; no changes
> in fs/*/*) + actual switch to rwsem.
> 
> 	The missing bits: down_write_killable() (there had been a series
> posted introducing just that; for now I've replaced mutex_lock_killable()
> calls with plain inode_lock() - they are not critical for any testing and
> as soon as down_write_killable() gets there I'll replace those), lockdep
> bits might need corrections and right now it's only for lookups.
> 
> 	I'm going to add readdir to the mix; the primitive added in this
> series (d_alloc_parallel()) will need to be used in dcache pre-seeding
> paths, ncpfs use of dentry_update_name_case() will need to be changed to
> something less hacky and syscalls calling iterate_dir() will need to
> switch to fdget_pos() (with FMODE_ATOMIC_POS set for directories as well
> as regulars).  The last bit is needed for exclusion on struct file
> level - there's a bunch of cases where we maintain data structures
> hanging off file->private and those really need to be serialized.  Besides,
> serializing ->f_pos updates is needed for sane semantics; right now we
> tend to use ->i_mutex for that, but it would be easier to go for the same
> mechanism as for regular files.  With any luck we'll have working parallel
> readdir in addition to parallel lookups in this cycle as well.
> 
> 	The patchset is on top of switching getxattr to passing dentry and
> inode separately; that part will get changes (in particular, the stuff
> agruen has posted lately), but the lookups queue proper cares only about
> being able to move security_d_instantiate() to the point before dentry
> is attached to inode.
> 
> 1/15: security_d_instantiate(): move to the point prior to attaching dentry
> to inode.  Depends on getxattr changes, allows to do the "attach to inode"
> and "add to dentry hash" parts without dropping ->d_lock in between.
> 
> 2/15 -- 8/15: preparations - stuff similar to what went in during the last
> cycle; several places switched to lookup_one_len_unlocked(), a bunch of
> direct manipulations of ->i_mutex replaced with inode_lock, etc. helpers.
> 
> kernfs: use lookup_one_len_unlocked().
> configfs_detach_prep(): make sure that wait_mutex won't go away
> ocfs2: don't open-code inode_lock/inode_unlock
> orangefs: don't open-code inode_lock/inode_unlock
> reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
> reconnect_one(): use lookup_one_len_unlocked()
> ovl_lookup_real(): use lookup_one_len_unlocked()
> 
> 9/15: lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
> open-code real_lookup() call in lookup_slow(), move IS_DEADDIR check upwards.
> 
> 10/15: __d_add(): don't drop/regain ->d_lock
> that's what 1/15 had been for; might make sense to reorder closer to it.
> 
> 11/15 -- 14/15: actual machinery for parallel lookups.  This stuff could've
> been a single commit, along with the actual switch to rwsem and shared lock
> in lookup_slow(), but it's easier to review if carved up like that.  From the
> testing POV it's one chunk - it is bisect-safe, but the added code really
> comes into play only after we go for shared lock, which happens in 15/15.
> That's the core of the series.
> 
> beginning of transition to parallel lookups - marking in-lookup dentries
> parallel lookups machinery, part 2
> parallel lookups machinery, part 3
> parallel lookups machinery, part 4 (and last)
> 
> 15/15: parallel lookups: actual switch to rwsem
> 
> Note that filesystems would be free to switch some of their own uses of
> inode_lock() to grabbing it shared - it's really up to them.  This series
> works only with directories locking, but this field has become an rwsem
> for all inodes.  XFS folks in particular might be interested in using it...

Looks very interesting, and long awaited.  How do you see the parallel
operations moving forward?  Staying as lookup only, or moving on to parallel
modifications as well?

We've been carrying an out-of-tree patch for ext4 for several years to allow
parallel create/unlink for directory entries*, as I discussed a few times with
you in the past.  It is still a bit heavyweight for doing read-only lookups,
but after this patch series it might finally be interesting to merge into ext4,
in the hope that the VFS might allow parallel directory changes in the future?
We can already do this on a Lustre server, and it would be nice to be able to
do so on the client, since the files may even be on different servers (hashed by
name at the client to decide which server to contact) and network latency during
parallel file creates (one thread per CPU core, which is getting into the
low hundreds these days) is a much bigger deal than for local filesystems.

The actual inode_*lock() handling would need to be delegated to the filesystems,
with the VFS just using i_rwsem if the filesystem has no ->inode_lock() method,
but calling the inode_lock() method of the filesystem if available, so that
they can do parallel create/unlink for files and directories (though not
rename, since that way lies insanity).
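
To illustrate the delegation idea, a minimal sketch of the VFS side,
assuming hypothetical ->inode_lock()/->inode_unlock() methods - nothing
like that exists in the VFS today, the names and placement are
illustrative only:

	/* hypothetical reshaping of the helpers from 15/15 - not real code */
	static inline void inode_lock(struct inode *inode)
	{
		if (unlikely(inode->i_op->inode_lock))	/* hypothetical method */
			inode->i_op->inode_lock(inode);
		else
			down_write(&inode->i_rwsem);
	}

	static inline void inode_unlock(struct inode *inode)
	{
		if (unlikely(inode->i_op->inode_unlock))	/* hypothetical method */
			inode->i_op->inode_unlock(inode);
		else
			up_write(&inode->i_rwsem);
	}

A filesystem that can lock individual directory blocks would supply those
methods; everyone else would keep getting i_rwsem semantics for free.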

We can discuss more at LSF, but now that you posted your patch series, I'm
curious. :-)

Cheers, Andreas



[*] Patch not usable as-is since it has no VFS interface and is only up to
date for 3.12; it is just linked here in case of interest.  Essentially, it
creates a lock tree for the ext4 htree directory, and opportunistically locks
only the leaf blocks of the directory, on the expectation that ~50-60 entries
will be added to each leaf before it is split and the code has to back out to
write-lock the parent htree block.  The larger the directory, the more leaf
and parent blocks can be locked concurrently, and the better scaling gets.

http://git.hpdd.intel.com/fs/lustre-release.git/blob/HEAD:/ldiskfs/kernel_patches/patches/sles12/ext4-pdirop.patch
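
In rough pseudo-C, the opportunistic part of that scheme is (sketch only;
every name below is made up for illustration and does not match the actual
patch):

	/* Insert one entry, write-locking only the target leaf block;
	 * back out to the parent index block only when a split is due. */
	static int pdirop_add_entry(struct dir_ctx *dir, struct qstr *name,
				    struct inode *inode)
	{
		struct htree_block *leaf = htree_find_leaf(dir, name);

		htree_write_lock(leaf);
		if (htree_leaf_has_room(leaf)) {
			/* common case: ~50-60 inserts fit into a leaf
			 * between splits, so creates landing in different
			 * leaves never contend */
			htree_insert(leaf, name, inode);
			htree_write_unlock(leaf);
			return 0;
		}
		/* rare case: drop the leaf lock, take the parent index
		 * block lock as well (parent before leaf, to keep lock
		 * ordering sane) and do the split; a real implementation
		 * has to revalidate the leaf after re-locking */
		htree_write_unlock(leaf);
		htree_write_lock(leaf->parent);
		htree_write_lock(leaf);
		htree_split_and_insert(dir, leaf, name, inode);
		htree_write_unlock(leaf);
		htree_write_unlock(leaf->parent);
		return 0;
	}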

> I'll post the individual patches in followups.  Again, this is also available
> in vfs.git #work.lookups (head at e2d622a right now).  The thing survives
> LTP and xfstests without regressions, but more testing would certainly be
> appreciated.  So would review, of course.
[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 15/15] parallel lookups: actual switch to rwsem
  2016-04-16  0:55 ` [PATCH 15/15] parallel lookups: actual switch to rwsem Al Viro
@ 2016-04-16  3:02   ` Andreas Dilger
  2016-04-16  3:31     ` Al Viro
  0 siblings, 1 reply; 24+ messages in thread
From: Andreas Dilger @ 2016-04-16  3:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 9857 bytes --]

On Apr 15, 2016, at 6:55 PM, Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> ta-da!
> 
> The main issue is the lack of down_write_killable(), so the places
> like readdir.c switched to plain inode_lock(); once killable
> variants of rwsem primitives appear, that'll be dealt with.
> 
> lockdep side also might need more work
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> fs/btrfs/ioctl.c       | 16 ++++++++++------
> fs/configfs/inode.c    |  2 +-
> fs/dcache.c            |  9 +++++----
> fs/gfs2/ops_fstype.c   |  2 +-
> fs/inode.c             | 12 ++++++------
> fs/namei.c             |  4 ++--
> fs/ocfs2/inode.c       |  2 +-
> fs/overlayfs/readdir.c |  4 +++-
> fs/readdir.c           |  7 ++++---
> include/linux/fs.h     | 12 ++++++------
> 10 files changed, 39 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 053e677..db1e830 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -837,9 +837,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
> 	struct dentry *dentry;
> 	int error;
> 
> -	error = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> -	if (error == -EINTR)
> -		return error;
> +	inode_lock_nested(dir, I_MUTEX_PARENT);
> +	// XXX: should've been
> +	// mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> +	// if (error == -EINTR)
> +	//	return error;
> 
> 	dentry = lookup_one_len(name, parent->dentry, namelen);
> 	error = PTR_ERR(dentry);
> @@ -2366,9 +2368,11 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
> 		goto out;
> 
> 
> -	err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> -	if (err == -EINTR)
> -		goto out_drop_write;
> +	inode_lock_nested(dir, I_MUTEX_PARENT);
> +	// XXX: should've been
> +	// err = mutex_lock_killable_nested(&dir->i_mutex, I_MUTEX_PARENT);
> +	// if (err == -EINTR)
> +	//	goto out_drop_write;
> 	dentry = lookup_one_len(vol_args->name, parent, namelen);
> 	if (IS_ERR(dentry)) {
> 		err = PTR_ERR(dentry);
> diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
> index 03d124a..0387968 100644
> --- a/fs/configfs/inode.c
> +++ b/fs/configfs/inode.c
> @@ -156,7 +156,7 @@ static void configfs_set_inode_lock_class(struct configfs_dirent *sd,
> 
> 	if (depth > 0) {
> 		if (depth <= ARRAY_SIZE(default_group_class)) {
> -			lockdep_set_class(&inode->i_mutex,
> +			lockdep_set_class(&inode->i_rwsem,
> 					  &default_group_class[depth - 1]);
> 		} else {
> 			/*
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 5965588..d110040 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2911,7 +2911,8 @@ struct dentry *d_ancestor(struct dentry *p1, struct dentry *p2)
> static int __d_unalias(struct inode *inode,
> 		struct dentry *dentry, struct dentry *alias)
> {
> -	struct mutex *m1 = NULL, *m2 = NULL;
> +	struct mutex *m1 = NULL;
> +	struct rw_semaphore *m2 = NULL;
> 	int ret = -ESTALE;
> 
> 	/* If alias and dentry share a parent, then no extra locks required */
> @@ -2922,15 +2923,15 @@ static int __d_unalias(struct inode *inode,
> 	if (!mutex_trylock(&dentry->d_sb->s_vfs_rename_mutex))
> 		goto out_err;
> 	m1 = &dentry->d_sb->s_vfs_rename_mutex;
> -	if (!inode_trylock(alias->d_parent->d_inode))
> +	if (!down_read_trylock(&alias->d_parent->d_inode->i_rwsem))
> 		goto out_err;
> -	m2 = &alias->d_parent->d_inode->i_mutex;
> +	m2 = &alias->d_parent->d_inode->i_rwsem;
> out_unalias:
> 	__d_move(alias, dentry, false);
> 	ret = 0;
> out_err:
> 	if (m2)
> -		mutex_unlock(m2);
> +		up_read(m2);
> 	if (m1)
> 		mutex_unlock(m1);
> 	return ret;
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index c09c63d..4546360 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -824,7 +824,7 @@ static int init_inodes(struct gfs2_sbd *sdp, int undo)
> 	 * i_mutex on quota files is special. Since this inode is hidden system
> 	 * file, we are safe to define locking ourselves.
> 	 */
> -	lockdep_set_class(&sdp->sd_quota_inode->i_mutex,
> +	lockdep_set_class(&sdp->sd_quota_inode->i_rwsem,
> 			  &gfs2_quota_imutex_key);
> 
> 	error = gfs2_rindex_update(sdp);
> diff --git a/fs/inode.c b/fs/inode.c
> index 4b884f7..4ccbc21 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -166,8 +166,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> 	spin_lock_init(&inode->i_lock);
> 	lockdep_set_class(&inode->i_lock, &sb->s_type->i_lock_key);
> 
> -	mutex_init(&inode->i_mutex);
> -	lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
> +	init_rwsem(&inode->i_rwsem);
> +	lockdep_set_class(&inode->i_rwsem, &sb->s_type->i_mutex_key);
> 
> 	atomic_set(&inode->i_dio_count, 0);
> 
> @@ -925,13 +925,13 @@ void lockdep_annotate_inode_mutex_key(struct inode *inode)
> 		struct file_system_type *type = inode->i_sb->s_type;
> 
> 		/* Set new key only if filesystem hasn't already changed it */
> -		if (lockdep_match_class(&inode->i_mutex, &type->i_mutex_key)) {
> +		if (lockdep_match_class(&inode->i_rwsem, &type->i_mutex_key)) {
> 			/*
> 			 * ensure nobody is actually holding i_mutex
> 			 */
> -			mutex_destroy(&inode->i_mutex);
> -			mutex_init(&inode->i_mutex);
> -			lockdep_set_class(&inode->i_mutex,
> +			// mutex_destroy(&inode->i_mutex);
> +			init_rwsem(&inode->i_rwsem);
> +			lockdep_set_class(&inode->i_rwsem,
> 					  &type->i_mutex_dir_key);
> 		}
> 	}
> diff --git a/fs/namei.c b/fs/namei.c
> index eb879d6..877e9ef 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1607,7 +1607,7 @@ static struct dentry *lookup_slow(const struct qstr *name,
> 	struct inode *inode = dir->d_inode;
> 	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
> 
> -	inode_lock(inode);
> +	down_read(&inode->i_rwsem);

Wouldn't it make sense to have helpers like "inode_read_lock(inode)" or similar,
so that it is consistent with other parts of the code and easier to find?
It's a bit strange to have the filesystems use "inode_lock()" and some places
here use "inode_lock_nested()", but other places use up_read() and down_read()
directly on &inode->i_rwsem.  That would also simplify delegating the directory
locking to the filesystems in the future.

Cheers, Andreas

> 	/* Don't go there if it's already dead */
> 	if (unlikely(IS_DEADDIR(inode)))
> 		goto out;
> @@ -1638,7 +1638,7 @@ again:
> 		}
> 	}
> out:
> -	inode_unlock(inode);
> +	up_read(&inode->i_rwsem);
> 	return dentry;
> }
> 
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index 12f4a9e..0748777 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -262,7 +262,7 @@ static int ocfs2_init_locked_inode(struct inode *inode, void *opaque)
> 	inode->i_ino = args->fi_ino;
> 	OCFS2_I(inode)->ip_blkno = args->fi_blkno;
> 	if (args->fi_sysfile_type != 0)
> -		lockdep_set_class(&inode->i_mutex,
> +		lockdep_set_class(&inode->i_rwsem,
> 			&ocfs2_sysfile_lock_key[args->fi_sysfile_type]);
> 	if (args->fi_sysfile_type == USER_QUOTA_SYSTEM_INODE ||
> 	    args->fi_sysfile_type == GROUP_QUOTA_SYSTEM_INODE ||
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 6ec1e43..da186ee 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -218,7 +218,9 @@ static int ovl_check_whiteouts(struct dentry *dir, struct ovl_readdir_data *rdd)
> 	cap_raise(override_cred->cap_effective, CAP_DAC_OVERRIDE);
> 	old_cred = override_creds(override_cred);
> 
> -	err = mutex_lock_killable(&dir->d_inode->i_mutex);
> +	inode_lock(dir->d_inode);
> +	err = 0;
> +	// XXX: err = mutex_lock_killable(&dir->d_inode->i_mutex);
> 	if (!err) {
> 		while (rdd->first_maybe_whiteout) {
> 			p = rdd->first_maybe_whiteout;
> diff --git a/fs/readdir.c b/fs/readdir.c
> index e69ef3b..bf583e8 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -32,9 +32,10 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
> 	if (res)
> 		goto out;
> 
> -	res = mutex_lock_killable(&inode->i_mutex);
> -	if (res)
> -		goto out;
> +	inode_lock(inode);
> +	// res = mutex_lock_killable(&inode->i_mutex);
> +	// if (res)
> +	//	goto out;
> 
> 	res = -ENOENT;
> 	if (!IS_DEADDIR(inode)) {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0a32045..313ad28 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -647,7 +647,7 @@ struct inode {
> 
> 	/* Misc */
> 	unsigned long		i_state;
> -	struct mutex		i_mutex;
> +	struct rw_semaphore	i_rwsem;
> 
> 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
> 	unsigned long		dirtied_time_when;
> @@ -734,27 +734,27 @@ enum inode_i_mutex_lock_class
> 
> static inline void inode_lock(struct inode *inode)
> {
> -	mutex_lock(&inode->i_mutex);
> +	down_write(&inode->i_rwsem);
> }
> 
> static inline void inode_unlock(struct inode *inode)
> {
> -	mutex_unlock(&inode->i_mutex);
> +	up_write(&inode->i_rwsem);
> }
> 
> static inline int inode_trylock(struct inode *inode)
> {
> -	return mutex_trylock(&inode->i_mutex);
> +	return down_write_trylock(&inode->i_rwsem);
> }
> 
> static inline int inode_is_locked(struct inode *inode)
> {
> -	return mutex_is_locked(&inode->i_mutex);
> +	return rwsem_is_locked(&inode->i_rwsem);
> }
> 
> static inline void inode_lock_nested(struct inode *inode, unsigned subclass)
> {
> -	mutex_lock_nested(&inode->i_mutex, subclass);
> +	down_write_nested(&inode->i_rwsem, subclass);
> }
> 
> void lock_two_nondirectories(struct inode *, struct inode*);
> --
> 2.8.0.rc3
> 
[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCHSET][RFC][CFT] parallel lookups
  2016-04-16  3:02 ` [PATCHSET][RFC][CFT] parallel lookups Andreas Dilger
@ 2016-04-16  3:27   ` Al Viro
  0 siblings, 0 replies; 24+ messages in thread
From: Al Viro @ 2016-04-16  3:27 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Fri, Apr 15, 2016 at 09:02:02PM -0600, Andreas Dilger wrote:

> Looks very interesting, and long awaited.  How do you see the parallel
> operations moving forward?  Staying as lookup only, or moving on to parallel
> modifications as well?

lookup + readdir.  Not even atomic_open at this point, and that's the
route I'd suggest for modifiers - i.e. combined lookup + mkdir, etc.
operations.  But we'd really need to sort the atomic_open pathway out first...

Let's discuss that at LSFMM, corridor track if needed.  With lookups I'd been
able to keep the surgery site pretty much entirely in VFS proper - fs/dcache.c
and (after earlier massage) a single function in fs/namei.c.  With readdir
it'll be somewhat more invasive - pre-seeding dcache is done in a bunch of
filesystems right now (mostly the network ones, where readdir request is
equivalent to bulk lookup, as well as synthetic-inodes ones a-la procfs)
and it'll need to be regularized (ncpfs is particularly nasty, what with its
case-changing crap), but at least it will be reasonably compact.  For
atomic_open, and worse yet - mkdir/mknod/symlink/link/unlink/rmdir/rename
it will really dip into filesystem code.  A lot.

FWIW, I agree that relying on i_mutex^Wi_rwsem for dcache protection is
something worth getting rid of in the longer term.  But that protection is
there right now, and getting rid of that will take quite a bit of careful
massage.  I don't have such a transition plotted yet; not enough information
at the moment, and I seriously suspect that atomic_open would be the best
place to start.  If nothing else, there are reasonably few instances of that
puppy.  Moreover, we badly need to regularize the paths around do_last() -
right now they are messy as hell.  Once that is sorted out, we'll be in better
position to deal with the rest of directory-modifying operations.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 15/15] parallel lookups: actual switch to rwsem
  2016-04-16  3:02   ` Andreas Dilger
@ 2016-04-16  3:31     ` Al Viro
  0 siblings, 0 replies; 24+ messages in thread
From: Al Viro @ 2016-04-16  3:31 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Fri, Apr 15, 2016 at 09:02:06PM -0600, Andreas Dilger wrote:

> Wouldn't it make sense to have helpers like "inode_read_lock(inode)" or similar,
> so that it is consistent with other parts of the code and easier to find?
> It's a bit strange to have the filesystems use "inode_lock()" and some places
> here use "inode_lock_nested()", but other places use up_read() and down_read()
> directly on &inode->i_rwsem.  That would also simplify delegating the directory
> locking to the filesystems in the future.

FWIW, my preference would be inode_lock_shared(), but that's bikeshedding;
seeing that we have very few callers at the moment *and* there's the missing
down_write_killable() stuff...  This patch will obviously be reworked and
it's small enough to be understandable, open-coding or not.
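
For reference, shared-lock counterparts of the 15/15 helpers would
presumably be little more than the following (sketch only, mirroring the
down_read()/up_read() calls already open-coded in lookup_slow()):

	static inline void inode_lock_shared(struct inode *inode)
	{
		down_read(&inode->i_rwsem);
	}

	static inline void inode_unlock_shared(struct inode *inode)
	{
		up_read(&inode->i_rwsem);
	}

	static inline int inode_trylock_shared(struct inode *inode)
	{
		return down_read_trylock(&inode->i_rwsem);
	}

with lookup_slow() and __d_unalias() switched over once the naming is
settled.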

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 07/15] reconnect_one(): use lookup_one_len_unlocked()
  2016-04-16  0:55 ` [PATCH 07/15] reconnect_one(): use lookup_one_len_unlocked() Al Viro
@ 2016-04-18 19:23   ` J. Bruce Fields
  0 siblings, 0 replies; 24+ messages in thread
From: J. Bruce Fields @ 2016-04-18 19:23 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sat, Apr 16, 2016 at 01:55:19AM +0100, Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> ... and explain the non-obvious logics in case when lookup yields
> a different dentry.

ACK to this independent of the rest of the series; my only minor gripe
is that the point made in this new comment is also made at
out_reconnected: and at the head of reconnect_one(), and it would also
apply at the other "goto out_reconnected".  My preference is to leave
things as they are, but if you found it easy to overlook, maybe just
something like:

 -		goto out_reconnected;
 +		goto out_reconnected; /* see comment there */

would be helpful.  And/or improving the final comment.

--b.

> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/exportfs/expfs.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index c46f1a1..402c5ca 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -143,14 +143,18 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
>  	if (err)
>  		goto out_err;
>  	dprintk("%s: found name: %s\n", __func__, nbuf);
> -	inode_lock(parent->d_inode);
> -	tmp = lookup_one_len(nbuf, parent, strlen(nbuf));
> -	inode_unlock(parent->d_inode);
> +	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
>  	if (IS_ERR(tmp)) {
>  		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
>  		goto out_err;
>  	}
>  	if (tmp != dentry) {
> +		/*
> +		 * Somebody has renamed it since exportfs_get_name();
> +		 * great, since it could've only been renamed if it
> +		 * got looked up and thus connected, and it would
> +		 * remain connected afterwards.  We are done.
> +		 */
>  		dput(tmp);
>  		goto out_reconnected;
>  	}
> -- 
> 2.8.0.rc3
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 13/15] parallel lookups machinery, part 3
  2016-04-16  0:55 ` [PATCH 13/15] parallel lookups machinery, part 3 Al Viro
@ 2016-04-18 20:45   ` J. Bruce Fields
  0 siblings, 0 replies; 24+ messages in thread
From: J. Bruce Fields @ 2016-04-18 20:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sat, Apr 16, 2016 at 01:55:25AM +0100, Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> We will need to be able to check if there is an in-lookup
> dentry with matching parent/name.  Right now it's impossible,
> but as soon as we start locking directories shared, such beasts
> will appear.
> 
> Add a secondary hash for locating those.  Hash chains go through
> the same space where d_alias will be once it's not in-lookup anymore.
> Search is done under the same bitlock we use for modifications -
> with the primary hash we can rely on d_rehash() into the wrong
> chain being the worst that could happen, but here the pointers are
> buggered once it's removed from the chain.  On the other hand,
> the chains are not going to be long and normally we'll end up
> adding to the chain anyway.  That allows us to avoid bothering with
> ->d_lock when doing the comparisons - everything is stable until
> removed from chain.
> 
> New helper: d_alloc_parallel().  Right now it allocates, verifies
> that no hashed and in-lookup matches exist and adds to in-lookup
> hash.
> 
> Returns ERR_PTR() for error, hashed match (in the unlikely case it's
> been found) or new dentry.  In-lookup matches trigger BUG() for
> now; that will change in the next commit when we introduce waiting
> for ongoing lookup to finish.  Note that in-lookup matches won't be
> possible until we actually go for shared locking.
> 
> lookup_slow() switched to use of d_alloc_parallel().
> 
> Again, these commits are separated only for making it easier to
> review.  All this machinery will start doing something useful only
> when we go for shared locking; it's just that the combination is
> too large for my taste.

Thanks for doing that!  I'm not sure I get this yet, but I feel like I
have at least a fighting chance...

--b.

> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/dcache.c            | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/namei.c             | 44 +++++++++++--------------
>  include/linux/dcache.h |  2 ++
>  3 files changed, 108 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 3959f18..0552002 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -111,6 +111,17 @@ static inline struct hlist_bl_head *d_hash(const struct dentry *parent,
>  	return dentry_hashtable + hash_32(hash, d_hash_shift);
>  }
>  
> +#define IN_LOOKUP_SHIFT 10
> +static struct hlist_bl_head in_lookup_hashtable[1 << IN_LOOKUP_SHIFT];
> +
> +static inline struct hlist_bl_head *in_lookup_hash(const struct dentry *parent,
> +					unsigned int hash)
> +{
> +	hash += (unsigned long) parent / L1_CACHE_BYTES;
> +	return in_lookup_hashtable + hash_32(hash, IN_LOOKUP_SHIFT);
> +}
> +
> +
>  /* Statistics gathering. */
>  struct dentry_stat_t dentry_stat = {
>  	.age_limit = 45,
> @@ -2377,9 +2388,85 @@ static inline void end_dir_add(struct inode *dir, unsigned n)
>  	smp_store_release(&dir->i_dir_seq, n + 2);
>  }
>  
> +struct dentry *d_alloc_parallel(struct dentry *parent,
> +				const struct qstr *name)
> +{
> +	unsigned int len = name->len;
> +	unsigned int hash = name->hash;
> +	const unsigned char *str = name->name;
> +	struct hlist_bl_head *b = in_lookup_hash(parent, hash);
> +	struct hlist_bl_node *node;
> +	struct dentry *new = d_alloc(parent, name);
> +	struct dentry *dentry;
> +	unsigned seq;
> +
> +	if (unlikely(!new))
> +		return ERR_PTR(-ENOMEM);
> +
> +retry:
> +	seq = smp_load_acquire(&parent->d_inode->i_dir_seq) & ~1;
> +	dentry = d_lookup(parent, name);
> +	if (unlikely(dentry)) {
> +		dput(new);
> +		return dentry;
> +	}
> +
> +	hlist_bl_lock(b);
> +	smp_rmb();
> +	if (unlikely(parent->d_inode->i_dir_seq != seq)) {
> +		hlist_bl_unlock(b);
> +		goto retry;
> +	}
> +	/*
> +	 * No changes for the parent since the beginning of d_lookup().
> +	 * Since all removals from the chain happen with hlist_bl_lock(),
> +	 * any potential in-lookup matches are going to stay here until
> +	 * we unlock the chain.  All fields are stable in everything
> +	 * we encounter.
> +	 */
> +	hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
> +		if (dentry->d_name.hash != hash)
> +			continue;
> +		if (dentry->d_parent != parent)
> +			continue;
> +		if (d_unhashed(dentry))
> +			continue;
> +		if (parent->d_flags & DCACHE_OP_COMPARE) {
> +			int tlen = dentry->d_name.len;
> +			const char *tname = dentry->d_name.name;
> +			if (parent->d_op->d_compare(parent, dentry, tlen, tname, name))
> +				continue;
> +		} else {
> +			if (dentry->d_name.len != len)
> +				continue;
> +			if (dentry_cmp(dentry, str, len))
> +				continue;
> +		}
> +		dget(dentry);
> +		hlist_bl_unlock(b);
> +		/* impossible until we actually enable parallel lookups */
> +		BUG();
> +		/* and this will be "wait for it to stop being in-lookup" */
> +		/* this one will be handled in the next commit */
> +		dput(new);
> +		return dentry;
> +	}
> +	/* we can't take ->d_lock here; it's OK, though. */
> +	new->d_flags |= DCACHE_PAR_LOOKUP;
> +	hlist_bl_add_head_rcu(&new->d_u.d_in_lookup_hash, b);
> +	hlist_bl_unlock(b);
> +	return new;
> +}
> +
>  void __d_not_in_lookup(struct dentry *dentry)
>  {
> +	struct hlist_bl_head *b = in_lookup_hash(dentry->d_parent,
> +						 dentry->d_name.hash);
> +	hlist_bl_lock(b);
>  	dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
> +	__hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
> +	hlist_bl_unlock(b);
> +	INIT_HLIST_NODE(&dentry->d_u.d_alias);
>  	/* more stuff will land here */
>  }
>  
> diff --git a/fs/namei.c b/fs/namei.c
> index 0ee8b9d..fbce016 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1603,46 +1603,40 @@ static struct dentry *lookup_slow(const struct qstr *name,
>  				  struct dentry *dir,
>  				  unsigned int flags)
>  {
> -	struct dentry *dentry, *old;
> +	struct dentry *dentry = ERR_PTR(-ENOENT), *old;
>  	struct inode *inode = dir->d_inode;
>  
>  	inode_lock(inode);
>  	/* Don't go there if it's already dead */
> -	if (unlikely(IS_DEADDIR(inode))) {
> -		inode_unlock(inode);
> -		return ERR_PTR(-ENOENT);
> -	}
> -	dentry = d_lookup(dir, name);
> -	if (unlikely(dentry)) {
> +	if (unlikely(IS_DEADDIR(inode)))
> +		goto out;
> +again:
> +	dentry = d_alloc_parallel(dir, name);
> +	if (IS_ERR(dentry))
> +		goto out;
> +	if (unlikely(!(dentry->d_flags & DCACHE_PAR_LOOKUP))) {
>  		if ((dentry->d_flags & DCACHE_OP_REVALIDATE) &&
>  		    !(flags & LOOKUP_NO_REVAL)) {
>  			int error = d_revalidate(dentry, flags);
>  			if (unlikely(error <= 0)) {
> -				if (!error)
> +				if (!error) {
>  					d_invalidate(dentry);
> +					dput(dentry);
> +					goto again;
> +				}
>  				dput(dentry);
>  				dentry = ERR_PTR(error);
>  			}
>  		}
> -		if (dentry) {
> -			inode_unlock(inode);
> -			return dentry;
> +	} else {
> +		old = inode->i_op->lookup(inode, dentry, flags);
> +		d_not_in_lookup(dentry);
> +		if (unlikely(old)) {
> +			dput(dentry);
> +			dentry = old;
>  		}
>  	}
> -	dentry = d_alloc(dir, name);
> -	if (unlikely(!dentry)) {
> -		inode_unlock(inode);
> -		return ERR_PTR(-ENOMEM);
> -	}
> -	spin_lock(&dentry->d_lock);
> -	dentry->d_flags |= DCACHE_PAR_LOOKUP;
> -	spin_unlock(&dentry->d_lock);
> -	old = inode->i_op->lookup(inode, dentry, flags);
> -	d_not_in_lookup(dentry);
> -	if (unlikely(old)) {
> -		dput(dentry);
> -		dentry = old;
> -	}
> +out:
>  	inode_unlock(inode);
>  	return dentry;
>  }
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index cfc1240..3ab5ce4 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -131,6 +131,7 @@ struct dentry {
>  	 */
>  	union {
>  		struct hlist_node d_alias;	/* inode alias list */
> +		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
>  	 	struct rcu_head d_rcu;
>  	} d_u;
>  };
> @@ -248,6 +249,7 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
>  /* allocate/de-allocate */
>  extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
>  extern struct dentry * d_alloc_pseudo(struct super_block *, const struct qstr *);
> +extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
>  extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
>  extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
>  extern struct dentry * d_exact_alias(struct dentry *, struct inode *);
> -- 
> 2.8.0.rc3
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 10/15] __d_add(): don't drop/regain ->d_lock
  2016-04-16  0:55 ` [PATCH 10/15] __d_add(): don't drop/regain ->d_lock Al Viro
@ 2016-04-24 18:09   ` Jeff Layton
  2016-04-24 19:21     ` Al Viro
  0 siblings, 1 reply; 24+ messages in thread
From: Jeff Layton @ 2016-04-24 18:09 UTC (permalink / raw)
  To: Al Viro, Linus Torvalds; +Cc: linux-kernel, linux-fsdevel

On Sat, 2016-04-16 at 01:55 +0100, Al Viro wrote:
> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/dcache.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e9de4d9..33cad8a 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -2363,11 +2363,19 @@ EXPORT_SYMBOL(d_rehash);
>  
>  static inline void __d_add(struct dentry *dentry, struct inode *inode)
>  {
> +	spin_lock(&dentry->d_lock);
>  	if (inode) {
> -		__d_instantiate(dentry, inode);
> +		unsigned add_flags = d_flags_for_inode(inode);
> +		hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
> +		raw_write_seqcount_begin(&dentry->d_seq);
> +		__d_set_inode_and_type(dentry, inode, add_flags);
> +		raw_write_seqcount_end(&dentry->d_seq);
> +		__fsnotify_d_instantiate(dentry);

Should the above be a new __d_instantiate_locked instead of open-coding 
it?

> +	}
> +	_d_rehash(dentry);
> +	spin_unlock(&dentry->d_lock);
> +	if (inode)
>  		spin_unlock(&inode->i_lock);
> -	}
> -	d_rehash(dentry);
>  }
>  
>  /**
-- 
Jeff Layton <jlayton@poochiereds.net>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 10/15] __d_add(): don't drop/regain ->d_lock
  2016-04-24 18:09   ` Jeff Layton
@ 2016-04-24 19:21     ` Al Viro
  0 siblings, 0 replies; 24+ messages in thread
From: Al Viro @ 2016-04-24 19:21 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Linus Torvalds, linux-kernel, linux-fsdevel

On Sun, Apr 24, 2016 at 02:09:09PM -0400, Jeff Layton wrote:

> Should the above be a new __d_instantiate_locked instead of open-coding 
> it?

Nope - it gets the rehash mashed into it a few commits later.  In principle
we could try to fish the common helper out of it and __d_instantiate()
once the dust settles, but IMO there's not much point.
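
For the curious, the factored-out helper Jeff is describing would be built
from exactly the sequence __d_add() now open-codes (sketch only - this
function does not exist in the series; ->d_lock and the inode's ->i_lock
are assumed to be held by the caller):

	static void __d_instantiate_locked(struct dentry *dentry,
					   struct inode *inode)
	{
		unsigned add_flags = d_flags_for_inode(inode);

		hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
		raw_write_seqcount_begin(&dentry->d_seq);
		__d_set_inode_and_type(dentry, inode, add_flags);
		raw_write_seqcount_end(&dentry->d_seq);
		__fsnotify_d_instantiate(dentry);
	}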

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2016-04-24 19:21 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-16  0:52 [PATCHSET][RFC][CFT] parallel lookups Al Viro
2016-04-16  0:55 ` [PATCH 01/15] security_d_instantiate(): move to the point prior to attaching dentry to inode Al Viro
2016-04-16  0:55 ` [PATCH 02/15] kernfs: use lookup_one_len_unlocked() Al Viro
2016-04-16  0:55 ` [PATCH 03/15] configfs_detach_prep(): make sure that wait_mutex won't go away Al Viro
2016-04-16  0:55 ` [PATCH 04/15] ocfs2: don't open-code inode_lock/inode_unlock Al Viro
2016-04-16  0:55 ` [PATCH 05/15] orangefs: " Al Viro
2016-04-16  0:55 ` [PATCH 06/15] reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack() Al Viro
2016-04-16  0:55 ` [PATCH 07/15] reconnect_one(): use lookup_one_len_unlocked() Al Viro
2016-04-18 19:23   ` J. Bruce Fields
2016-04-16  0:55 ` [PATCH 08/15] ovl_lookup_real(): " Al Viro
2016-04-16  0:55 ` [PATCH 09/15] lookup_slow(): bugger off on IS_DEADDIR() from the very beginning Al Viro
2016-04-16  0:55 ` [PATCH 10/15] __d_add(): don't drop/regain ->d_lock Al Viro
2016-04-24 18:09   ` Jeff Layton
2016-04-24 19:21     ` Al Viro
2016-04-16  0:55 ` [PATCH 11/15] beginning of transition to parallel lookups - marking in-lookup dentries Al Viro
2016-04-16  0:55 ` [PATCH 12/15] parallel lookups machinery, part 2 Al Viro
2016-04-16  0:55 ` [PATCH 13/15] parallel lookups machinery, part 3 Al Viro
2016-04-18 20:45   ` J. Bruce Fields
2016-04-16  0:55 ` [PATCH 14/15] parallel lookups machinery, part 4 (and last) Al Viro
2016-04-16  0:55 ` [PATCH 15/15] parallel lookups: actual switch to rwsem Al Viro
2016-04-16  3:02   ` Andreas Dilger
2016-04-16  3:31     ` Al Viro
2016-04-16  3:02 ` [PATCHSET][RFC][CFT] parallel lookups Andreas Dilger
2016-04-16  3:27   ` Al Viro
