All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK
@ 2020-12-12 16:51 Jens Axboe
  2020-12-12 16:51 ` [PATCH 1/5] fs: make unlazy_walk() error handling consistent Jens Axboe
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 16:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, viro

Hi,

After the discussion yesterday, I reworked some of the bits since the v1
posting:

- Improved the 1/5 cleanup patch, based on Al's feedback.

- 2/5 is much cleaner now, also based on Al's feedback. For later
  patches, we also need the same LOOKUP_NONBLOCK failure case in
  unlazy_child(), so I added that. Also fixed the LOOKUP_NONBLOCK
  check in path_init().

- Add mnt_want_write_trylock(). Since we already have
  sb_start_write_trylock(), this is pretty trivial.

- Add 4/5 which handles LOOKUP_NONBLOCK for open_last_lookups(),
  do_open(), and do_tmpfile(). O_TRUNC is expressly not allowed
  with LOOKUP_NONBLOCK/RESOLVE_NONBLOCK, so I didn't want to
  attempt to wire up nonblocking semantics there.

- Fixed comment in 5/5, and added the O_TRUNC check there too. No
  further changes there. I'll commit to writing the change for the
  openat2(2) man page as well, as well as some test cases, once we make
  some more progress on the kernel side.

Ran this through some basic testing with both io_uring (that patch
I left out of this series) and openat2(), and works for me...

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] fs: make unlazy_walk() error handling consistent
  2020-12-12 16:51 [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK Jens Axboe
@ 2020-12-12 16:51 ` Jens Axboe
  2020-12-12 16:51 ` [PATCH 2/5] fs: add support for LOOKUP_NONBLOCK Jens Axboe
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 16:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Most callers check for non-zero return, and assume it's -ECHILD (which
it always will be). One caller uses the actual error return. Clean this
up and make it fully consistent, by having unlazy_walk() return a bool
instead. Rename it to try_to_unlazy() and return true on success, and
failure on error. That's easier to read.

No functional changes in this patch.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/namei.c | 43 +++++++++++++++++--------------------------
 1 file changed, 17 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 03d0e11e4f36..7eb7830da298 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -669,17 +669,17 @@ static bool legitimize_root(struct nameidata *nd)
  */
 
 /**
- * unlazy_walk - try to switch to ref-walk mode.
+ * try_to_unlazy - try to switch to ref-walk mode.
  * @nd: nameidata pathwalk data
- * Returns: 0 on success, -ECHILD on failure
+ * Returns: true on success, false on failure
  *
- * unlazy_walk attempts to legitimize the current nd->path and nd->root
+ * try_to_unlazyk attempts to legitimize the current nd->path and nd->root
  * for ref-walk mode.
  * Must be called from rcu-walk context.
- * Nothing should touch nameidata between unlazy_walk() failure and
+ * Nothing should touch nameidata between try_to_unlazy() failure and
  * terminate_walk().
  */
-static int unlazy_walk(struct nameidata *nd)
+static bool try_to_unlazy(struct nameidata *nd)
 {
 	struct dentry *parent = nd->path.dentry;
 
@@ -694,14 +694,14 @@ static int unlazy_walk(struct nameidata *nd)
 		goto out;
 	rcu_read_unlock();
 	BUG_ON(nd->inode != parent->d_inode);
-	return 0;
+	return true;
 
 out1:
 	nd->path.mnt = NULL;
 	nd->path.dentry = NULL;
 out:
 	rcu_read_unlock();
-	return -ECHILD;
+	return false;
 }
 
 /**
@@ -792,7 +792,7 @@ static int complete_walk(struct nameidata *nd)
 		 */
 		if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
 			nd->root.mnt = NULL;
-		if (unlikely(unlazy_walk(nd)))
+		if (!try_to_unlazy(nd))
 			return -ECHILD;
 	}
 
@@ -1466,7 +1466,7 @@ static struct dentry *lookup_fast(struct nameidata *nd,
 		unsigned seq;
 		dentry = __d_lookup_rcu(parent, &nd->last, &seq);
 		if (unlikely(!dentry)) {
-			if (unlazy_walk(nd))
+			if (!try_to_unlazy(nd))
 				return ERR_PTR(-ECHILD);
 			return NULL;
 		}
@@ -1567,10 +1567,8 @@ static inline int may_lookup(struct nameidata *nd)
 {
 	if (nd->flags & LOOKUP_RCU) {
 		int err = inode_permission(nd->inode, MAY_EXEC|MAY_NOT_BLOCK);
-		if (err != -ECHILD)
+		if (err != -ECHILD || !try_to_unlazy(nd))
 			return err;
-		if (unlazy_walk(nd))
-			return -ECHILD;
 	}
 	return inode_permission(nd->inode, MAY_EXEC);
 }
@@ -1592,7 +1590,7 @@ static int reserve_stack(struct nameidata *nd, struct path *link, unsigned seq)
 		// unlazy even if we fail to grab the link - cleanup needs it
 		bool grabbed_link = legitimize_path(nd, link, seq);
 
-		if (unlazy_walk(nd) != 0 || !grabbed_link)
+		if (!try_to_unlazy(nd) != 0 || !grabbed_link)
 			return -ECHILD;
 
 		if (nd_alloc_stack(nd))
@@ -1634,7 +1632,7 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		touch_atime(&last->link);
 		cond_resched();
 	} else if (atime_needs_update(&last->link, inode)) {
-		if (unlikely(unlazy_walk(nd)))
+		if (!try_to_unlazy(nd))
 			return ERR_PTR(-ECHILD);
 		touch_atime(&last->link);
 	}
@@ -1651,11 +1649,8 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		get = inode->i_op->get_link;
 		if (nd->flags & LOOKUP_RCU) {
 			res = get(NULL, inode, &last->done);
-			if (res == ERR_PTR(-ECHILD)) {
-				if (unlikely(unlazy_walk(nd)))
-					return ERR_PTR(-ECHILD);
+			if (res == ERR_PTR(-ECHILD) && try_to_unlazy(nd))
 				res = get(link->dentry, inode, &last->done);
-			}
 		} else {
 			res = get(link->dentry, inode, &last->done);
 		}
@@ -2193,7 +2188,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 		}
 		if (unlikely(!d_can_lookup(nd->path.dentry))) {
 			if (nd->flags & LOOKUP_RCU) {
-				if (unlazy_walk(nd))
+				if (!try_to_unlazy(nd))
 					return -ECHILD;
 			}
 			return -ENOTDIR;
@@ -3127,7 +3122,6 @@ static const char *open_last_lookups(struct nameidata *nd,
 	struct inode *inode;
 	struct dentry *dentry;
 	const char *res;
-	int error;
 
 	nd->flags |= op->intent;
 
@@ -3151,9 +3145,8 @@ static const char *open_last_lookups(struct nameidata *nd,
 	} else {
 		/* create side of things */
 		if (nd->flags & LOOKUP_RCU) {
-			error = unlazy_walk(nd);
-			if (unlikely(error))
-				return ERR_PTR(error);
+			if (!try_to_unlazy(nd))
+				return ERR_PTR(-ECHILD);
 		}
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
 		/* trailing slashes? */
@@ -3162,9 +3155,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
-		error = mnt_want_write(nd->path.mnt);
-		if (!error)
-			got_write = true;
+		got_write = !mnt_want_write(nd->path.mnt);
 		/*
 		 * do _not_ fail yet - we might not need that or fail with
 		 * a different error; let lookup_open() decide; we'll be
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/5] fs: add support for LOOKUP_NONBLOCK
  2020-12-12 16:51 [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK Jens Axboe
  2020-12-12 16:51 ` [PATCH 1/5] fs: make unlazy_walk() error handling consistent Jens Axboe
@ 2020-12-12 16:51 ` Jens Axboe
  2020-12-12 16:51 ` [PATCH 3/5] fs: add mnt_want_write_trylock() Jens Axboe
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 16:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

io_uring always punts opens to async context, since there's no control
over whether the lookup blocks or not. Add LOOKUP_NONBLOCK to support
just doing the fast RCU based lookups, which we know will not block. If
we can do a cached path resolution of the filename, then we don't have
to always punt lookups for a worker.

During path resolution, we always do LOOKUP_RCU first. If that fails and
we terminate LOOKUP_RCU, then fail a LOOKUP_NONBLOCK attempt as well.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/namei.c            | 9 +++++++++
 include/linux/namei.h | 1 +
 2 files changed, 10 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 7eb7830da298..07a1aa874f65 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -686,6 +686,8 @@ static bool try_to_unlazy(struct nameidata *nd)
 	BUG_ON(!(nd->flags & LOOKUP_RCU));
 
 	nd->flags &= ~LOOKUP_RCU;
+	if (nd->flags & LOOKUP_NONBLOCK)
+		goto out1;
 	if (unlikely(!legitimize_links(nd)))
 		goto out1;
 	if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
@@ -722,6 +724,8 @@ static int unlazy_child(struct nameidata *nd, struct dentry *dentry, unsigned se
 	BUG_ON(!(nd->flags & LOOKUP_RCU));
 
 	nd->flags &= ~LOOKUP_RCU;
+	if (nd->flags & LOOKUP_NONBLOCK)
+		goto out2;
 	if (unlikely(!legitimize_links(nd)))
 		goto out2;
 	if (unlikely(!legitimize_mnt(nd->path.mnt, nd->m_seq)))
@@ -792,6 +796,7 @@ static int complete_walk(struct nameidata *nd)
 		 */
 		if (!(nd->flags & (LOOKUP_ROOT | LOOKUP_IS_SCOPED)))
 			nd->root.mnt = NULL;
+		nd->flags &= ~LOOKUP_NONBLOCK;
 		if (!try_to_unlazy(nd))
 			return -ECHILD;
 	}
@@ -2202,6 +2207,10 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 	int error;
 	const char *s = nd->name->name;
 
+	/* LOOKUP_NONBLOCK requires RCU, ask caller to retry */
+	if ((flags & (LOOKUP_RCU | LOOKUP_NONBLOCK)) == LOOKUP_NONBLOCK)
+		return ERR_PTR(-EAGAIN);
+
 	if (!*s)
 		flags &= ~LOOKUP_RCU;
 	if (flags & LOOKUP_RCU)
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a4bb992623c4..c36c4e0805fc 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -46,6 +46,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_NO_XDEV		0x040000 /* No mountpoint crossing. */
 #define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
 #define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
+#define LOOKUP_NONBLOCK		0x200000 /* don't block for lookup */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
 #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/5] fs: add mnt_want_write_trylock()
  2020-12-12 16:51 [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK Jens Axboe
  2020-12-12 16:51 ` [PATCH 1/5] fs: make unlazy_walk() error handling consistent Jens Axboe
  2020-12-12 16:51 ` [PATCH 2/5] fs: add support for LOOKUP_NONBLOCK Jens Axboe
@ 2020-12-12 16:51 ` Jens Axboe
  2020-12-12 16:51 ` [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open Jens Axboe
  2020-12-12 16:51 ` [PATCH 5/5] fs: expose LOOKUP_NONBLOCK through openat2() RESOLVE_NONBLOCK Jens Axboe
  4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 16:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

trylock variant of mnt_want_write() - this ends up being pretty trivial,
as we already have a trylock variant of the sb_start_write() helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/namespace.c        | 18 ++++++++++++++++++
 include/linux/mount.h |  1 +
 2 files changed, 19 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index cebaa3e81794..7881cb5595af 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -359,6 +359,24 @@ int mnt_want_write(struct vfsmount *m)
 }
 EXPORT_SYMBOL_GPL(mnt_want_write);
 
+/**
+ * mnt_want_write_trylock - try to get write access to a mount
+ * @m: the mount on which to take a write
+ *
+ * trylock variant of @mnt_want_write. See description above.
+ */
+int mnt_want_write_trylock(struct vfsmount *m)
+{
+	int ret;
+
+	if (!sb_start_write_trylock(m->mnt_sb))
+		return -EAGAIN;
+	ret = __mnt_want_write(m);
+	if (ret)
+		sb_end_write(m->mnt_sb);
+	return ret;
+}
+
 /**
  * mnt_clone_write - get write access to a mount
  * @mnt: the mount on which to take a write
diff --git a/include/linux/mount.h b/include/linux/mount.h
index aaf343b38671..e267e622d843 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -78,6 +78,7 @@ struct file; /* forward dec */
 struct path;
 
 extern int mnt_want_write(struct vfsmount *mnt);
+extern int mnt_want_write_trylock(struct vfsmount *mnt);
 extern int mnt_want_write_file(struct file *file);
 extern int mnt_clone_write(struct vfsmount *mnt);
 extern void mnt_drop_write(struct vfsmount *mnt);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 16:51 [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK Jens Axboe
                   ` (2 preceding siblings ...)
  2020-12-12 16:51 ` [PATCH 3/5] fs: add mnt_want_write_trylock() Jens Axboe
@ 2020-12-12 16:51 ` Jens Axboe
  2020-12-12 17:25   ` Al Viro
  2020-12-12 18:57   ` Linus Torvalds
  2020-12-12 16:51 ` [PATCH 5/5] fs: expose LOOKUP_NONBLOCK through openat2() RESOLVE_NONBLOCK Jens Axboe
  4 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 16:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

We handle it for the path resolution itself, but we should also factor
it in for open_last_lookups() and tmpfile open.

We don't allow RESOLVE_NONBLOCK with O_TRUNC, so that case we can safely
ignore.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/namei.c | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 07a1aa874f65..1f976a213eef 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3127,6 +3127,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 	struct dentry *dir = nd->path.dentry;
 	int open_flag = op->open_flag;
 	bool got_write = false;
+	bool nonblock = nd->flags & LOOKUP_NONBLOCK;
 	unsigned seq;
 	struct inode *inode;
 	struct dentry *dentry;
@@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd,
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
-		got_write = !mnt_want_write(nd->path.mnt);
+		if (nonblock) {
+			got_write = !mnt_want_write_trylock(nd->path.mnt);
+			if (!got_write)
+				return ERR_PTR(-EAGAIN);
+		} else {
+			got_write = !mnt_want_write(nd->path.mnt);
+		}
 		/*
 		 * do _not_ fail yet - we might not need that or fail with
 		 * a different error; let lookup_open() decide; we'll be
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
-		inode_lock(dir->d_inode);
-	else
-		inode_lock_shared(dir->d_inode);
+	if (open_flag & O_CREAT) {
+		if (nonblock) {
+			if (!inode_trylock(dir->d_inode)) {
+				dentry = ERR_PTR(-EAGAIN);
+				goto drop_write;
+			}
+		} else {
+			inode_lock(dir->d_inode);
+		}
+	} else {
+		if (nonblock) {
+			if (!inode_trylock_shared(dir->d_inode)) {
+				dentry = ERR_PTR(-EAGAIN);
+				goto drop_write;
+			}
+		} else {
+			inode_lock_shared(dir->d_inode);
+		}
+	}
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry) && (file->f_mode & FMODE_CREATED))
 		fsnotify_create(dir->d_inode, dentry);
@@ -3183,6 +3205,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 	else
 		inode_unlock_shared(dir->d_inode);
 
+drop_write:
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
 
@@ -3242,6 +3265,7 @@ static int do_open(struct nameidata *nd,
 		open_flag &= ~O_TRUNC;
 		acc_mode = 0;
 	} else if (d_is_reg(nd->path.dentry) && open_flag & O_TRUNC) {
+		WARN_ON_ONCE(nd->flags & LOOKUP_NONBLOCK);
 		error = mnt_want_write(nd->path.mnt);
 		if (error)
 			return error;
@@ -3311,7 +3335,10 @@ static int do_tmpfile(struct nameidata *nd, unsigned flags,
 	int error = path_lookupat(nd, flags | LOOKUP_DIRECTORY, &path);
 	if (unlikely(error))
 		return error;
-	error = mnt_want_write(path.mnt);
+	if (flags & LOOKUP_NONBLOCK)
+		error = mnt_want_write_trylock(path.mnt);
+	else
+		error = mnt_want_write(path.mnt);
 	if (unlikely(error))
 		goto out;
 	child = vfs_tmpfile(path.dentry, op->mode, op->open_flag);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] fs: expose LOOKUP_NONBLOCK through openat2() RESOLVE_NONBLOCK
  2020-12-12 16:51 [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK Jens Axboe
                   ` (3 preceding siblings ...)
  2020-12-12 16:51 ` [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open Jens Axboe
@ 2020-12-12 16:51 ` Jens Axboe
  4 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 16:51 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: torvalds, viro, Jens Axboe

Now that we support non-blocking path resolution internally, expose it
via openat2() in the struct open_how ->resolve flags. This allows
applications using openat2() to limit path resolution to the extent that
it is already cached.

If the lookup cannot be satisfied in a non-blocking manner, openat2(2)
will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/open.c                    | 5 +++++
 include/linux/fcntl.h        | 2 +-
 include/uapi/linux/openat2.h | 4 ++++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 9af548fb841b..3561ef4a6689 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1087,6 +1087,11 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
 		lookup_flags |= LOOKUP_BENEATH;
 	if (how->resolve & RESOLVE_IN_ROOT)
 		lookup_flags |= LOOKUP_IN_ROOT;
+	if (how->resolve & RESOLVE_NONBLOCK) {
+		if (flags & O_TRUNC)
+			return -EAGAIN;
+		lookup_flags |= LOOKUP_NONBLOCK;
+	}
 
 	op->lookup_flags = lookup_flags;
 	return 0;
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 921e750843e6..919a13c9317c 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -19,7 +19,7 @@
 /* List of all valid flags for the how->resolve argument: */
 #define VALID_RESOLVE_FLAGS \
 	(RESOLVE_NO_XDEV | RESOLVE_NO_MAGICLINKS | RESOLVE_NO_SYMLINKS | \
-	 RESOLVE_BENEATH | RESOLVE_IN_ROOT)
+	 RESOLVE_BENEATH | RESOLVE_IN_ROOT | RESOLVE_NONBLOCK)
 
 /* List of all open_how "versions". */
 #define OPEN_HOW_SIZE_VER0	24 /* sizeof first published struct */
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index 58b1eb711360..7bc1d0c35108 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -35,5 +35,9 @@ struct open_how {
 #define RESOLVE_IN_ROOT		0x10 /* Make all jumps to "/" and ".."
 					be scoped inside the dirfd
 					(similar to chroot(2)). */
+#define RESOLVE_NONBLOCK	0x20 /* Only complete if resolution can be
+					completed through cached lookup. May
+					return -EAGAIN if that's not
+					possible. */
 
 #endif /* _UAPI_LINUX_OPENAT2_H */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 16:51 ` [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open Jens Axboe
@ 2020-12-12 17:25   ` Al Viro
  2020-12-12 17:47     ` Jens Axboe
  2020-12-12 18:57   ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Al Viro @ 2020-12-12 17:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, torvalds

On Sat, Dec 12, 2020 at 09:51:04AM -0700, Jens Axboe wrote:

>  	struct dentry *dentry;
> @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd,
>  	}
>  
>  	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
> -		got_write = !mnt_want_write(nd->path.mnt);
> +		if (nonblock) {
> +			got_write = !mnt_want_write_trylock(nd->path.mnt);
> +			if (!got_write)
> +				return ERR_PTR(-EAGAIN);
> +		} else {
> +			got_write = !mnt_want_write(nd->path.mnt);
> +		}
>  		/*
>  		 * do _not_ fail yet - we might not need that or fail with
>  		 * a different error; let lookup_open() decide; we'll be
>  		 * dropping this one anyway.
>  		 */

Read the comment immediately after the place you are modifying.  Really.
To elaborate: consider e.g. the case when /mnt/foo is a symlink to /tmp/bar,
/mnt is mounted r/o and you are asking to open /mnt/foo for write.  We get
to /mnt, call open_last_lookups() to resolve the last component ("foo") in
it.  And find that the sucker happens to be an absolute symlink, so we need
to jump into root and resolve "tmp/bar" staring from there.  Which is what
the loop in the caller is about.  Eventually we'll get to /tmp and call
open_last_lookups() to resolve "bar" there.  /tmp needs to be mounted
writable; /mnt does not.

Sure, you bail out only in nonblock case, so normally the next time around
it'll go sanely.  But you are making the damn thing (and it's still too
convoluted, even after a lot of massage towards sanity) harder to reason
about.

> +	if (open_flag & O_CREAT) {
> +		if (nonblock) {
> +			if (!inode_trylock(dir->d_inode)) {
> +				dentry = ERR_PTR(-EAGAIN);
> +				goto drop_write;
> +			}
> +		} else {
> +			inode_lock(dir->d_inode);
> +		}
> +	} else {
> +		if (nonblock) {
> +			if (!inode_trylock_shared(dir->d_inode)) {
> +				dentry = ERR_PTR(-EAGAIN);
> +				goto drop_write;
> +			}
> +		} else {
> +			inode_lock_shared(dir->d_inode);
> +		}
> +	}
>  	dentry = lookup_open(nd, file, op, got_write);

... as well as more bloated, with no obvious benefits (take a look
at lookup_open()).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 17:25   ` Al Viro
@ 2020-12-12 17:47     ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 17:47 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, torvalds

On 12/12/20 10:25 AM, Al Viro wrote:
> On Sat, Dec 12, 2020 at 09:51:04AM -0700, Jens Axboe wrote:
> 
>>  	struct dentry *dentry;
>> @@ -3164,17 +3165,38 @@ static const char *open_last_lookups(struct nameidata *nd,
>>  	}
>>  
>>  	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
>> -		got_write = !mnt_want_write(nd->path.mnt);
>> +		if (nonblock) {
>> +			got_write = !mnt_want_write_trylock(nd->path.mnt);
>> +			if (!got_write)
>> +				return ERR_PTR(-EAGAIN);
>> +		} else {
>> +			got_write = !mnt_want_write(nd->path.mnt);
>> +		}
>>  		/*
>>  		 * do _not_ fail yet - we might not need that or fail with
>>  		 * a different error; let lookup_open() decide; we'll be
>>  		 * dropping this one anyway.
>>  		 */
> 
> Read the comment immediately after the place you are modifying.  Really.
> To elaborate: consider e.g. the case when /mnt/foo is a symlink to /tmp/bar,
> /mnt is mounted r/o and you are asking to open /mnt/foo for write.  We get
> to /mnt, call open_last_lookups() to resolve the last component ("foo") in
> it.  And find that the sucker happens to be an absolute symlink, so we need
> to jump into root and resolve "tmp/bar" staring from there.  Which is what
> the loop in the caller is about.  Eventually we'll get to /tmp and call
> open_last_lookups() to resolve "bar" there.  /tmp needs to be mounted
> writable; /mnt does not.
> 
> Sure, you bail out only in nonblock case, so normally the next time around
> it'll go sanely.  But you are making the damn thing (and it's still too
> convoluted, even after a lot of massage towards sanity) harder to reason
> about.

Thanks - I did read that comment, but reading your full explanation it's
starting to make sense. I'll have it follow the same paradigm for the
nonblocking side.

>> +	if (open_flag & O_CREAT) {
>> +		if (nonblock) {
>> +			if (!inode_trylock(dir->d_inode)) {
>> +				dentry = ERR_PTR(-EAGAIN);
>> +				goto drop_write;
>> +			}
>> +		} else {
>> +			inode_lock(dir->d_inode);
>> +		}
>> +	} else {
>> +		if (nonblock) {
>> +			if (!inode_trylock_shared(dir->d_inode)) {
>> +				dentry = ERR_PTR(-EAGAIN);
>> +				goto drop_write;
>> +			}
>> +		} else {
>> +			inode_lock_shared(dir->d_inode);
>> +		}
>> +	}
>>  	dentry = lookup_open(nd, file, op, got_write);
> 
> ... as well as more bloated, with no obvious benefits (take a look
> at lookup_open()).

Can you elaborate? It's hard not to make

if (nonblock) {
	trylock;
} else
	lock;
}

not look/feel bloated, as it tends to explode it like the above. Can
hide it in helpers, but really, the logic needs to look like that...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 16:51 ` [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open Jens Axboe
  2020-12-12 17:25   ` Al Viro
@ 2020-12-12 18:57   ` Linus Torvalds
  2020-12-12 21:25     ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2020-12-12 18:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, Al Viro

On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> We handle it for the path resolution itself, but we should also factor
> it in for open_last_lookups() and tmpfile open.

So I think this one is fundamentally wrong, for two reasons.

One is that "nonblock" shouldn't necessarily mean "take no locks at
all". That directory inode lock is very very different from "go down
to the filesystem to do IO". No other NONBLOCK thing has ever been "no
locks at all", they have all been about possibly long-term blocking.

The other this is that the mnt_want_write() check really smells no
different at all from any of the "some ->open functions might block".
Which you don't handle, and which again is entirely different from the
pathname resolution itself blocking.

So I don't think either of these cases are about LOOKUP_NONBLOCK, the
same way they aren't about LOOKUP_RCU.

The  mnt_want_write() in particular is much more about the kinds of
things we already check for in do_open().

In fact, looking at this patch, I think mnt_want_write() itself is
actually conceptually buggy, although it doesn't really matter: think
about device nodes etc. Opening a device node for writing doesn't
write to the filesystem that the device node is on.

Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
do with the open() wanting to write to the filesystem. We don't even
hold that lock after the open - we'll always drop it even for a
successful open.

Only O_CREAT | O_TRUNC should matter, since those are the ones that
cause writes as part of the *open*.

Again - I don't think that this is really a problem. I mention it more
as a "this shows how the code is _conceptually_ wrong", and how it's
not really about the pathname resolution any more.

In fact, I think that how we pass on that "got_write" to lookup_open()
is just another sign of how we do that mnt_want_write() in the wrong
place and at trhe wrong level. We shouldn't have been doing that
mnt_want_write() for writable oipens (that's a different thing), and
looking at lookup_open(), I think we should have waited with doing it
at all until we've checked if we even need it (ie O_CREAT on a file
that already exists does *not* need the mnt_want_write() at all, and
we'll see that when we get to that

        /* Negative dentry, just create the file */
        if (!dentry->d_inode && (open_flag & O_CREAT)) {

thing.

So I think this patch actually shows that we do things wrong in this
area, and I think the kinds of things it does are questionable as a
result. In particular, I'm not convinced that the directory semaphore
thing should be tied to LOOKUP_NONBLOCK, and I don't think the
mnt_want_write() logic is right at all.

The first one would be fixed by a separate flag.

The second one I think is more about "we are doing mnt_want_write() at
the wrong point, and if we move mnt_want_write() to the right place
we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same
way a truncate should just be an explicit fail, not a "trylock"
failure.

I also think we need to deal with the O_NONBLOCK kind of situation at
the actual ->open() time (ie the whole "oh, NFS wants to revalidate
caches due to open/close consistency, named pipes want to wait for
pairing, device nodes want to wait for device". Again, I think that's
separate from LOOKUP_NONBLOCK, though.

               Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 18:57   ` Linus Torvalds
@ 2020-12-12 21:25     ` Jens Axboe
  2020-12-12 22:03       ` Linus Torvalds
  2020-12-13 22:50       ` Dave Chinner
  0 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-12 21:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-fsdevel, Al Viro

On 12/12/20 11:57 AM, Linus Torvalds wrote:
> On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> We handle it for the path resolution itself, but we should also factor
>> it in for open_last_lookups() and tmpfile open.
> 
> So I think this one is fundamentally wrong, for two reasons.
> 
> One is that "nonblock" shouldn't necessarily mean "take no locks at
> all". That directory inode lock is very very different from "go down
> to the filesystem to do IO". No other NONBLOCK thing has ever been "no
> locks at all", they have all been about possibly long-term blocking.

Do we ever do long term IO _while_ holding the direcoty inode lock? If
we don't, then we can probably just ignore that side alltogether.

> The other this is that the mnt_want_write() check really smells no
> different at all from any of the "some ->open functions might block".
> Which you don't handle, and which again is entirely different from the
> pathname resolution itself blocking.
> 
> So I don't think either of these cases are about LOOKUP_NONBLOCK, the
> same way they aren't about LOOKUP_RCU.
> 
> The  mnt_want_write() in particular is much more about the kinds of
> things we already check for in do_open().
> 
> In fact, looking at this patch, I think mnt_want_write() itself is
> actually conceptually buggy, although it doesn't really matter: think
> about device nodes etc. Opening a device node for writing doesn't
> write to the filesystem that the device node is on.
> 
> Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
> do with the open() wanting to write to the filesystem. We don't even
> hold that lock after the open - we'll always drop it even for a
> successful open.
> 
> Only O_CREAT | O_TRUNC should matter, since those are the ones that
> cause writes as part of the *open*.
> 
> Again - I don't think that this is really a problem. I mention it more
> as a "this shows how the code is _conceptually_ wrong", and how it's
> not really about the pathname resolution any more.
> 
> In fact, I think that how we pass on that "got_write" to lookup_open()
> is just another sign of how we do that mnt_want_write() in the wrong
> place and at trhe wrong level. We shouldn't have been doing that
> mnt_want_write() for writable oipens (that's a different thing), and
> looking at lookup_open(), I think we should have waited with doing it
> at all until we've checked if we even need it (ie O_CREAT on a file
> that already exists does *not* need the mnt_want_write() at all, and
> we'll see that when we get to that
> 
>         /* Negative dentry, just create the file */
>         if (!dentry->d_inode && (open_flag & O_CREAT)) {
> 
> thing.
> 
> So I think this patch actually shows that we do things wrong in this
> area, and I think the kinds of things it does are questionable as a
> result. In particular, I'm not convinced that the directory semaphore
> thing should be tied to LOOKUP_NONBLOCK, and I don't think the
> mnt_want_write() logic is right at all.
> 
> The first one would be fixed by a separate flag.
> 
> The second one I think is more about "we are doing mnt_want_write() at
> the wrong point, and if we move mnt_want_write() to the right place
> we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same
> way a truncate should just be an explicit fail, not a "trylock"
> failure.

I'm going to let Al comment on the mnt_want_write() logic. It did feel
iffy to me, and the error handling (and passing of it) around makes it
hard to reason about. Would likely make this change more straight
forward if we sort out that first.

I do agree that we should keep the two things separate - and potentially
just have the RESOLVE_NONBLOCK be RESOLVE_CACHE (or something like that)
and not deal with the locking / want write side of things at all. For
io_uring, we really do want to ensure that we don't stall the submission
pipeline by potentially being stuck on the directory lock or waiting for
a frozen file system.

And I don't think we can get around having two flags at that point -
probably by renaming the initial LOOKUP_NONBLOCK to LOOKUP_CACHE, and by
having a LOOKUP_NONBLOCK which has a slightly wider scope.

> I also think we need to deal with the O_NONBLOCK kind of situation at
> the actual ->open() time (ie the whole "oh, NFS wants to revalidate
> caches due to open/close consistency, named pipes want to wait for
> pairing, device nodes want to wait for device". Again, I think that's
> separate from LOOKUP_NONBLOCK, though.

Agree, that's still not done in this patch. I did think about it, and
the only potential idea I had around that was to wrap the actual open in
setting/clearing O_NDELAY/O_NONBLOCK around the open. Which _should_
work as long as it happens before fd_install() is done, but doesn't feel
that architecturally clean.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 21:25     ` Jens Axboe
@ 2020-12-12 22:03       ` Linus Torvalds
  2020-12-13 22:50       ` Dave Chinner
  1 sibling, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2020-12-12 22:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-fsdevel, Al Viro

On Sat, Dec 12, 2020 at 1:25 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Do we ever do long term IO _while_ holding the direcoty inode lock? If
> we don't, then we can probably just ignore that side alltogether.

The inode lock is all kinds of messy. Part of it is that we have these
helper functions for taking it ("inode_lock_shared()" and friends).
Part of it is that some codepaths do *not* use those helpers and use
"inode->i_rwsem" directly. And part of it is that our comments
sometimes talk about the old name ("i_mutex").

The inode lock *can* be a problem. The historical problem point is
actually readdir(), which takes the lock for reading, but does so over
not just IO but also the user space accesses.

That used to be a huge problem when it was a mutex, not an rwlock. But
I think it can still be a problem for (a) filesystems that haven't
been converted to 'iterate_shared' or (b) if a slow readdir has the
lock, and a O_CREAT comes in, then new readers will block too.

Honestly, the inode lock is nasty and broken. It's made worse by the
fact that it really doesn't have great semantics: filesystems use it
randomly for internal "lock this inode" too.

A lot of inode lock users don't actually do any IO at all. The
messiness of that lock comes literally from the fact that it was this
random per-inode lock that just grew a lot of random uses. Many of
them aren't particularly relevant for directories, though.

It's one of my least favorite locks in the kernel, but practically
speaking it seldom causes problems.

But if you haven't figured out the pattern by now, let's just say that
"it's completely random".

It would be interesting to see if it causes actual problems. Because
maybe that could push us towards fixing some of them.

               Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-12 21:25     ` Jens Axboe
  2020-12-12 22:03       ` Linus Torvalds
@ 2020-12-13 22:50       ` Dave Chinner
  2020-12-14  0:45         ` Linus Torvalds
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2020-12-13 22:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, linux-fsdevel, Al Viro

On Sat, Dec 12, 2020 at 02:25:00PM -0700, Jens Axboe wrote:
> On 12/12/20 11:57 AM, Linus Torvalds wrote:
> > On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> We handle it for the path resolution itself, but we should also factor
> >> it in for open_last_lookups() and tmpfile open.
> > 
> > So I think this one is fundamentally wrong, for two reasons.
> > 
> > One is that "nonblock" shouldn't necessarily mean "take no locks at
> > all". That directory inode lock is very very different from "go down
> > to the filesystem to do IO". No other NONBLOCK thing has ever been "no
> > locks at all", they have all been about possibly long-term blocking.
> 
> Do we ever do long term IO _while_ holding the direcoty inode lock? If
> we don't, then we can probably just ignore that side alltogether.

Yes - "all the time" is the simple answer.

Readdir is a simple example, but that is just extent mapping tree
and directory block IO you'd have to wait for, just like regular
file IO.

The big problem is that modifications to directories are atomic and
transactional in most filesystems, which means we might block a
create/unlink/attr/etc in a transaction start for an indefinite
amount of time while we wait for metadata writeback to free up
journal/reservation space. And while we are doing this, nothing else
can access the directory because the VFS holds the directory inode
lock....

We also have metadata IO within transactions, but in most
journalling filesystems once we've started the transaction we can't
back out and return -EAGAIN. So once we are in a transaction
context, the filesystem will block as necessary to run the operation
to completion.

So, really, at the filesystem level I don't see much value in trying
to push non-blocking directory modifications down to the filesystem.
The commonly used filesystems will mostly have to return -EAGAIN
immediately without being able to do anything at all because they
simply aren't architected with the modification rollback
capabilities needed to run fully non-blocking transactional
modification operations.

> > Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
> > do with the open() wanting to write to the filesystem. We don't even
> > hold that lock after the open - we'll always drop it even for a
> > successful open.
> > 
> > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > cause writes as part of the *open*.

And __O_TMPFILE, which is the same as O_CREAT.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-13 22:50       ` Dave Chinner
@ 2020-12-14  0:45         ` Linus Torvalds
  2020-12-14  1:52           ` Dave Chinner
  2020-12-14 17:43           ` Jens Axboe
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2020-12-14  0:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jens Axboe, linux-fsdevel, Al Viro

On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > > cause writes as part of the *open*.
>
> And __O_TMPFILE, which is the same as O_CREAT.

This made me go look at the code, but we seem to be ok here -
__O_TMPFILE should never get to the do_open() logic at all, because it
gets caught before that and does off to do_tmpfile() and then
vfs_tmpfile() instead.

And then it's up to the filesystem to do the inode locking if it needs
to - it has a separate i_io->tempfile function for that.

From a LOOKUP_NONBLOCK standpoint, I think we should just disallow
O_TMPFILE the same way Jens disallowed O_TRUNCATE.

Otherwise we'd have to teach filesystems about it.

Which might be an option long-term of course, but I don't think it
makes sense for any initial patch-set: the real "we can do this with
no locks and quickly" case is just opening an existing file entirely
using the dcache.

Creating new files isn't exactly _uncommon_, of course, but it's
probably still two orders of magnitude less common than just opening
an existing file. Wild handwaving.

So I suspect O_CREAT simply isn't really interesting either. Sure, the
"it already exists" case could potentially be done without any locking
or anything like that, but again, I don't think that case is worth
optimizing for: O_CREAT probably just isn't common enough.

[ I don't have tons of hard data to back that handwaving argument
based on my gut feel up, but I did do a trace of a kernel build just
because that's my "default load" and out of 250 thousand openat()
calls, only 560 had O_CREAT and/or O_TRUNC. But while that's _my_
default load, it's obviously not necessarily very representative of
anything else. The "almost three orders of magnitude more regular
opens" doesn't _surprise_ me though ]

So I really think the right model is not to worry about trylock for
the inode lock at all, but to simply just always fail with EAGAIN in
that case - and not do LOOKUP_NONBLOCK for O_CREAT at all.

The normal fast-path case in open_last_lookups() is the "goto
finish_lookup" case in this sequence:

        if (!(open_flag & O_CREAT)) {
                if (nd->last.name[nd->last.len])
                        nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
                /* we _can_ be in RCU mode here */
                dentry = lookup_fast(nd, &inode, &seq);
                if (IS_ERR(dentry))
                        return ERR_CAST(dentry);
                if (likely(dentry))
                        goto finish_lookup;

and we never get to the inode locking sequence at all.

So just adding a

        if (nd->flags & LOOKUP_NONBLOCK)
                return -EAGAIN;

there is probably the right thing - rather than worry about trylock on
the inode lock.

Because if you've missed in the dcache, you've by definition lost the fast-path.

That said, numbers talk, BS walks, and maybe somebody has better loads
with better numbers. I assume Jens has some internal FB io_uring loads
and could do stats on what kinds of pathname opens matter there...

          Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-14  0:45         ` Linus Torvalds
@ 2020-12-14  1:52           ` Dave Chinner
  2020-12-14 18:06             ` Linus Torvalds
  2020-12-14 17:43           ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2020-12-14  1:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, linux-fsdevel, Al Viro

On Sun, Dec 13, 2020 at 04:45:39PM -0800, Linus Torvalds wrote:
> On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > > > cause writes as part of the *open*.
> >
> > And __O_TMPFILE, which is the same as O_CREAT.
> 
> This made me go look at the code, but we seem to be ok here -
> __O_TMPFILE should never get to the do_open() logic at all, because it
> gets caught before that and does off to do_tmpfile() and then
> vfs_tmpfile() instead.
> 
> And then it's up to the filesystem to do the inode locking if it needs
> to - it has a separate i_io->tempfile function for that.

Sure, and then it blocks. Guaranteed, for the same reasons that
O_CREAT will block when calling ->create() after the path lookup:
the filesystem runs a transaction to allocate an inode and track it
on the orphan list so that it gets cleaned up by recovery after a
crash while the tmpfile is still open.

So it doesn't matter if the lookup is non-blocking, the tmpfile
creation is guaranteed to block for the same reason O_CREAT and
O_TRUNCATE will block....

> From a LOOKUP_NONBLOCK standpoint, I think we should just disallow
> O_TMPFILE the same way Jens disallowed O_TRUNCATE.

*nod*

I just don't think it makes sense to try to make any of the
filesystem level stuff open() might do non-blocking. The moment we
start a filesystem modification, we have to be able to block because
it is the only way to guarantee forwards progress. So if we know we
are going to call into the filesystem to make a modification if the
pathwalk is successful, why even bother starting the pathwalk?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-14  0:45         ` Linus Torvalds
  2020-12-14  1:52           ` Dave Chinner
@ 2020-12-14 17:43           ` Jens Axboe
  1 sibling, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-12-14 17:43 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner; +Cc: linux-fsdevel, Al Viro

On 12/13/20 5:45 PM, Linus Torvalds wrote:
> On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
>>>>
>>>> Only O_CREAT | O_TRUNC should matter, since those are the ones that
>>>> cause writes as part of the *open*.
>>
>> And __O_TMPFILE, which is the same as O_CREAT.
> 
> This made me go look at the code, but we seem to be ok here -
> __O_TMPFILE should never get to the do_open() logic at all, because it
> gets caught before that and does off to do_tmpfile() and then
> vfs_tmpfile() instead.
> 
> And then it's up to the filesystem to do the inode locking if it needs
> to - it has a separate i_io->tempfile function for that.
> 
> From a LOOKUP_NONBLOCK standpoint, I think we should just disallow
> O_TMPFILE the same way Jens disallowed O_TRUNCATE.
> 
> Otherwise we'd have to teach filesystems about it.

Good point, let's avoid that for now... More below.

> Which might be an option long-term of course, but I don't think it
> makes sense for any initial patch-set: the real "we can do this with
> no locks and quickly" case is just opening an existing file entirely
> using the dcache.
> 
> Creating new files isn't exactly _uncommon_, of course, but it's
> probably still two orders of magnitude less common than just opening
> an existing file. Wild handwaving.

Obviously depends on the workload, but there are plenty of interesting
workloads that are not create intensive at all and just want a fast way
to open an existing file.

> So I suspect O_CREAT simply isn't really interesting either. Sure, the
> "it already exists" case could potentially be done without any locking
> or anything like that, but again, I don't think that case is worth
> optimizing for: O_CREAT probably just isn't common enough.
> 
> [ I don't have tons of hard data to back that handwaving argument
> based on my gut feel up, but I did do a trace of a kernel build just
> because that's my "default load" and out of 250 thousand openat()
> calls, only 560 had O_CREAT and/or O_TRUNC. But while that's _my_
> default load, it's obviously not necessarily very representative of
> anything else. The "almost three orders of magnitude more regular
> opens" doesn't _surprise_ me though ]
> 
> So I really think the right model is not to worry about trylock for
> the inode lock at all, but to simply just always fail with EAGAIN in
> that case - and not do LOOKUP_NONBLOCK for O_CREAT at all.
> 
> The normal fast-path case in open_last_lookups() is the "goto
> finish_lookup" case in this sequence:
> 
>         if (!(open_flag & O_CREAT)) {
>                 if (nd->last.name[nd->last.len])
>                         nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
>                 /* we _can_ be in RCU mode here */
>                 dentry = lookup_fast(nd, &inode, &seq);
>                 if (IS_ERR(dentry))
>                         return ERR_CAST(dentry);
>                 if (likely(dentry))
>                         goto finish_lookup;
> 
> and we never get to the inode locking sequence at all.
> 
> So just adding a
> 
>         if (nd->flags & LOOKUP_NONBLOCK)
>                 return -EAGAIN;
> 
> there is probably the right thing - rather than worry about trylock on
> the inode lock.

I really like that, as it also means we can safely drop the
mnt_want_write() path and just safe that for a separate cleanup that's
not related to this at all. With that, I'd suggest folding the two
patches into one as we no longer need followup work. It also means we
can drop the concept of having two flags for this, LOOKUP_NONBLOCK
covers the entire case of the existing fast path.

Might still make sense to name it LOOKUP_CACHE instead, but I'll leave
that for you/Al to decide...

> Because if you've missed in the dcache, you've by definition lost the fast-path.
> 
> That said, numbers talk, BS walks, and maybe somebody has better loads
> with better numbers. I assume Jens has some internal FB io_uring loads
> and could do stats on what kinds of pathname opens matter there...

Definitely, was already planning on checking that. And just as
importantly, once we have this, it's pretty trivial to do the same thing
on the stat side as well, which makes me excited for that change too.
io_uring currently punts that one too, and that's arguably even more
interesting than open if we can avoid that for the fast case.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open
  2020-12-14  1:52           ` Dave Chinner
@ 2020-12-14 18:06             ` Linus Torvalds
  0 siblings, 0 replies; 16+ messages in thread
From: Linus Torvalds @ 2020-12-14 18:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jens Axboe, linux-fsdevel, Al Viro

On Sun, Dec 13, 2020 at 5:52 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sun, Dec 13, 2020 at 04:45:39PM -0800, Linus Torvalds wrote:
> > On Sun, Dec 13, 2020 at 2:50 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > > > > cause writes as part of the *open*.
> > >
> > > And __O_TMPFILE, which is the same as O_CREAT.
> >
> > This made me go look at the code, but we seem to be ok here -
> > __O_TMPFILE should never get to the do_open() logic at all, because it
> > gets caught before that and does off to do_tmpfile() and then
> > vfs_tmpfile() instead.
> >
> > And then it's up to the filesystem to do the inode locking if it needs
> > to - it has a separate i_io->tempfile function for that.
>
> Sure, and then it blocks.

Yes. I was more just double-checking that currently really odd

        if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {

condition that didn't make sense to me (it basically does two
different kinds of writablity checks). So it was more that you pointed
out that __O_TMPFILE was also missing from that odd condition, and
that turns out to be because it was handled separately.

So no disagreement about __O_TMPFILE being a "not a cached operation"
- purely a "that condition is odd".

It was just that O_WRONLY | O_RDWR didn't make tons of sense to me,
since we then get the write count only to then drop it immediately
immediately without having actually done any writes.

But I guess they are there only as a "even if we don't write to the
filesystem right now, we do want to get the EROFS error return from
open(), rather than at write() time".

I think technically it shouldn't need to do the pointless "synchronize
and increment writers only to decrement them again" dance, and could
just do a "mnt_is_readonly()" test for the plain "open writably, but
without O_CREAT/O_TRUNC" case.

But I guess there's no real downside to doing it the way we're doing
it - it just looked odd to me when I was looking at it in the context
of just pathname lookup.

In do_faccessat() we have that bare __mnt_is_readonly() for the "I
want EROFS, but I'm not actually writing now" case.

            Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-12-14 18:14 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-12 16:51 [PATCHSET RFC v2 0/5] fs: Support for LOOKUP_NONBLOCK / RESOLVE_NONBLOCK Jens Axboe
2020-12-12 16:51 ` [PATCH 1/5] fs: make unlazy_walk() error handling consistent Jens Axboe
2020-12-12 16:51 ` [PATCH 2/5] fs: add support for LOOKUP_NONBLOCK Jens Axboe
2020-12-12 16:51 ` [PATCH 3/5] fs: add mnt_want_write_trylock() Jens Axboe
2020-12-12 16:51 ` [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open Jens Axboe
2020-12-12 17:25   ` Al Viro
2020-12-12 17:47     ` Jens Axboe
2020-12-12 18:57   ` Linus Torvalds
2020-12-12 21:25     ` Jens Axboe
2020-12-12 22:03       ` Linus Torvalds
2020-12-13 22:50       ` Dave Chinner
2020-12-14  0:45         ` Linus Torvalds
2020-12-14  1:52           ` Dave Chinner
2020-12-14 18:06             ` Linus Torvalds
2020-12-14 17:43           ` Jens Axboe
2020-12-12 16:51 ` [PATCH 5/5] fs: expose LOOKUP_NONBLOCK through openat2() RESOLVE_NONBLOCK Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.