linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 00/11] io_uring getdents
@ 2023-08-27 13:28 Hao Xu
  2023-08-27 13:28 ` [PATCH 01/11] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
                   ` (11 more replies)
  0 siblings, 12 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

This series introduce getdents64 to io_uring, the code logic is similar
with the snychronized version's. It first try nowait issue, and offload
it to io-wq threads if the first try fails.

Patch1 and Patch2 are some preparation
Patch3 supports nowait for xfs getdents code
Patch4-11 are vfs change, include adding helpers and trylock for locks

Tests I've done:
A liburing test case for functional test:
https://github.com/HowHsu/liburing/commit/39dc9a8e19c06a8cebf8c2301b85320eb45c061e?diff=unified

Tested it with a liburing performance test:
https://github.com/HowHsu/liburing/blob/getdents/test/getdents2.c

The test is controlled by the below script[2] which runs getdents2.t 100
times and calulate the avg.
The result show that io_uring version is about 2.6% faster:

note:
[1] the number of getdents call/request in io_uring and normal sync version
are made sure to be same beforehand.

[2] run_getdents.py

```python3

import subprocess

N = 100
sum = 0.0
args = ["/data/home/howeyxu/tmpdir", "sync"]

for i in range(N):
    output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
    sum += float(output)

average = sum / N
print("Average of sync:", average)

sum = 0.0
args = ["/data/home/howeyxu/tmpdir", "iouring"]

for i in range(N):
    output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
    sum += float(output)

average = sum / N
print("Average of iouring:", average)

```

v5->v6:
 - remove xfs journal stuff since there are fundamental issues in the
   design.

v4->v5:
 - move atime update to the beginning of getdents operation
 - trylock for i_rwsem
 - nowait semantics for involved xfs journal stuff

v3->v4:
 - add Dave's xfs nowait code and fix a deadlock problem, with some code
   style tweak.
 - disable fixed file to avoid a race problem for now
 - add a test program.

v2->v3:
 - removed the kernfs patches
 - add f_pos_lock logic
 - remove the "reduce last EOF getdents try" optimization since
   Dominique reports that doesn't make difference
 - remove the rewind logic, I think the right way is to introduce lseek
   to io_uring not to patch this logic to getdents.
 - add Singed-off-by of Stefan Roesch for patch 1 since checkpatch
   complained that Co-developed-by someone should be accompanied with
   Signed-off-by same person, I can remove them if Stefan thinks that's
   not proper.


Dominique Martinet (1):
  fs: split off vfs_getdents function of getdents64 syscall

Hao Xu (10):
  xfs: add NOWAIT semantics for readdir
  vfs: add nowait flag for struct dir_context
  vfs: add a vfs helper for io_uring file pos lock
  vfs: add file_pos_unlock() for io_uring usage
  vfs: add a nowait parameter for touch_atime()
  vfs: add nowait parameter for file_accessed()
  vfs: move file_accessed() to the beginning of iterate_dir()
  vfs: error out -EAGAIN if atime needs to be updated
  vfs: trylock inode->i_rwsem in iterate_dir() to support nowait
  io_uring: add support for getdents

 arch/s390/hypfs/inode.c        |  2 +-
 block/fops.c                   |  2 +-
 fs/btrfs/file.c                |  2 +-
 fs/btrfs/inode.c               |  2 +-
 fs/cachefiles/namei.c          |  2 +-
 fs/coda/dir.c                  |  4 +--
 fs/ecryptfs/file.c             |  4 +--
 fs/ext2/file.c                 |  4 +--
 fs/ext4/file.c                 |  6 ++--
 fs/f2fs/file.c                 |  4 +--
 fs/file.c                      | 13 ++++++++
 fs/fuse/dax.c                  |  2 +-
 fs/fuse/file.c                 |  4 +--
 fs/gfs2/file.c                 |  2 +-
 fs/hugetlbfs/inode.c           |  2 +-
 fs/inode.c                     | 10 ++++--
 fs/internal.h                  |  8 +++++
 fs/namei.c                     |  4 +--
 fs/nfsd/vfs.c                  |  2 +-
 fs/nilfs2/file.c               |  2 +-
 fs/orangefs/file.c             |  2 +-
 fs/orangefs/inode.c            |  2 +-
 fs/overlayfs/file.c            |  2 +-
 fs/overlayfs/inode.c           |  2 +-
 fs/pipe.c                      |  2 +-
 fs/ramfs/file-nommu.c          |  2 +-
 fs/readdir.c                   | 61 ++++++++++++++++++++++++++--------
 fs/smb/client/cifsfs.c         |  2 +-
 fs/splice.c                    |  2 +-
 fs/stat.c                      |  2 +-
 fs/ubifs/file.c                |  2 +-
 fs/udf/file.c                  |  2 +-
 fs/xfs/libxfs/xfs_da_btree.c   | 16 +++++++++
 fs/xfs/libxfs/xfs_da_btree.h   |  1 +
 fs/xfs/libxfs/xfs_dir2_block.c |  7 ++--
 fs/xfs/libxfs/xfs_dir2_priv.h  |  2 +-
 fs/xfs/scrub/dir.c             |  2 +-
 fs/xfs/scrub/readdir.c         |  2 +-
 fs/xfs/xfs_dir2_readdir.c      | 49 +++++++++++++++++++++------
 fs/xfs/xfs_file.c              |  6 ++--
 fs/xfs/xfs_inode.c             | 27 +++++++++++++++
 fs/xfs/xfs_inode.h             | 17 ++++++----
 fs/zonefs/file.c               |  4 +--
 include/linux/file.h           |  7 ++++
 include/linux/fs.h             | 15 +++++++--
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/fs.c                  | 53 +++++++++++++++++++++++++++++
 io_uring/fs.h                  |  3 ++
 io_uring/opdef.c               |  8 +++++
 kernel/bpf/inode.c             |  4 +--
 mm/filemap.c                   |  8 ++---
 mm/shmem.c                     |  6 ++--
 net/unix/af_unix.c             |  4 +--
 53 files changed, 310 insertions(+), 96 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/11] fs: split off vfs_getdents function of getdents64 syscall
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 13:28 ` [PATCH 02/11] xfs: add NOWAIT semantics for readdir Hao Xu
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Dominique Martinet <asmadeus@codewreck.org>

This splits off the vfs_getdents function from the getdents64 system
call.
This will allow io_uring to call the vfs_getdents function.

Co-developed-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/internal.h |  8 ++++++++
 fs/readdir.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index f7a3dc111026..b1f66e52d61b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -304,3 +304,11 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+
+/*
+ * fs/readdir.c
+ */
+struct linux_dirent64;
+
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count);
diff --git a/fs/readdir.c b/fs/readdir.c
index b264ce60114d..9592259b7e7f 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,6 +21,7 @@
 #include <linux/unistd.h>
 #include <linux/compat.h>
 #include <linux/uaccess.h>
+#include "internal.h"
 
 #include <asm/unaligned.h>
 
@@ -351,10 +352,16 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
 	return false;
 }
 
-SYSCALL_DEFINE3(getdents64, unsigned int, fd,
-		struct linux_dirent64 __user *, dirent, unsigned int, count)
+
+/**
+ * vfs_getdents - getdents without fdget
+ * @file    : pointer to file struct of directory
+ * @dirent  : pointer to user directory structure
+ * @count   : size of buffer
+ */
+int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
+		 unsigned int count)
 {
-	struct fd f;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
 		.count = count,
@@ -362,11 +369,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	};
 	int error;
 
-	f = fdget_pos(fd);
-	if (!f.file)
-		return -EBADF;
-
-	error = iterate_dir(f.file, &buf.ctx);
+	error = iterate_dir(file, &buf.ctx);
 	if (error >= 0)
 		error = buf.error;
 	if (buf.prev_reclen) {
@@ -379,6 +382,21 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 		else
 			error = count - buf.count;
 	}
+	return error;
+}
+
+SYSCALL_DEFINE3(getdents64, unsigned int, fd,
+		struct linux_dirent64 __user *, dirent, unsigned int, count)
+{
+	struct fd f;
+	int error;
+
+	f = fdget_pos(fd);
+	if (!f.file)
+		return -EBADF;
+
+	error = vfs_getdents(f.file, dirent, count);
+
 	fdput_pos(f);
 	return error;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 02/11] xfs: add NOWAIT semantics for readdir
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
  2023-08-27 13:28 ` [PATCH 01/11] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 20:44   ` Matthew Wilcox
  2023-09-04  1:02   ` Dave Chinner
  2023-08-27 13:28 ` [PATCH 03/11] vfs: add nowait flag for struct dir_context Hao Xu
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Implement NOWAIT semantics for readdir. Return EAGAIN error to the
caller if it would block, like failing to get locks, or going to
do IO.

Co-developed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Hao Xu <howeyxu@tencent.com>
[fixes deadlock issue, tweak code style]
---
 fs/xfs/libxfs/xfs_da_btree.c   | 16 +++++++++++
 fs/xfs/libxfs/xfs_da_btree.h   |  1 +
 fs/xfs/libxfs/xfs_dir2_block.c |  7 ++---
 fs/xfs/libxfs/xfs_dir2_priv.h  |  2 +-
 fs/xfs/scrub/dir.c             |  2 +-
 fs/xfs/scrub/readdir.c         |  2 +-
 fs/xfs/xfs_dir2_readdir.c      | 49 ++++++++++++++++++++++++++--------
 fs/xfs/xfs_inode.c             | 27 +++++++++++++++++++
 fs/xfs/xfs_inode.h             | 17 +++++++-----
 9 files changed, 99 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index e576560b46e9..7a1a0af24197 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2643,16 +2643,32 @@ xfs_da_read_buf(
 	struct xfs_buf_map	map, *mapp = &map;
 	int			nmap = 1;
 	int			error;
+	int			buf_flags = 0;
 
 	*bpp = NULL;
 	error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
 	if (error || !nmap)
 		goto out_free;
 
+	/*
+	 * NOWAIT semantics mean we don't wait on the buffer lock nor do we
+	 * issue IO for this buffer if it is not already in memory. Caller will
+	 * retry. This will return -EAGAIN if the buffer is in memory and cannot
+	 * be locked, and no buffer and no error if it isn't in memory.  We
+	 * translate both of those into a return state of -EAGAIN and *bpp =
+	 * NULL.
+	 */
+	if (flags & XFS_DABUF_NOWAIT)
+		buf_flags |= XBF_TRYLOCK | XBF_INCORE;
 	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
 			&bp, ops);
 	if (error)
 		goto out_free;
+	if (!bp) {
+		ASSERT(flags & XFS_DABUF_NOWAIT);
+		error = -EAGAIN;
+		goto out_free;
+	}
 
 	if (whichfork == XFS_ATTR_FORK)
 		xfs_buf_set_ref(bp, XFS_ATTR_BTREE_REF);
diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h
index ffa3df5b2893..32e7b1cca402 100644
--- a/fs/xfs/libxfs/xfs_da_btree.h
+++ b/fs/xfs/libxfs/xfs_da_btree.h
@@ -205,6 +205,7 @@ int	xfs_da3_node_read_mapped(struct xfs_trans *tp, struct xfs_inode *dp,
  */
 
 #define XFS_DABUF_MAP_HOLE_OK	(1u << 0)
+#define XFS_DABUF_NOWAIT	(1u << 1)
 
 int	xfs_da_grow_inode(xfs_da_args_t *args, xfs_dablk_t *new_blkno);
 int	xfs_da_grow_inode_int(struct xfs_da_args *args, xfs_fileoff_t *bno,
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c
index 00f960a703b2..59b24a594add 100644
--- a/fs/xfs/libxfs/xfs_dir2_block.c
+++ b/fs/xfs/libxfs/xfs_dir2_block.c
@@ -135,13 +135,14 @@ int
 xfs_dir3_block_read(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*dp,
+	unsigned int		flags,
 	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dp->i_mount;
 	xfs_failaddr_t		fa;
 	int			err;
 
-	err = xfs_da_read_buf(tp, dp, mp->m_dir_geo->datablk, 0, bpp,
+	err = xfs_da_read_buf(tp, dp, mp->m_dir_geo->datablk, flags, bpp,
 				XFS_DATA_FORK, &xfs_dir3_block_buf_ops);
 	if (err || !*bpp)
 		return err;
@@ -380,7 +381,7 @@ xfs_dir2_block_addname(
 	tp = args->trans;
 
 	/* Read the (one and only) directory block into bp. */
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, 0, &bp);
 	if (error)
 		return error;
 
@@ -695,7 +696,7 @@ xfs_dir2_block_lookup_int(
 	dp = args->dp;
 	tp = args->trans;
 
-	error = xfs_dir3_block_read(tp, dp, &bp);
+	error = xfs_dir3_block_read(tp, dp, 0, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h
index 7404a9ff1a92..7d4cf8a0f15b 100644
--- a/fs/xfs/libxfs/xfs_dir2_priv.h
+++ b/fs/xfs/libxfs/xfs_dir2_priv.h
@@ -51,7 +51,7 @@ extern int xfs_dir_cilookup_result(struct xfs_da_args *args,
 
 /* xfs_dir2_block.c */
 extern int xfs_dir3_block_read(struct xfs_trans *tp, struct xfs_inode *dp,
-			       struct xfs_buf **bpp);
+			       unsigned int flags, struct xfs_buf **bpp);
 extern int xfs_dir2_block_addname(struct xfs_da_args *args);
 extern int xfs_dir2_block_lookup(struct xfs_da_args *args);
 extern int xfs_dir2_block_removename(struct xfs_da_args *args);
diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c
index 0b491784b759..5cc51f201bd7 100644
--- a/fs/xfs/scrub/dir.c
+++ b/fs/xfs/scrub/dir.c
@@ -313,7 +313,7 @@ xchk_directory_data_bestfree(
 		/* dir block format */
 		if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET))
 			xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk);
-		error = xfs_dir3_block_read(sc->tp, sc->ip, &bp);
+		error = xfs_dir3_block_read(sc->tp, sc->ip, 0, &bp);
 	} else {
 		/* dir data format */
 		error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, 0, &bp);
diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c
index e51c1544be63..f0a727311632 100644
--- a/fs/xfs/scrub/readdir.c
+++ b/fs/xfs/scrub/readdir.c
@@ -101,7 +101,7 @@ xchk_dir_walk_block(
 	unsigned int		off, next_off, end;
 	int			error;
 
-	error = xfs_dir3_block_read(sc->tp, dp, &bp);
+	error = xfs_dir3_block_read(sc->tp, dp, 0, &bp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 9f3ceb461515..dcdbd26e0402 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -149,6 +149,7 @@ xfs_dir2_block_getdents(
 	struct xfs_da_geometry	*geo = args->geo;
 	unsigned int		offset, next_offset;
 	unsigned int		end;
+	unsigned int		flags = 0;
 
 	/*
 	 * If the block number in the offset is out of range, we're done.
@@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
 	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
 		return 0;
 
-	error = xfs_dir3_block_read(args->trans, dp, &bp);
+	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
+		flags |= XFS_DABUF_NOWAIT;
+	error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
 	if (error)
 		return error;
 
@@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
 STATIC int
 xfs_dir2_leaf_readbuf(
 	struct xfs_da_args	*args,
+	struct dir_context	*ctx,
 	size_t			bufsize,
 	xfs_dir2_off_t		*cur_off,
 	xfs_dablk_t		*ra_blk,
@@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
 	struct xfs_iext_cursor	icur;
 	int			ra_want;
 	int			error = 0;
-
-	error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
-	if (error)
-		goto out;
+	unsigned int		flags = 0;
+
+	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
+		flags |= XFS_DABUF_NOWAIT;
+	} else {
+		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
+		if (error)
+			goto out;
+	}
 
 	/*
 	 * Look for mapped directory blocks at or above the current offset.
@@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
 	new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
 	if (new_off > *cur_off)
 		*cur_off = new_off;
-	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
+	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, &bp);
 	if (error)
 		goto out;
 
@@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents(
 	int			byteoff;	/* offset in current block */
 	unsigned int		offset = 0;
 	int			error = 0;	/* error return value */
+	int			written = 0;
 
 	/*
 	 * If the offset is at or past the largest allowed value,
@@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
 				bp = NULL;
 			}
 
-			if (*lock_mode == 0)
-				*lock_mode = xfs_ilock_data_map_shared(dp);
-			error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff,
-					&rablk, &bp);
+			if (*lock_mode == 0) {
+				*lock_mode =
+					xfs_ilock_data_map_shared_generic(dp,
+					ctx->flags & DIR_CONTEXT_F_NOWAIT);
+				if (!*lock_mode) {
+					error = -EAGAIN;
+					break;
+				}
+			}
+			error = xfs_dir2_leaf_readbuf(args, ctx, bufsize,
+					&curoff, &rablk, &bp);
 			if (error || !bp)
 				break;
 
@@ -479,6 +496,7 @@ xfs_dir2_leaf_getdents(
 		 */
 		offset += length;
 		curoff += length;
+		written += length;
 		/* bufsize may have just been a guess; don't go negative */
 		bufsize = bufsize > length ? bufsize - length : 0;
 	}
@@ -492,6 +510,8 @@ xfs_dir2_leaf_getdents(
 		ctx->pos = xfs_dir2_byte_to_dataptr(curoff) & 0x7fffffff;
 	if (bp)
 		xfs_trans_brelse(args->trans, bp);
+	if (error == -EAGAIN && written > 0)
+		error = 0;
 	return error;
 }
 
@@ -514,6 +534,7 @@ xfs_readdir(
 	unsigned int		lock_mode;
 	bool			isblock;
 	int			error;
+	bool			nowait;
 
 	trace_xfs_readdir(dp);
 
@@ -531,7 +552,11 @@ xfs_readdir(
 	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
 		return xfs_dir2_sf_getdents(&args, ctx);
 
-	lock_mode = xfs_ilock_data_map_shared(dp);
+	nowait = ctx->flags & DIR_CONTEXT_F_NOWAIT;
+	lock_mode = xfs_ilock_data_map_shared_generic(dp, nowait);
+	if (!lock_mode)
+		return -EAGAIN;
+
 	error = xfs_dir2_isblock(&args, &isblock);
 	if (error)
 		goto out_unlock;
@@ -546,5 +571,7 @@ xfs_readdir(
 out_unlock:
 	if (lock_mode)
 		xfs_iunlock(dp, lock_mode);
+	if (error == -EAGAIN)
+		ASSERT(nowait);
 	return error;
 }
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 9e62cc500140..d088f7d0c23a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -120,6 +120,33 @@ xfs_ilock_data_map_shared(
 	return lock_mode;
 }
 
+/*
+ * Similar to xfs_ilock_data_map_shared(), except that it will only try to lock
+ * the inode in shared mode if the extents are already in memory. If it fails to
+ * get the lock or has to do IO to read the extent list, fail the operation by
+ * returning 0 as the lock mode.
+ */
+uint
+xfs_ilock_data_map_shared_nowait(
+	struct xfs_inode	*ip)
+{
+	if (xfs_need_iread_extents(&ip->i_df))
+		return 0;
+	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
+		return 0;
+	return XFS_ILOCK_SHARED;
+}
+
+int
+xfs_ilock_data_map_shared_generic(
+	struct xfs_inode	*dp,
+	bool			nowait)
+{
+	if (nowait)
+		return xfs_ilock_data_map_shared_nowait(dp);
+	return xfs_ilock_data_map_shared(dp);
+}
+
 uint
 xfs_ilock_attr_map_shared(
 	struct xfs_inode	*ip)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7547caf2f2ab..ea206a5a27df 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -490,13 +490,16 @@ int		xfs_rename(struct mnt_idmap *idmap,
 			   struct xfs_name *target_name,
 			   struct xfs_inode *target_ip, unsigned int flags);
 
-void		xfs_ilock(xfs_inode_t *, uint);
-int		xfs_ilock_nowait(xfs_inode_t *, uint);
-void		xfs_iunlock(xfs_inode_t *, uint);
-void		xfs_ilock_demote(xfs_inode_t *, uint);
-bool		xfs_isilocked(struct xfs_inode *, uint);
-uint		xfs_ilock_data_map_shared(struct xfs_inode *);
-uint		xfs_ilock_attr_map_shared(struct xfs_inode *);
+void		xfs_ilock(struct xfs_inode *ip, uint lockmode);
+int		xfs_ilock_nowait(struct xfs_inode *ip, uint lockmode);
+void		xfs_iunlock(struct xfs_inode *ip, uint lockmode);
+void		xfs_ilock_demote(struct xfs_inode *ip, uint lockmode);
+bool		xfs_isilocked(struct xfs_inode *ip, uint lockmode);
+uint		xfs_ilock_data_map_shared(struct xfs_inode *ip);
+uint		xfs_ilock_data_map_shared_nowait(struct xfs_inode *ip);
+int		xfs_ilock_data_map_shared_generic(struct xfs_inode *ip,
+						  bool nowait);
+uint		xfs_ilock_attr_map_shared(struct xfs_inode *ip);
 
 uint		xfs_ip2xflags(struct xfs_inode *);
 int		xfs_ifree(struct xfs_trans *, struct xfs_inode *);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 03/11] vfs: add nowait flag for struct dir_context
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
  2023-08-27 13:28 ` [PATCH 01/11] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
  2023-08-27 13:28 ` [PATCH 02/11] xfs: add NOWAIT semantics for readdir Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 13:28 ` [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock Hao Xu
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

The flags will allow passing DIR_CONTEXT_F_NOWAIT to iterate()
implementations that support it (as signaled through FMODE_NWAIT
in file->f_mode)

Notes:
- considered using IOCB_NOWAIT but if we add more flags later it
would be confusing to keep track of which values are valid, use
dedicated flags
- might want to check ctx.flags & DIR_CONTEXT_F_NOWAIT is only set
when file->f_mode & FMODE_NOWAIT in iterate_dir() as e.g. WARN_ONCE?

Co-developed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/internal.h      | 2 +-
 fs/readdir.c       | 6 ++++--
 include/linux/fs.h | 8 ++++++++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index b1f66e52d61b..7508d485c655 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -311,4 +311,4 @@ void mnt_idmap_put(struct mnt_idmap *idmap);
 struct linux_dirent64;
 
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count);
+		 unsigned int count, unsigned long flags);
diff --git a/fs/readdir.c b/fs/readdir.c
index 9592259b7e7f..b80caf4c9321 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -358,12 +358,14 @@ static bool filldir64(struct dir_context *ctx, const char *name, int namlen,
  * @file    : pointer to file struct of directory
  * @dirent  : pointer to user directory structure
  * @count   : size of buffer
+ * @flags   : additional dir_context flags
  */
 int vfs_getdents(struct file *file, struct linux_dirent64 __user *dirent,
-		 unsigned int count)
+		 unsigned int count, unsigned long flags)
 {
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
+		.ctx.flags = flags,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -395,7 +397,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	if (!f.file)
 		return -EBADF;
 
-	error = vfs_getdents(f.file, dirent, count);
+	error = vfs_getdents(f.file, dirent, count, 0);
 
 	fdput_pos(f);
 	return error;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..f3e315e8efdd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1719,8 +1719,16 @@ typedef bool (*filldir_t)(struct dir_context *, const char *, int, loff_t, u64,
 struct dir_context {
 	filldir_t actor;
 	loff_t pos;
+	unsigned long flags;
 };
 
+/*
+ * flags for dir_context flags
+ * DIR_CONTEXT_F_NOWAIT: Request non-blocking iterate
+ *                       (requires file->f_mode & FMODE_NOWAIT)
+ */
+#define DIR_CONTEXT_F_NOWAIT	(1 << 0)
+
 /*
  * These flags let !MMU mmap() govern direct device mapping vs immediate
  * copying more easily for MAP_PRIVATE, especially for ROM filesystems.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (2 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 03/11] vfs: add nowait flag for struct dir_context Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 20:47   ` Matthew Wilcox
  2023-08-27 13:28 ` [PATCH 05/11] vfs: add file_pos_unlock() for io_uring usage Hao Xu
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Add a vfs helper file_pos_lock_nowait() for io_uring usage. The function
have conditional nowait logic, i.e. if nowait is needed, return -EAGAIN
when trylock fails.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/file.c            | 13 +++++++++++++
 include/linux/file.h |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 35c62b54c9d6..8e5c38f5db52 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1053,6 +1053,19 @@ void __f_unlock_pos(struct file *f)
 	mutex_unlock(&f->f_pos_lock);
 }
 
+int file_pos_lock_nowait(struct file *file, bool nowait)
+{
+	if (!(file->f_mode & FMODE_ATOMIC_POS))
+		return 0;
+
+	if (!nowait)
+		mutex_lock(&file->f_pos_lock);
+	else if (!mutex_trylock(&file->f_pos_lock))
+		return -EAGAIN;
+
+	return 1;
+}
+
 /*
  * We only lock f_pos if we have threads or if the file might be
  * shared with another process. In both cases we'll have an elevated
diff --git a/include/linux/file.h b/include/linux/file.h
index 6e9099d29343..bcc6ba0aec50 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -81,6 +81,8 @@ static inline void fdput_pos(struct fd f)
 	fdput(f);
 }
 
+extern int file_pos_lock_nowait(struct file *file, bool nowait);
+
 DEFINE_CLASS(fd, struct fd, fdput(_T), fdget(fd), int fd)
 
 extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 05/11] vfs: add file_pos_unlock() for io_uring usage
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (3 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 13:28 ` [PATCH 06/11] vfs: add a nowait parameter for touch_atime() Hao Xu
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Add a helper to unlock f_pos_lock without any condition. Introduce this
since io_uring handles f_pos_lock not with a fd struct, thus
FDPUT_POS_UNLOCK isn't used.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 include/linux/file.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/file.h b/include/linux/file.h
index bcc6ba0aec50..a179f4794341 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -81,6 +81,11 @@ static inline void fdput_pos(struct fd f)
 	fdput(f);
 }
 
+static inline void file_pos_unlock(struct file *file)
+{
+	__f_unlock_pos(file);
+}
+
 extern int file_pos_lock_nowait(struct file *file, bool nowait);
 
 DEFINE_CLASS(fd, struct fd, fdput(_T), fdget(fd), int fd)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 06/11] vfs: add a nowait parameter for touch_atime()
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (4 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 05/11] vfs: add file_pos_unlock() for io_uring usage Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 13:28 ` [PATCH 07/11] vfs: add nowait parameter for file_accessed() Hao Xu
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Add a nowait boolean parameter for touch_atime() to support nowait
semantics. It is true only when io_uring is the initial caller.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/cachefiles/namei.c | 2 +-
 fs/ecryptfs/file.c    | 4 ++--
 fs/inode.c            | 7 ++++---
 fs/namei.c            | 4 ++--
 fs/nfsd/vfs.c         | 2 +-
 fs/overlayfs/file.c   | 2 +-
 fs/overlayfs/inode.c  | 2 +-
 fs/stat.c             | 2 +-
 include/linux/fs.h    | 4 ++--
 kernel/bpf/inode.c    | 4 ++--
 net/unix/af_unix.c    | 4 ++--
 11 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index d9d22d0ec38a..7a21bf0e36b8 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -591,7 +591,7 @@ static bool cachefiles_open_file(struct cachefiles_object *object,
 	 * used to keep track of culling, and atimes are only updated by read,
 	 * write and readdir but not lookup or open).
 	 */
-	touch_atime(&file->f_path);
+	touch_atime(&file->f_path, false);
 	dput(dentry);
 	return true;
 
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index ce0a3c5ed0ca..3db7006cc440 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -39,7 +39,7 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
 	rc = generic_file_read_iter(iocb, to);
 	if (rc >= 0) {
 		path = ecryptfs_dentry_to_lower_path(file->f_path.dentry);
-		touch_atime(path);
+		touch_atime(path, false);
 	}
 	return rc;
 }
@@ -64,7 +64,7 @@ static ssize_t ecryptfs_splice_read_update_atime(struct file *in, loff_t *ppos,
 	rc = filemap_splice_read(in, ppos, pipe, len, flags);
 	if (rc >= 0) {
 		path = ecryptfs_dentry_to_lower_path(in->f_path.dentry);
-		touch_atime(path);
+		touch_atime(path, false);
 	}
 	return rc;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..e83b836f2d09 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1961,17 +1961,17 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
 	return true;
 }
 
-void touch_atime(const struct path *path)
+int touch_atime(const struct path *path, bool nowait)
 {
 	struct vfsmount *mnt = path->mnt;
 	struct inode *inode = d_inode(path->dentry);
 	struct timespec64 now;
 
 	if (!atime_needs_update(path, inode))
-		return;
+		return 0;
 
 	if (!sb_start_write_trylock(inode->i_sb))
-		return;
+		return 0;
 
 	if (__mnt_want_write(mnt) != 0)
 		goto skip_update;
@@ -1989,6 +1989,7 @@ void touch_atime(const struct path *path)
 	__mnt_drop_write(mnt);
 skip_update:
 	sb_end_write(inode->i_sb);
+	return 0;
 }
 EXPORT_SYMBOL(touch_atime);
 
diff --git a/fs/namei.c b/fs/namei.c
index e56ff39a79bc..35731d405730 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1776,12 +1776,12 @@ static const char *pick_link(struct nameidata *nd, struct path *link,
 		return ERR_PTR(-ELOOP);
 
 	if (!(nd->flags & LOOKUP_RCU)) {
-		touch_atime(&last->link);
+		touch_atime(&last->link, false);
 		cond_resched();
 	} else if (atime_needs_update(&last->link, inode)) {
 		if (!try_to_unlazy(nd))
 			return ERR_PTR(-ECHILD);
-		touch_atime(&last->link);
+		touch_atime(&last->link, false);
 	}
 
 	error = security_inode_follow_link(link->dentry, inode,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 8a2321d19194..3179e7b5d209 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1569,7 +1569,7 @@ nfsd_readlink(struct svc_rqst *rqstp, struct svc_fh *fhp, char *buf, int *lenp)
 	if (unlikely(!d_is_symlink(path.dentry)))
 		return nfserr_inval;
 
-	touch_atime(&path);
+	touch_atime(&path, false);
 
 	link = vfs_get_link(path.dentry, &done);
 	if (IS_ERR(link))
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 21245b00722a..6ff466ef98ea 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -255,7 +255,7 @@ static void ovl_file_accessed(struct file *file)
 		inode->i_ctime = upperinode->i_ctime;
 	}
 
-	touch_atime(&file->f_path);
+	touch_atime(&file->f_path, false);
 }
 
 static rwf_t ovl_iocb_to_rwf(int ifl)
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index a63e57447be9..66e03025e748 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -703,7 +703,7 @@ int ovl_update_time(struct inode *inode, struct timespec64 *ts, int flags)
 		};
 
 		if (upperpath.dentry) {
-			touch_atime(&upperpath);
+			touch_atime(&upperpath, false);
 			inode->i_atime = d_inode(upperpath.dentry)->i_atime;
 		}
 	}
diff --git a/fs/stat.c b/fs/stat.c
index 7c238da22ef0..713773e61110 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -485,7 +485,7 @@ static int do_readlinkat(int dfd, const char __user *pathname,
 		if (d_is_symlink(path.dentry) || inode->i_op->readlink) {
 			error = security_inode_readlink(path.dentry);
 			if (!error) {
-				touch_atime(&path);
+				touch_atime(&path, false);
 				error = vfs_readlink(path.dentry, buf, bufsiz);
 			}
 		}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3e315e8efdd..ba54879089ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2201,13 +2201,13 @@ enum file_time_flags {
 };
 
 extern bool atime_needs_update(const struct path *, struct inode *);
-extern void touch_atime(const struct path *);
+extern int touch_atime(const struct path *path, bool nowait);
 int inode_update_time(struct inode *inode, struct timespec64 *time, int flags);
 
 static inline void file_accessed(struct file *file)
 {
 	if (!(file->f_flags & O_NOATIME))
-		touch_atime(&file->f_path);
+		touch_atime(&file->f_path, false);
 }
 
 extern int file_modified(struct file *file);
diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c
index 4174f76133df..bc020b45d5c8 100644
--- a/kernel/bpf/inode.c
+++ b/kernel/bpf/inode.c
@@ -517,7 +517,7 @@ static void *bpf_obj_do_get(int path_fd, const char __user *pathname,
 
 	raw = bpf_any_get(inode->i_private, *type);
 	if (!IS_ERR(raw))
-		touch_atime(&path);
+		touch_atime(&path, false);
 
 	path_put(&path);
 	return raw;
@@ -591,7 +591,7 @@ struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type typ
 		return ERR_PTR(ret);
 	prog = __get_prog_inode(d_backing_inode(path.dentry), type);
 	if (!IS_ERR(prog))
-		touch_atime(&path);
+		touch_atime(&path, false);
 	path_put(&path);
 	return prog;
 }
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 123b35ddfd71..5868e4e47320 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1084,7 +1084,7 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 
 	err = -EPROTOTYPE;
 	if (sk->sk_type == type)
-		touch_atime(&path);
+		touch_atime(&path, false);
 	else
 		goto sock_put;
 
@@ -1114,7 +1114,7 @@ static struct sock *unix_find_abstract(struct net *net,
 
 	dentry = unix_sk(sk)->path.dentry;
 	if (dentry)
-		touch_atime(&unix_sk(sk)->path);
+		touch_atime(&unix_sk(sk)->path, false);
 
 	return sk;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (5 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 06/11] vfs: add a nowait parameter for touch_atime() Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 21:32   ` Matthew Wilcox
  2023-09-04  9:51   ` Christian Brauner
  2023-08-27 13:28 ` [PATCH 08/11] vfs: move file_accessed() to the beginning of iterate_dir() Hao Xu
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Add a boolean parameter for file_accessed() to support nowait semantics.
Currently it is true only with io_uring as its initial caller.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 arch/s390/hypfs/inode.c | 2 +-
 block/fops.c            | 2 +-
 fs/btrfs/file.c         | 2 +-
 fs/btrfs/inode.c        | 2 +-
 fs/coda/dir.c           | 4 ++--
 fs/ext2/file.c          | 4 ++--
 fs/ext4/file.c          | 6 +++---
 fs/f2fs/file.c          | 4 ++--
 fs/fuse/dax.c           | 2 +-
 fs/fuse/file.c          | 4 ++--
 fs/gfs2/file.c          | 2 +-
 fs/hugetlbfs/inode.c    | 2 +-
 fs/nilfs2/file.c        | 2 +-
 fs/orangefs/file.c      | 2 +-
 fs/orangefs/inode.c     | 2 +-
 fs/pipe.c               | 2 +-
 fs/ramfs/file-nommu.c   | 2 +-
 fs/readdir.c            | 2 +-
 fs/smb/client/cifsfs.c  | 2 +-
 fs/splice.c             | 2 +-
 fs/ubifs/file.c         | 2 +-
 fs/udf/file.c           | 2 +-
 fs/xfs/xfs_file.c       | 6 +++---
 fs/zonefs/file.c        | 4 ++--
 include/linux/fs.h      | 5 +++--
 mm/filemap.c            | 8 ++++----
 mm/shmem.c              | 6 +++---
 27 files changed, 43 insertions(+), 42 deletions(-)

diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index ee919bfc8186..55f562027c4f 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -157,7 +157,7 @@ static ssize_t hypfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (!count)
 		return -EFAULT;
 	iocb->ki_pos = pos + count;
-	file_accessed(file);
+	file_accessed(file, false);
 	return count;
 }
 
diff --git a/block/fops.c b/block/fops.c
index a286bf3325c5..546ecd3c8084 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -601,7 +601,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		ret = kiocb_write_and_wait(iocb, count);
 		if (ret < 0)
 			goto reexpand;
-		file_accessed(iocb->ki_filp);
+		file_accessed(iocb->ki_filp, false);
 
 		ret = blkdev_direct_IO(iocb, to);
 		if (ret >= 0) {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fd03e689a6be..24c0bf3818a6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2013,7 +2013,7 @@ static int btrfs_file_mmap(struct file	*filp, struct vm_area_struct *vma)
 	if (!mapping->a_ops->read_folio)
 		return -ENOEXEC;
 
-	file_accessed(filp);
+	file_accessed(filp, false);
 	vma->vm_ops = &btrfs_file_vm_ops;
 
 	return 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dbbb67293e34..50e9ae8c388c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10153,7 +10153,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 	struct extent_map *em;
 	bool unlocked = false;
 
-	file_accessed(iocb->ki_filp);
+	file_accessed(iocb->ki_filp, false);
 
 	btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
 
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index 8450b1bd354b..1d94c013ac88 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -436,12 +436,12 @@ static int coda_readdir(struct file *coda_file, struct dir_context *ctx)
 			if (host_file->f_op->iterate_shared) {
 				inode_lock_shared(host_inode);
 				ret = host_file->f_op->iterate_shared(host_file, ctx);
-				file_accessed(host_file);
+				file_accessed(host_file, false);
 				inode_unlock_shared(host_inode);
 			} else {
 				inode_lock(host_inode);
 				ret = host_file->f_op->iterate(host_file, ctx);
-				file_accessed(host_file);
+				file_accessed(host_file, false);
 				inode_unlock(host_inode);
 			}
 		}
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0b4c91c62e1f..dc059cae50a4 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -44,7 +44,7 @@ static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ret = dax_iomap_rw(iocb, to, &ext2_iomap_ops);
 	inode_unlock_shared(inode);
 
-	file_accessed(iocb->ki_filp);
+	file_accessed(iocb->ki_filp, false);
 	return ret;
 }
 
@@ -127,7 +127,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!IS_DAX(file_inode(file)))
 		return generic_file_mmap(file, vma);
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &ext2_dax_vm_ops;
 	return 0;
 }
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index c457c8517f0f..2ab790a668a8 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -94,7 +94,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, NULL, 0);
 	inode_unlock_shared(inode);
 
-	file_accessed(iocb->ki_filp);
+	file_accessed(iocb->ki_filp, false);
 	return ret;
 }
 
@@ -122,7 +122,7 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
 	inode_unlock_shared(inode);
 
-	file_accessed(iocb->ki_filp);
+	file_accessed(iocb->ki_filp, false);
 	return ret;
 }
 #endif
@@ -820,7 +820,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!daxdev_mapping_supported(vma, dax_dev))
 		return -EOPNOTSUPP;
 
-	file_accessed(file);
+	file_accessed(file, false);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
 		vm_flags_set(vma, VM_HUGEPAGE);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 093039dee992..246e61d78f92 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -524,7 +524,7 @@ static int f2fs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!f2fs_is_compress_backend_ready(inode))
 		return -EOPNOTSUPP;
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &f2fs_file_vm_ops;
 	set_inode_flag(inode, FI_MMAP_FILE);
 	return 0;
@@ -4380,7 +4380,7 @@ static ssize_t f2fs_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	f2fs_up_read(&fi->i_gc_rwsem[READ]);
 
-	file_accessed(file);
+	file_accessed(file, false);
 out:
 	trace_f2fs_direct_IO_exit(inode, pos, count, READ, ret);
 	return ret;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 8e74f278a3f6..8a43c37195dd 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -858,7 +858,7 @@ static const struct vm_operations_struct fuse_dax_vm_ops = {
 
 int fuse_dax_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &fuse_dax_vm_ops;
 	vm_flags_set(vma, VM_MIXEDMAP | VM_HUGEPAGE);
 	return 0;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index bc4115288eec..3c4cbc5e2de6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2496,7 +2496,7 @@ static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 		fuse_link_write_file(file);
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &fuse_file_vm_ops;
 	return 0;
 }
@@ -3193,7 +3193,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
 		clear_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state);
 
 	inode_unlock(inode_out);
-	file_accessed(file_in);
+	file_accessed(file_in, false);
 
 	fuse_flush_time_update(inode_out);
 
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 1bf3c4453516..3003be5b8266 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -601,7 +601,7 @@ static int gfs2_mmap(struct file *file, struct vm_area_struct *vma)
 			return error;
 		/* grab lock to update inode */
 		gfs2_glock_dq_uninit(&i_gh);
-		file_accessed(file);
+		file_accessed(file, false);
 	}
 	vma->vm_ops = &gfs2_vm_ops;
 
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7b17ccfa039d..729f66346c3c 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -161,7 +161,7 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 		return -EINVAL;
 
 	inode_lock(inode);
-	file_accessed(file);
+	file_accessed(file, false);
 
 	ret = -ENOMEM;
 	if (!hugetlb_reserve_pages(inode,
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index a9eb3487efb2..a857ebcf099c 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -119,7 +119,7 @@ static const struct vm_operations_struct nilfs_file_vm_ops = {
 
 static int nilfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &nilfs_file_vm_ops;
 	return 0;
 }
diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index d68372241b30..5c7a17995fe1 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -412,7 +412,7 @@ static int orangefs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	/* set the sequential readahead hint */
 	vm_flags_mod(vma, VM_SEQ_READ, VM_RAND_READ);
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &orangefs_file_vm_ops;
 	return 0;
 }
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 9014bbcc8031..77d56703bb09 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -597,7 +597,7 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
 		ret = total_count;
 	if (ret > 0) {
 		if (type == ORANGEFS_IO_READ) {
-			file_accessed(file);
+			file_accessed(file, false);
 		} else {
 			file_update_time(file);
 			if (*offset > i_size_read(inode))
diff --git a/fs/pipe.c b/fs/pipe.c
index 2d88f73f585a..ce1038d3de4b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -393,7 +393,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 	kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
 	if (ret > 0)
-		file_accessed(filp);
+		file_accessed(filp, false);
 	return ret;
 }
 
diff --git a/fs/ramfs/file-nommu.c b/fs/ramfs/file-nommu.c
index efb1b4c1a0a4..ad69f828f6ad 100644
--- a/fs/ramfs/file-nommu.c
+++ b/fs/ramfs/file-nommu.c
@@ -267,7 +267,7 @@ static int ramfs_nommu_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!is_nommu_shared_mapping(vma->vm_flags))
 		return -ENOSYS;
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }
diff --git a/fs/readdir.c b/fs/readdir.c
index b80caf4c9321..2f4c9c663a39 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -68,7 +68,7 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
 			res = file->f_op->iterate(file, ctx);
 		file->f_pos = ctx->pos;
 		fsnotify_access(file);
-		file_accessed(file);
+		file_accessed(file, ctx->flags & DIR_CONTEXT_F_NOWAIT);
 	}
 	if (shared)
 		inode_unlock_shared(inode);
diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
index a4d8b0ea1c8c..20156c5e83e6 100644
--- a/fs/smb/client/cifsfs.c
+++ b/fs/smb/client/cifsfs.c
@@ -1307,7 +1307,7 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
 		rc = target_tcon->ses->server->ops->copychunk_range(xid,
 			smb_file_src, smb_file_target, off, len, destoff);
 
-	file_accessed(src_file);
+	file_accessed(src_file, false);
 
 	/* force revalidate of size and timestamps of target file now
 	 * that target is updated on the server
diff --git a/fs/splice.c b/fs/splice.c
index 004eb1c4ce31..e4dcfa1c0fef 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1104,7 +1104,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 
 done:
 	pipe->tail = pipe->head = 0;
-	file_accessed(in);
+	file_accessed(in, false);
 	return bytes;
 
 read_failure:
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 6738fe43040b..a27c73848571 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1603,7 +1603,7 @@ static int ubifs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	vma->vm_ops = &ubifs_file_vm_ops;
 
 	if (IS_ENABLED(CONFIG_UBIFS_ATIME_SUPPORT))
-		file_accessed(file);
+		file_accessed(file, false);
 
 	return 0;
 }
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 243840dc83ad..46edf6e64632 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -191,7 +191,7 @@ static int udf_release_file(struct inode *inode, struct file *filp)
 
 static int udf_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &udf_file_vm_ops;
 
 	return 0;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4f502219ae4f..c72efdb9e43e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -227,7 +227,7 @@ xfs_file_dio_read(
 	if (!iov_iter_count(to))
 		return 0; /* skip atime */
 
-	file_accessed(iocb->ki_filp);
+	file_accessed(iocb->ki_filp, false);
 
 	ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
 	if (ret)
@@ -257,7 +257,7 @@ xfs_file_dax_read(
 	ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
 	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
 
-	file_accessed(iocb->ki_filp);
+	file_accessed(iocb->ki_filp, false);
 	return ret;
 }
 
@@ -1434,7 +1434,7 @@ xfs_file_mmap(
 	if (!daxdev_mapping_supported(vma, target->bt_daxdev))
 		return -EOPNOTSUPP;
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(inode))
 		vm_flags_set(vma, VM_HUGEPAGE);
diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 92c9aaae3663..664ebae181bd 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -323,7 +323,7 @@ static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
 		return -EINVAL;
 
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &zonefs_file_vm_ops;
 
 	return 0;
@@ -736,7 +736,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			ret = -EINVAL;
 			goto inode_unlock;
 		}
-		file_accessed(iocb->ki_filp);
+		file_accessed(iocb->ki_filp, false);
 		ret = iomap_dio_rw(iocb, to, &zonefs_read_iomap_ops,
 				   &zonefs_read_dio_ops, 0, NULL, 0);
 	} else {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ba54879089ac..ed60b3d70d1e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2204,10 +2204,11 @@ extern bool atime_needs_update(const struct path *, struct inode *);
 extern int touch_atime(const struct path *path, bool nowait);
 int inode_update_time(struct inode *inode, struct timespec64 *time, int flags);
 
-static inline void file_accessed(struct file *file)
+static inline int file_accessed(struct file *file, bool nowait)
 {
 	if (!(file->f_flags & O_NOATIME))
-		touch_atime(&file->f_path, false);
+		return touch_atime(&file->f_path, nowait);
+	return 0;
 }
 
 extern int file_modified(struct file *file);
diff --git a/mm/filemap.c b/mm/filemap.c
index 9e44a49bbd74..1f2032f4fd10 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2723,7 +2723,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		folio_batch_init(&fbatch);
 	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
 
-	file_accessed(filp);
+	file_accessed(filp, false);
 
 	return already_read ? already_read : error;
 }
@@ -2809,7 +2809,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		retval = kiocb_write_and_wait(iocb, count);
 		if (retval < 0)
 			return retval;
-		file_accessed(file);
+		file_accessed(file, false);
 
 		retval = mapping->a_ops->direct_IO(iocb, iter);
 		if (retval >= 0) {
@@ -2978,7 +2978,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
 
 out:
 	folio_batch_release(&fbatch);
-	file_accessed(in);
+	file_accessed(in, false);
 
 	return total_spliced ? total_spliced : error;
 }
@@ -3613,7 +3613,7 @@ int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
 
 	if (!mapping->a_ops->read_folio)
 		return -ENOEXEC;
-	file_accessed(file);
+	file_accessed(file, false);
 	vma->vm_ops = &generic_file_vm_ops;
 	return 0;
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 2f2e0e618072..440b23e2d9e1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2317,7 +2317,7 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	/* arm64 - allow memory tagging on RAM-based files */
 	vm_flags_set(vma, VM_MTE_ALLOWED);
 
-	file_accessed(file);
+	file_accessed(file, false);
 	/* This is anonymous shared memory if it is unlinked at the time of mmap */
 	if (inode->i_nlink)
 		vma->vm_ops = &shmem_vm_ops;
@@ -2727,7 +2727,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	}
 
 	*ppos = ((loff_t) index << PAGE_SHIFT) + offset;
-	file_accessed(file);
+	file_accessed(file, false);
 	return retval ? retval : error;
 }
 
@@ -2859,7 +2859,7 @@ static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
 	if (folio)
 		folio_put(folio);
 
-	file_accessed(in);
+	file_accessed(in, false);
 	return total_spliced ? total_spliced : error;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 08/11] vfs: move file_accessed() to the beginning of iterate_dir()
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (6 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 07/11] vfs: add nowait parameter for file_accessed() Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 13:28 ` [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated Hao Xu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Move file_accessed() to the beginning of iterate_dir() so that we don't
need to rollback all the work done when file_accessed() returns -EAGAIN
at the end of getdents.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/readdir.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 2f4c9c663a39..6469f076ba6e 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -61,6 +61,10 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
 
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
+		res = file_accessed(file, ctx->flags & DIR_CONTEXT_F_NOWAIT);
+		if (res == -EAGAIN)
+			goto out_unlock;
+
 		ctx->pos = file->f_pos;
 		if (shared)
 			res = file->f_op->iterate_shared(file, ctx);
@@ -68,8 +72,9 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
 			res = file->f_op->iterate(file, ctx);
 		file->f_pos = ctx->pos;
 		fsnotify_access(file);
-		file_accessed(file, ctx->flags & DIR_CONTEXT_F_NOWAIT);
 	}
+
+out_unlock:
 	if (shared)
 		inode_unlock_shared(inode);
 	else
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (7 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 08/11] vfs: move file_accessed() to the beginning of iterate_dir() Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-08-27 20:51   ` Matthew Wilcox
  2023-08-27 13:28 ` [PATCH 10/11] vfs: trylock inode->i_rwsem in iterate_dir() to support nowait Hao Xu
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

To enforce nowait semantics, error out -EAGAIN if atime needs to be
updated.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index e83b836f2d09..32d81be65cf9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1970,6 +1970,9 @@ int touch_atime(const struct path *path, bool nowait)
 	if (!atime_needs_update(path, inode))
 		return 0;
 
+	if (nowait)
+		return -EAGAIN;
+
 	if (!sb_start_write_trylock(inode->i_sb))
 		return 0;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 10/11] vfs: trylock inode->i_rwsem in iterate_dir() to support nowait
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (8 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-09-04  9:37   ` Christian Brauner
  2023-08-27 13:28 ` [PATCH 11/11] io_uring: add support for getdents Hao Xu
  2023-09-04  9:57 ` [PATCH v6 00/11] io_uring getdents Christian Brauner
  11 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

Trylock inode->i_rwsem in iterate_dir() to support nowait semantics and
error out -EAGAIN when there is contention.

Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 fs/readdir.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/readdir.c b/fs/readdir.c
index 6469f076ba6e..664ecd9665a1 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -43,6 +43,8 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
 	struct inode *inode = file_inode(file);
 	bool shared = false;
 	int res = -ENOTDIR;
+	bool nowait;
+
 	if (file->f_op->iterate_shared)
 		shared = true;
 	else if (!file->f_op->iterate)
@@ -52,16 +54,22 @@ int iterate_dir(struct file *file, struct dir_context *ctx)
 	if (res)
 		goto out;
 
-	if (shared)
-		res = down_read_killable(&inode->i_rwsem);
-	else
-		res = down_write_killable(&inode->i_rwsem);
-	if (res)
+	nowait = ctx->flags & DIR_CONTEXT_F_NOWAIT;
+	if (nowait) {
+		res = shared ? down_read_trylock(&inode->i_rwsem) :
+			       down_write_trylock(&inode->i_rwsem);
+		if (!res)
+			res = -EAGAIN;
+	} else {
+		res = shared ? down_read_killable(&inode->i_rwsem) :
+			       down_write_killable(&inode->i_rwsem);
+	}
+	if (res < 0)
 		goto out;
 
 	res = -ENOENT;
 	if (!IS_DEADDIR(inode)) {
-		res = file_accessed(file, ctx->flags & DIR_CONTEXT_F_NOWAIT);
+		res = file_accessed(file, nowait);
 		if (res == -EAGAIN)
 			goto out_unlock;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 11/11] io_uring: add support for getdents
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (9 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 10/11] vfs: trylock inode->i_rwsem in iterate_dir() to support nowait Hao Xu
@ 2023-08-27 13:28 ` Hao Xu
  2023-09-04  9:57 ` [PATCH v6 00/11] io_uring getdents Christian Brauner
  11 siblings, 0 replies; 28+ messages in thread
From: Hao Xu @ 2023-08-27 13:28 UTC (permalink / raw)
  To: io-uring, Jens Axboe
  Cc: Dominique Martinet, Pavel Begunkov, Christian Brauner,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

From: Hao Xu <howeyxu@tencent.com>

This add support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from it's current's position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

For filesystems that support NOWAIT in iterate_shared(), try to use it
first; if a user already knows the filesystem they use do not support
nowait they can force async through IOSQE_ASYNC in the sqe flags,
avoiding the need to bounce back through a useless EAGAIN return.

Co-developed-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Hao Xu <howeyxu@tencent.com>
---
 include/uapi/linux/io_uring.h |  1 +
 io_uring/fs.c                 | 53 +++++++++++++++++++++++++++++++++++
 io_uring/fs.h                 |  3 ++
 io_uring/opdef.c              |  8 ++++++
 4 files changed, 65 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 8e61f8b7c2ce..3896397a1998 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -240,6 +240,7 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..04711feac4e6 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,12 @@ struct io_link {
 	int				flags;
 };
 
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+};
+
 int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +297,50 @@ void io_link_cleanup(struct io_kiocb *req)
 	putname(sl->oldpath);
 	putname(sl->newpath);
 }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off))
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	struct file *file = req->file;
+	unsigned long getdents_flags = 0;
+	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	bool locked;
+	int ret;
+
+	if (force_nonblock) {
+		if (!(file->f_flags & O_NONBLOCK) &&
+		    !(file->f_mode & FMODE_NOWAIT))
+			return -EAGAIN;
+
+		getdents_flags = DIR_CONTEXT_F_NOWAIT;
+	}
+
+	ret = file_pos_lock_nowait(file, force_nonblock);
+	if (ret == -EAGAIN)
+		return ret;
+	locked = ret;
+
+	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
+	if (locked)
+		file_pos_unlock(file);
+
+	if (ret == -EAGAIN && force_nonblock)
+		return -EAGAIN;
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
 int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
 void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..1bae6b2a8d0b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
 };
 
 
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/11] xfs: add NOWAIT semantics for readdir
  2023-08-27 13:28 ` [PATCH 02/11] xfs: add NOWAIT semantics for readdir Hao Xu
@ 2023-08-27 20:44   ` Matthew Wilcox
  2023-08-29  7:41     ` Hao Xu
  2023-09-04  1:02   ` Dave Chinner
  1 sibling, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2023-08-27 20:44 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2643,16 +2643,32 @@ xfs_da_read_buf(
>  	struct xfs_buf_map	map, *mapp = &map;
>  	int			nmap = 1;
>  	int			error;
> +	int			buf_flags = 0;
>  
>  	*bpp = NULL;
>  	error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
>  	if (error || !nmap)
>  		goto out_free;
>  
> +	/*
> +	 * NOWAIT semantics mean we don't wait on the buffer lock nor do we
> +	 * issue IO for this buffer if it is not already in memory. Caller will
> +	 * retry. This will return -EAGAIN if the buffer is in memory and cannot
> +	 * be locked, and no buffer and no error if it isn't in memory.  We
> +	 * translate both of those into a return state of -EAGAIN and *bpp =
> +	 * NULL.
> +	 */

I would not include this comment.

> +	if (flags & XFS_DABUF_NOWAIT)
> +		buf_flags |= XBF_TRYLOCK | XBF_INCORE;
>  	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
>  			&bp, ops);

what tsting did you do with this?  Because you don't actually _use_
buf_flags anywhere in this patch (presumably they should be the
sixth argument to xfs_trans_read_buf_map() instead of 0).  So I can only
conclude that either you didn't test, or your testing was inadequate.

>  	if (error)
>  		goto out_free;
> +	if (!bp) {
> +		ASSERT(flags & XFS_DABUF_NOWAIT);

I don't think this ASSERT is appropriate.

> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>  				bp = NULL;
>  			}
>  
> -			if (*lock_mode == 0)
> -				*lock_mode = xfs_ilock_data_map_shared(dp);
> +			if (*lock_mode == 0) {
> +				*lock_mode =
> +					xfs_ilock_data_map_shared_generic(dp,
> +					ctx->flags & DIR_CONTEXT_F_NOWAIT);
> +				if (!*lock_mode) {
> +					error = -EAGAIN;
> +					break;
> +				}
> +			}

'generic' doesn't seem like a great suffix to mean 'takes nowait flag'.
And this is far too far indented.

			xfs_dir2_lock(dp, ctx, lock_mode);

with:

STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx,
		unsigned int lock_mode)
{
	if (*lock_mode)
		return;
	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
		return xfs_ilock_data_map_shared_nowait(dp);
	return xfs_ilock_data_map_shared(dp);
}

... which I think you can use elsewhere in this patch (reformat it to
XFS coding style, of course).  And then you don't need
xfs_ilock_data_map_shared_generic().


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock
  2023-08-27 13:28 ` [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock Hao Xu
@ 2023-08-27 20:47   ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2023-08-27 20:47 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:28PM +0800, Hao Xu wrote:
> +++ b/include/linux/file.h
> @@ -81,6 +81,8 @@ static inline void fdput_pos(struct fd f)
>  	fdput(f);
>  }
>  
> +extern int file_pos_lock_nowait(struct file *file, bool nowait);

No extern on function declarations.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated
  2023-08-27 13:28 ` [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated Hao Xu
@ 2023-08-27 20:51   ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2023-08-27 20:51 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:33PM +0800, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> To enforce nowait semantics, error out -EAGAIN if atime needs to be
> updated.

Squash this into patch 6.  Otherwise patch 6 makes no sense.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-27 13:28 ` [PATCH 07/11] vfs: add nowait parameter for file_accessed() Hao Xu
@ 2023-08-27 21:32   ` Matthew Wilcox
  2023-08-29  7:46     ` Hao Xu
  2023-09-04  9:51   ` Christian Brauner
  1 sibling, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2023-08-27 21:32 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> Add a boolean parameter for file_accessed() to support nowait semantics.
> Currently it is true only with io_uring as its initial caller.

So why do we need to do this as part of this series?  Apparently it
hasn't caused any problems for filemap_read().

> +++ b/mm/filemap.c
> @@ -2723,7 +2723,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>  		folio_batch_init(&fbatch);
>  	} while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>  
> -	file_accessed(filp);
> +	file_accessed(filp, false);
>  
>  	return already_read ? already_read : error;
>  }
> @@ -2809,7 +2809,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		retval = kiocb_write_and_wait(iocb, count);
>  		if (retval < 0)
>  			return retval;
> -		file_accessed(file);
> +		file_accessed(file, false);
>  
>  		retval = mapping->a_ops->direct_IO(iocb, iter);
>  		if (retval >= 0) {
> @@ -2978,7 +2978,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
>  
>  out:
>  	folio_batch_release(&fbatch);
> -	file_accessed(in);
> +	file_accessed(in, false);
>  
>  	return total_spliced ? total_spliced : error;
>  }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/11] xfs: add NOWAIT semantics for readdir
  2023-08-27 20:44   ` Matthew Wilcox
@ 2023-08-29  7:41     ` Hao Xu
  2023-08-29 13:05       ` Matthew Wilcox
  0 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-08-29  7:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On 8/28/23 04:44, Matthew Wilcox wrote:
> On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
>> +++ b/fs/xfs/libxfs/xfs_da_btree.c
>> @@ -2643,16 +2643,32 @@ xfs_da_read_buf(
>>   	struct xfs_buf_map	map, *mapp = &map;
>>   	int			nmap = 1;
>>   	int			error;
>> +	int			buf_flags = 0;
>>   
>>   	*bpp = NULL;
>>   	error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
>>   	if (error || !nmap)
>>   		goto out_free;
>>   
>> +	/*
>> +	 * NOWAIT semantics mean we don't wait on the buffer lock nor do we
>> +	 * issue IO for this buffer if it is not already in memory. Caller will
>> +	 * retry. This will return -EAGAIN if the buffer is in memory and cannot
>> +	 * be locked, and no buffer and no error if it isn't in memory.  We
>> +	 * translate both of those into a return state of -EAGAIN and *bpp =
>> +	 * NULL.
>> +	 */
> 
> I would not include this comment.

No strong comment here, since this patch is mostly from Dave, it's
better if Dave can ack this.

> 
>> +	if (flags & XFS_DABUF_NOWAIT)
>> +		buf_flags |= XBF_TRYLOCK | XBF_INCORE;
>>   	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
>>   			&bp, ops);
> 
> what tsting did you do with this?  Because you don't actually _use_
> buf_flags anywhere in this patch (presumably they should be the
> sixth argument to xfs_trans_read_buf_map() instead of 0).  So I can only
> conclude that either you didn't test, or your testing was inadequate.
> 


The tests I've done are listed in the cover-letter, this one is missed, 
the tricky place is it's hard to get this kind of mistake since it runs
well without nowait logic...I'll fix it in next version.

>>   	if (error)
>>   		goto out_free;
>> +	if (!bp) {
>> +		ASSERT(flags & XFS_DABUF_NOWAIT);
> 
> I don't think this ASSERT is appropriate.
> 
>> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>>   				bp = NULL;
>>   			}
>>   
>> -			if (*lock_mode == 0)
>> -				*lock_mode = xfs_ilock_data_map_shared(dp);
>> +			if (*lock_mode == 0) {
>> +				*lock_mode =
>> +					xfs_ilock_data_map_shared_generic(dp,
>> +					ctx->flags & DIR_CONTEXT_F_NOWAIT);
>> +				if (!*lock_mode) {
>> +					error = -EAGAIN;
>> +					break;
>> +				}
>> +			}
> 
> 'generic' doesn't seem like a great suffix to mean 'takes nowait flag'.
> And this is far too far indented.
> 
> 			xfs_dir2_lock(dp, ctx, lock_mode);
> 
> with:
> 
> STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx,
> 		unsigned int lock_mode)
> {
> 	if (*lock_mode)
> 		return;
> 	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> 		return xfs_ilock_data_map_shared_nowait(dp);
> 	return xfs_ilock_data_map_shared(dp);
> }
> 
> ... which I think you can use elsewhere in this patch (reformat it to
> XFS coding style, of course).  And then you don't need
> xfs_ilock_data_map_shared_generic().
> 

How about rename xfs_ilock_data_map_shared() to 
xfs_ilock_data_map_block() and rename 
xfs_ilock_data_map_shared_generic() to xfs_ilock_data_map_shared()?

STATIC void xfs_ilock_data_map_shared(struct xfs_inode *dp, struct 
dir_context *ctx, unsigned int lock_mode)
{
  	if (*lock_mode)
  		return;
  	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
  		return xfs_ilock_data_map_shared_nowait(dp);
  	return xfs_ilock_data_map_shared_block(dp);
}



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-27 21:32   ` Matthew Wilcox
@ 2023-08-29  7:46     ` Hao Xu
  2023-08-29 11:53       ` Matthew Wilcox
  0 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-08-29  7:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On 8/28/23 05:32, Matthew Wilcox wrote:
> On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
>> From: Hao Xu <howeyxu@tencent.com>
>>
>> Add a boolean parameter for file_accessed() to support nowait semantics.
>> Currently it is true only with io_uring as its initial caller.
> 
> So why do we need to do this as part of this series?  Apparently it
> hasn't caused any problems for filemap_read().
> 

We need this parameter to indicate if nowait semantics should be 
enforced in touch_atime(), There are locks and maybe IOs in it.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-29  7:46     ` Hao Xu
@ 2023-08-29 11:53       ` Matthew Wilcox
  2023-08-30  6:11         ` Hao Xu
  0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2023-08-29 11:53 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> On 8/28/23 05:32, Matthew Wilcox wrote:
> > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > From: Hao Xu <howeyxu@tencent.com>
> > > 
> > > Add a boolean parameter for file_accessed() to support nowait semantics.
> > > Currently it is true only with io_uring as its initial caller.
> > 
> > So why do we need to do this as part of this series?  Apparently it
> > hasn't caused any problems for filemap_read().
> > 
> 
> We need this parameter to indicate if nowait semantics should be enforced in
> touch_atime(), There are locks and maybe IOs in it.

That's not my point.  We currently call file_accessed() and
touch_atime() for nowait reads and nowait writes.  You haven't done
anything to fix those.

I suspect you can trim this patchset down significantly by avoiding
fixing the file_accessed() problem.  And then come back with a later
patchset that fixes it for all nowait i/o.  Or do a separate prep series
first that fixes it for the existing nowait users, and then a second
series to do all the directory stuff.

I'd do the first thing.  Just ignore the problem.  Directory atime
updates cause I/O so rarely that you can afford to ignore it.  Almost
everyone uses relatime or nodiratime.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/11] xfs: add NOWAIT semantics for readdir
  2023-08-29  7:41     ` Hao Xu
@ 2023-08-29 13:05       ` Matthew Wilcox
  0 siblings, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2023-08-29 13:05 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Tue, Aug 29, 2023 at 03:41:43PM +0800, Hao Xu wrote:
> On 8/28/23 04:44, Matthew Wilcox wrote:
> > > @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
> > >   				bp = NULL;
> > >   			}
> > > -			if (*lock_mode == 0)
> > > -				*lock_mode = xfs_ilock_data_map_shared(dp);
> > > +			if (*lock_mode == 0) {
> > > +				*lock_mode =
> > > +					xfs_ilock_data_map_shared_generic(dp,
> > > +					ctx->flags & DIR_CONTEXT_F_NOWAIT);
> > > +				if (!*lock_mode) {
> > > +					error = -EAGAIN;
> > > +					break;
> > > +				}
> > > +			}
> > 
> > 'generic' doesn't seem like a great suffix to mean 'takes nowait flag'.
> > And this is far too far indented.
> > 
> > 			xfs_dir2_lock(dp, ctx, lock_mode);
> > 
> > with:
> > 
> > STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx,
> > 		unsigned int lock_mode)
> > {
> > 	if (*lock_mode)
> > 		return;
> > 	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> > 		return xfs_ilock_data_map_shared_nowait(dp);
> > 	return xfs_ilock_data_map_shared(dp);
> > }
> > 
> > ... which I think you can use elsewhere in this patch (reformat it to
> > XFS coding style, of course).  And then you don't need
> > xfs_ilock_data_map_shared_generic().
> 
> How about rename xfs_ilock_data_map_shared() to xfs_ilock_data_map_block()
> and rename xfs_ilock_data_map_shared_generic() to
> xfs_ilock_data_map_shared()?
> 
> STATIC void xfs_ilock_data_map_shared(struct xfs_inode *dp, struct
> dir_context *ctx, unsigned int lock_mode)
> {
>  	if (*lock_mode)
>  		return;
>  	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
>  		return xfs_ilock_data_map_shared_nowait(dp);
>  	return xfs_ilock_data_map_shared_block(dp);
> }

xfs_ilock_data_map_shared() is used for a lot of things which are not
directories.  I think a new function name is appropriate, and that
function name should include the word 'dir' in it somewhere.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-29 11:53       ` Matthew Wilcox
@ 2023-08-30  6:11         ` Hao Xu
  2023-09-03 22:30           ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Hao Xu @ 2023-08-30  6:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Dave Chinner, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On 8/29/23 19:53, Matthew Wilcox wrote:
> On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
>> On 8/28/23 05:32, Matthew Wilcox wrote:
>>> On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>
>>>> Add a boolean parameter for file_accessed() to support nowait semantics.
>>>> Currently it is true only with io_uring as its initial caller.
>>>
>>> So why do we need to do this as part of this series?  Apparently it
>>> hasn't caused any problems for filemap_read().
>>>
>>
>> We need this parameter to indicate if nowait semantics should be enforced in
>> touch_atime(), There are locks and maybe IOs in it.
> 
> That's not my point.  We currently call file_accessed() and
> touch_atime() for nowait reads and nowait writes.  You haven't done
> anything to fix those.
> 
> I suspect you can trim this patchset down significantly by avoiding
> fixing the file_accessed() problem.  And then come back with a later
> patchset that fixes it for all nowait i/o.  Or do a separate prep series

I'm ok to do that.

> first that fixes it for the existing nowait users, and then a second
> series to do all the directory stuff.
> 
> I'd do the first thing.  Just ignore the problem.  Directory atime
> updates cause I/O so rarely that you can afford to ignore it.  Almost
> everyone uses relatime or nodiratime.

Hi Matthew,
The previous discussion shows this does cause issues in real
producations: 
https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a30510f0@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-30  6:11         ` Hao Xu
@ 2023-09-03 22:30           ` Dave Chinner
  2023-09-08  0:29             ` Pavel Begunkov
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Chinner @ 2023-09-03 22:30 UTC (permalink / raw)
  To: Hao Xu
  Cc: Matthew Wilcox, io-uring, Jens Axboe, Dominique Martinet,
	Pavel Begunkov, Christian Brauner, Alexander Viro, Stefan Roesch,
	Clay Harris, Darrick J . Wong, linux-fsdevel, linux-xfs,
	linux-ext4, linux-cachefs, ecryptfs, linux-nfs, linux-unionfs,
	bpf, netdev, linux-s390, linux-kernel, linux-block, linux-btrfs,
	codalist, linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs,
	devel, linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> On 8/29/23 19:53, Matthew Wilcox wrote:
> > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu <howeyxu@tencent.com>
> > > > > 
> > > > > Add a boolean parameter for file_accessed() to support nowait semantics.
> > > > > Currently it is true only with io_uring as its initial caller.
> > > > 
> > > > So why do we need to do this as part of this series?  Apparently it
> > > > hasn't caused any problems for filemap_read().
> > > > 
> > > 
> > > We need this parameter to indicate if nowait semantics should be enforced in
> > > touch_atime(), There are locks and maybe IOs in it.
> > 
> > That's not my point.  We currently call file_accessed() and
> > touch_atime() for nowait reads and nowait writes.  You haven't done
> > anything to fix those.
> > 
> > I suspect you can trim this patchset down significantly by avoiding
> > fixing the file_accessed() problem.  And then come back with a later
> > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> 
> I'm ok to do that.
> 
> > first that fixes it for the existing nowait users, and then a second
> > series to do all the directory stuff.
> > 
> > I'd do the first thing.  Just ignore the problem.  Directory atime
> > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > everyone uses relatime or nodiratime.
> 
> Hi Matthew,
> The previous discussion shows this does cause issues in real
> producations: https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a30510f0@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> 

Then separate it out into it's own patch set so we can have a
discussion on the merits of requiring using noatime, relatime or
lazytime for really latency sensitive IO applications. Changing code
is not always the right solution...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/11] xfs: add NOWAIT semantics for readdir
  2023-08-27 13:28 ` [PATCH 02/11] xfs: add NOWAIT semantics for readdir Hao Xu
  2023-08-27 20:44   ` Matthew Wilcox
@ 2023-09-04  1:02   ` Dave Chinner
  1 sibling, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2023-09-04  1:02 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> Implement NOWAIT semantics for readdir. Return EAGAIN error to the
> caller if it would block, like failing to get locks, or going to
> do IO.
> 
> Co-developed-by: Dave Chinner <dchinner@redhat.com>

Not really.

"Co-developed" implies equal development input between all the
parties, which is not the case here - this patch is based on
prototype I wrote, whilst you're doing the refining, testing and
correctness work.

In these cases with XFS code, we add a line in the commit message to
say:

"This is based on a patch originally written by Dave Chinner."


> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Hao Xu <howeyxu@tencent.com>
> [fixes deadlock issue, tweak code style]

With a signoff chain like you already have.

In the end you'll also get a RVB from me, which seems rather wrong
to me if I've apparently been "co-developing" the code....

....

> @@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
>  	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
>  		return 0;
>  
> -	error = xfs_dir3_block_read(args->trans, dp, &bp);
> +	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> +		flags |= XFS_DABUF_NOWAIT;
> +	error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
>  	if (error)
>  		return error;
>  

Given we do this same check in both block and leaf formats to set
XFS_DABUF_NOWAIT, and we do the DIR_CONTEXT_F_NOWAIT check in
xfs_readdir() as well, we should probably do this check once at the
higher level and pass flags down from there with XFS_DABUF_NOWAIT
already set.

> @@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
>  STATIC int
>  xfs_dir2_leaf_readbuf(
>  	struct xfs_da_args	*args,
> +	struct dir_context	*ctx,
>  	size_t			bufsize,
>  	xfs_dir2_off_t		*cur_off,
>  	xfs_dablk_t		*ra_blk,
> @@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
>  	struct xfs_iext_cursor	icur;
>  	int			ra_want;
>  	int			error = 0;
> -
> -	error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> -	if (error)
> -		goto out;
> +	unsigned int		flags = 0;
> +
> +	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
> +		flags |= XFS_DABUF_NOWAIT;
> +	} else {
> +		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> +		if (error)
> +			goto out;
> +	}

Especially as, in hindsight, this doesn't make a whole lot of sense.
If XFS_DABUF_NOWAIT is set, we keep going until
xfs_ilock_data_map_shared_nowait() where we call
xfs_need_iread_extents() to see if we need to read the extents in
and abort at that point.

So, really, we shouldn't get this far with nowait semantics if
we haven't read the extents in yet - we're supposed to already have
the inode locked here and so we should have already checked this
condition before we bother locking the inode...

i.e. all we should be doing here is this:

	if (!(flags & XFS_DABUF_NOWAIT)) {
		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
		if (error)
			goto out;
	}

And then we don't need to pass the VFS dir_context down into low
level XFS functions unnecessarily.


>  
>  	/*
>  	 * Look for mapped directory blocks at or above the current offset.
> @@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
>  	new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
>  	if (new_off > *cur_off)
>  		*cur_off = new_off;
> -	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
> +	error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, &bp);
>  	if (error)
>  		goto out;
>  
> @@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents(
>  	int			byteoff;	/* offset in current block */
>  	unsigned int		offset = 0;
>  	int			error = 0;	/* error return value */
> +	int			written = 0;
>  
>  	/*
>  	 * If the offset is at or past the largest allowed value,
> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>  				bp = NULL;
>  			}
>  
> -			if (*lock_mode == 0)
> -				*lock_mode = xfs_ilock_data_map_shared(dp);
> -			error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff,
> -					&rablk, &bp);
> +			if (*lock_mode == 0) {
> +				*lock_mode =
> +					xfs_ilock_data_map_shared_generic(dp,
> +					ctx->flags & DIR_CONTEXT_F_NOWAIT);
> +				if (!*lock_mode) {
> +					error = -EAGAIN;
> +					break;
> +				}
> +			}
> +			error = xfs_dir2_leaf_readbuf(args, ctx, bufsize,
> +					&curoff, &rablk, &bp);

int
xfs_ilock_readdir(
	struct xfs_inode	*ip,
	int			flags)
{
	if (flags & XFS_DABUF_NOWAIT) {
		if (!xfs_ilock_nowait(dp, XFS_ILOCK_SHARED))
			return -EAGAIN;
		return XFS_ILOCK_SHARED;
	}
	return xfs_ilock_data_map_shared(dp);
}

And then this code simply becomes:

			if (*lock_mode == 0)
				*lock_mode = xfs_ilock_readdir(ip, flags);


>  			if (error || !bp)
>  				break;
>  
> @@ -479,6 +496,7 @@ xfs_dir2_leaf_getdents(
>  		 */
>  		offset += length;
>  		curoff += length;
> +		written += length;
>  		/* bufsize may have just been a guess; don't go negative */
>  		bufsize = bufsize > length ? bufsize - length : 0;
>  	}
> @@ -492,6 +510,8 @@ xfs_dir2_leaf_getdents(
>  		ctx->pos = xfs_dir2_byte_to_dataptr(curoff) & 0x7fffffff;
>  	if (bp)
>  		xfs_trans_brelse(args->trans, bp);
> +	if (error == -EAGAIN && written > 0)
> +		error = 0;
>  	return error;
>  }
>  
> @@ -514,6 +534,7 @@ xfs_readdir(
>  	unsigned int		lock_mode;
>  	bool			isblock;
>  	int			error;
> +	bool			nowait;
>  
>  	trace_xfs_readdir(dp);
>  
> @@ -531,7 +552,11 @@ xfs_readdir(
>  	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
>  		return xfs_dir2_sf_getdents(&args, ctx);
>  
> -	lock_mode = xfs_ilock_data_map_shared(dp);
> +	nowait = ctx->flags & DIR_CONTEXT_F_NOWAIT;
> +	lock_mode = xfs_ilock_data_map_shared_generic(dp, nowait);
> +	if (!lock_mode)
> +		return -EAGAIN;
> +

Given what I said above:

	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
		/*
		 * If we need to read extents, then we must do IO
		 * and we must use exclusive locking. We don't want
		 * to do either of those things, so just bail if we
		 * have to read extents. Doing this check explicitly
		 * here means we don't have to do it anywhere else
		 * in the XFS_DABUF_NOWAIT path.
		 */
		if (xfs_need_iread_extents(&ip->i_df))
			return -EAGAIN;
		flags |= XFS_DABUF_NOWAIT;
	}
	lock_mode = xfs_ilock_readdir(dp, flags);

And with this change, we probably should be marking the entire
operation as having nowait semantics. i.e. using args->op_flags here
and only use XFS_DABUF_NOWAIT for the actual IO. ie.

		args->op_flags |= XFS_DA_OP_NOWAIT;

This makes it clear that the entire directory op should run under
NOWAIT constraints, and it avoids needing to pass an extra flag
through the stack.  That then makes the readdir locking function
look like this:

/*
 * When we are locking an inode for readdir, we need to ensure that
 * the extents have been read in first. This requires the inode to
 * be locked exclusively across the extent read, but otherwise we
 * want to use shared locking.
 *
 * For XFS_DA_OP_NOWAIT operations, we do an up-front check to see
 * if the extents have been read in, so all we need to do in this
 * case is a shared try-lock as we never need exclusive locking in
 * this path.
 */
static int
xfs_ilock_readdir(
	struct xfs_da_args	*args)
{
	if (args->op_flags & XFS_DA_OP_NOWAIT) {
		if (!xfs_ilock_nowait(args->dp, XFS_ILOCK_SHARED))
			return -EAGAIN;
		return XFS_ILOCK_SHARED;
	}
	return xfs_ilock_data_map_shared(args->dp);
}

> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 9e62cc500140..d088f7d0c23a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -120,6 +120,33 @@ xfs_ilock_data_map_shared(
>  	return lock_mode;
>  }
>  
> +/*
> + * Similar to xfs_ilock_data_map_shared(), except that it will only try to lock
> + * the inode in shared mode if the extents are already in memory. If it fails to
> + * get the lock or has to do IO to read the extent list, fail the operation by
> + * returning 0 as the lock mode.
> + */
> +uint
> +xfs_ilock_data_map_shared_nowait(
> +	struct xfs_inode	*ip)
> +{
> +	if (xfs_need_iread_extents(&ip->i_df))
> +		return 0;
> +	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
> +		return 0;
> +	return XFS_ILOCK_SHARED;
> +}
> +
> +int
> +xfs_ilock_data_map_shared_generic(
> +	struct xfs_inode	*dp,
> +	bool			nowait)
> +{
> +	if (nowait)
> +		return xfs_ilock_data_map_shared_nowait(dp);
> +	return xfs_ilock_data_map_shared(dp);
> +}

And all this "generic" locking stuff goes away.

FWIW, IMO, "generic" is a poor name for an XFS function as there's
nothing "generic" in XFS.  We tend name the functions after what
they do, not some abstract concept. Leave "generic" as a keyword for
widely used core infrastructure functions, not niche, one-off use
cases like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 10/11] vfs: trylock inode->i_rwsem in iterate_dir() to support nowait
  2023-08-27 13:28 ` [PATCH 10/11] vfs: trylock inode->i_rwsem in iterate_dir() to support nowait Hao Xu
@ 2023-09-04  9:37   ` Christian Brauner
  0 siblings, 0 replies; 28+ messages in thread
From: Christian Brauner @ 2023-09-04  9:37 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:34PM +0800, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> Trylock inode->i_rwsem in iterate_dir() to support nowait semantics and
> error out -EAGAIN when there is contention.
> 
> Signed-off-by: Hao Xu <howeyxu@tencent.com>
> ---

Unreviewable until you rebased on -rc1 as far as I'm concerned because
the code in here changed a lot.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-08-27 13:28 ` [PATCH 07/11] vfs: add nowait parameter for file_accessed() Hao Xu
  2023-08-27 21:32   ` Matthew Wilcox
@ 2023-09-04  9:51   ` Christian Brauner
  1 sibling, 0 replies; 28+ messages in thread
From: Christian Brauner @ 2023-09-04  9:51 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> From: Hao Xu <howeyxu@tencent.com>
> 
> Add a boolean parameter for file_accessed() to support nowait semantics.
> Currently it is true only with io_uring as its initial caller.
> 
> Signed-off-by: Hao Xu <howeyxu@tencent.com>
> ---
>  arch/s390/hypfs/inode.c | 2 +-
>  block/fops.c            | 2 +-
>  fs/btrfs/file.c         | 2 +-
>  fs/btrfs/inode.c        | 2 +-
>  fs/coda/dir.c           | 4 ++--
>  fs/ext2/file.c          | 4 ++--
>  fs/ext4/file.c          | 6 +++---
>  fs/f2fs/file.c          | 4 ++--
>  fs/fuse/dax.c           | 2 +-
>  fs/fuse/file.c          | 4 ++--
>  fs/gfs2/file.c          | 2 +-
>  fs/hugetlbfs/inode.c    | 2 +-
>  fs/nilfs2/file.c        | 2 +-
>  fs/orangefs/file.c      | 2 +-
>  fs/orangefs/inode.c     | 2 +-
>  fs/pipe.c               | 2 +-
>  fs/ramfs/file-nommu.c   | 2 +-
>  fs/readdir.c            | 2 +-
>  fs/smb/client/cifsfs.c  | 2 +-
>  fs/splice.c             | 2 +-
>  fs/ubifs/file.c         | 2 +-
>  fs/udf/file.c           | 2 +-
>  fs/xfs/xfs_file.c       | 6 +++---
>  fs/zonefs/file.c        | 4 ++--
>  include/linux/fs.h      | 5 +++--
>  mm/filemap.c            | 8 ++++----
>  mm/shmem.c              | 6 +++---
>  27 files changed, 43 insertions(+), 42 deletions(-)
> 
> diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
> index ee919bfc8186..55f562027c4f 100644
> --- a/arch/s390/hypfs/inode.c
> +++ b/arch/s390/hypfs/inode.c
> @@ -157,7 +157,7 @@ static ssize_t hypfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  	if (!count)
>  		return -EFAULT;
>  	iocb->ki_pos = pos + count;
> -	file_accessed(file);
> +	file_accessed(file, false);

Why? If all you do is skip atime update anyway then just add something
like:

bool file_needs_atime(struct file *file)
{
       return !(file->f_flags & O_NOATIME) &&
              atime_needs_update(&file->f_path, d_inode(path->dentry));
}

and then

if (file_needs_atime(file) && IOURING_WANTS_ASYNC)
	return -EAGAIN;

instead of touching all this code.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v6 00/11] io_uring getdents
  2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
                   ` (10 preceding siblings ...)
  2023-08-27 13:28 ` [PATCH 11/11] io_uring: add support for getdents Hao Xu
@ 2023-09-04  9:57 ` Christian Brauner
  11 siblings, 0 replies; 28+ messages in thread
From: Christian Brauner @ 2023-09-04  9:57 UTC (permalink / raw)
  To: Hao Xu
  Cc: io-uring, Jens Axboe, Dominique Martinet, Pavel Begunkov,
	Alexander Viro, Stefan Roesch, Clay Harris, Dave Chinner,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Sun, Aug 27, 2023 at 09:28:24PM +0800, Hao Xu wrote:

For the future it would be helpful to hold of on sending larger series
that like this until a stable tag is out.

Right now this series is generating a bunch of merge conflicts because
of all the changes to relevant codepaths that got merged. So either we
have to resolve them to see whether things still make sense within the
context of all the changed code or risk that stuff we comment is outdated.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-09-03 22:30           ` Dave Chinner
@ 2023-09-08  0:29             ` Pavel Begunkov
  2023-09-10 22:01               ` Dave Chinner
  0 siblings, 1 reply; 28+ messages in thread
From: Pavel Begunkov @ 2023-09-08  0:29 UTC (permalink / raw)
  To: Dave Chinner, Hao Xu
  Cc: Matthew Wilcox, io-uring, Jens Axboe, Dominique Martinet,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On 9/3/23 23:30, Dave Chinner wrote:
> On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
>> On 8/29/23 19:53, Matthew Wilcox wrote:
>>> On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
>>>> On 8/28/23 05:32, Matthew Wilcox wrote:
>>>>> On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
>>>>>> From: Hao Xu <howeyxu@tencent.com>
>>>>>>
>>>>>> Add a boolean parameter for file_accessed() to support nowait semantics.
>>>>>> Currently it is true only with io_uring as its initial caller.
>>>>>
>>>>> So why do we need to do this as part of this series?  Apparently it
>>>>> hasn't caused any problems for filemap_read().
>>>>>
>>>>
>>>> We need this parameter to indicate if nowait semantics should be enforced in
>>>> touch_atime(), There are locks and maybe IOs in it.
>>>
>>> That's not my point.  We currently call file_accessed() and
>>> touch_atime() for nowait reads and nowait writes.  You haven't done
>>> anything to fix those.
>>>
>>> I suspect you can trim this patchset down significantly by avoiding
>>> fixing the file_accessed() problem.  And then come back with a later
>>> patchset that fixes it for all nowait i/o.  Or do a separate prep series
>>
>> I'm ok to do that.
>>
>>> first that fixes it for the existing nowait users, and then a second
>>> series to do all the directory stuff.
>>>
>>> I'd do the first thing.  Just ignore the problem.  Directory atime
>>> updates cause I/O so rarely that you can afford to ignore it.  Almost
>>> everyone uses relatime or nodiratime.
>>
>> Hi Matthew,
>> The previous discussion shows this does cause issues in real
>> producations: https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a30510f0@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
>>
> 
> Then separate it out into it's own patch set so we can have a
> discussion on the merits of requiring using noatime, relatime or
> lazytime for really latency sensitive IO applications. Changing code
> is not always the right solution...

Separation sounds reasonable, but it can hardly be said that only
latency sensitive apps would care about >1s nowait/async submission
delays. Presumably, btrfs can improve on that, but it still looks
like it's perfectly legit for filesystems do heavy stuff in
timestamping like waiting for IO. Right?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/11] vfs: add nowait parameter for file_accessed()
  2023-09-08  0:29             ` Pavel Begunkov
@ 2023-09-10 22:01               ` Dave Chinner
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Chinner @ 2023-09-10 22:01 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Hao Xu, Matthew Wilcox, io-uring, Jens Axboe, Dominique Martinet,
	Christian Brauner, Alexander Viro, Stefan Roesch, Clay Harris,
	Darrick J . Wong, linux-fsdevel, linux-xfs, linux-ext4,
	linux-cachefs, ecryptfs, linux-nfs, linux-unionfs, bpf, netdev,
	linux-s390, linux-kernel, linux-block, linux-btrfs, codalist,
	linux-f2fs-devel, cluster-devel, linux-mm, linux-nilfs, devel,
	linux-cifs, samba-technical, linux-mtd, Wanpeng Li

On Fri, Sep 08, 2023 at 01:29:55AM +0100, Pavel Begunkov wrote:
> On 9/3/23 23:30, Dave Chinner wrote:
> > On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> > > On 8/29/23 19:53, Matthew Wilcox wrote:
> > > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > > > From: Hao Xu <howeyxu@tencent.com>
> > > > > > > 
> > > > > > > Add a boolean parameter for file_accessed() to support nowait semantics.
> > > > > > > Currently it is true only with io_uring as its initial caller.
> > > > > > 
> > > > > > So why do we need to do this as part of this series?  Apparently it
> > > > > > hasn't caused any problems for filemap_read().
> > > > > > 
> > > > > 
> > > > > We need this parameter to indicate if nowait semantics should be enforced in
> > > > > touch_atime(), There are locks and maybe IOs in it.
> > > > 
> > > > That's not my point.  We currently call file_accessed() and
> > > > touch_atime() for nowait reads and nowait writes.  You haven't done
> > > > anything to fix those.
> > > > 
> > > > I suspect you can trim this patchset down significantly by avoiding
> > > > fixing the file_accessed() problem.  And then come back with a later
> > > > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> > > 
> > > I'm ok to do that.
> > > 
> > > > first that fixes it for the existing nowait users, and then a second
> > > > series to do all the directory stuff.
> > > > 
> > > > I'd do the first thing.  Just ignore the problem.  Directory atime
> > > > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > > > everyone uses relatime or nodiratime.
> > > 
> > > Hi Matthew,
> > > The previous discussion shows this does cause issues in real
> > > producations: https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a30510f0@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> > > 
> > 
> > Then separate it out into it's own patch set so we can have a
> > discussion on the merits of requiring using noatime, relatime or
> > lazytime for really latency sensitive IO applications. Changing code
> > is not always the right solution...
> 
> Separation sounds reasonable, but it can hardly be said that only
> latency sensitive apps would care about >1s nowait/async submission
> delays. Presumably, btrfs can improve on that, but it still looks
> like it's perfectly legit for filesystems do heavy stuff in
> timestamping like waiting for IO. Right?

Yes, it is, no-one is denying that. And some filesystems are worse
than others, but none of that means it has to be fixed so getdents
can be converted to NOWAIT semantics.

ie. this patchset is about the getdents NOWAIT machinery, and
fiddling around with timestamps has much, much wider scope than just
NOWAIT getdents machinery. We'll have this discussion about NOWAIT
timestamp updates when a RFC is proposed to address the wider
problem of how timestamp updates should behave in NOWAIT context.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-09-10 22:01 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-27 13:28 [PATCH v6 00/11] io_uring getdents Hao Xu
2023-08-27 13:28 ` [PATCH 01/11] fs: split off vfs_getdents function of getdents64 syscall Hao Xu
2023-08-27 13:28 ` [PATCH 02/11] xfs: add NOWAIT semantics for readdir Hao Xu
2023-08-27 20:44   ` Matthew Wilcox
2023-08-29  7:41     ` Hao Xu
2023-08-29 13:05       ` Matthew Wilcox
2023-09-04  1:02   ` Dave Chinner
2023-08-27 13:28 ` [PATCH 03/11] vfs: add nowait flag for struct dir_context Hao Xu
2023-08-27 13:28 ` [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock Hao Xu
2023-08-27 20:47   ` Matthew Wilcox
2023-08-27 13:28 ` [PATCH 05/11] vfs: add file_pos_unlock() for io_uring usage Hao Xu
2023-08-27 13:28 ` [PATCH 06/11] vfs: add a nowait parameter for touch_atime() Hao Xu
2023-08-27 13:28 ` [PATCH 07/11] vfs: add nowait parameter for file_accessed() Hao Xu
2023-08-27 21:32   ` Matthew Wilcox
2023-08-29  7:46     ` Hao Xu
2023-08-29 11:53       ` Matthew Wilcox
2023-08-30  6:11         ` Hao Xu
2023-09-03 22:30           ` Dave Chinner
2023-09-08  0:29             ` Pavel Begunkov
2023-09-10 22:01               ` Dave Chinner
2023-09-04  9:51   ` Christian Brauner
2023-08-27 13:28 ` [PATCH 08/11] vfs: move file_accessed() to the beginning of iterate_dir() Hao Xu
2023-08-27 13:28 ` [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated Hao Xu
2023-08-27 20:51   ` Matthew Wilcox
2023-08-27 13:28 ` [PATCH 10/11] vfs: trylock inode->i_rwsem in iterate_dir() to support nowait Hao Xu
2023-09-04  9:37   ` Christian Brauner
2023-08-27 13:28 ` [PATCH 11/11] io_uring: add support for getdents Hao Xu
2023-09-04  9:57 ` [PATCH v6 00/11] io_uring getdents Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).