* [RFC PATCH 0/2] dirreadahead system call
@ 2014-07-25 17:37 ` Abhi Das
  0 siblings, 0 replies; 49+ messages in thread
From: Abhi Das @ 2014-07-25 17:37 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, cluster-devel; +Cc: Abhi Das

This system call takes 3 arguments:
fd      - file descriptor of the directory to be read ahead
*offset - offset in the directory from which to resume. This is
          updated as we move along in the directory
count   - the maximum number of entries to read ahead

The syscall reads up to 'count' entries starting at '*offset' and
caches the inodes corresponding to those entries. It returns a
negative error code, or a positive number indicating the number of
inodes it has issued readaheads for. It also updates the '*offset'
value so that repeated calls to dirreadahead can resume at the right
location. It returns 0 when there are no more entries left.
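
As a rough illustration of the intended usage (this example is not part
of the patches; there is no glibc wrapper, the x86_64 syscall number 317
is simply the one added by patch 1, and the 4096-entry chunk size is
arbitrary), an "ls -l"-style scan would issue the readahead first and
then do its normal readdir()/stat() pass:

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#define __NR_dirreadahead 317	/* from syscall_64.tbl in patch 1 */

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	long long off = 0;	/* the kernel side treats this as a loff_t */
	struct dirent *d;
	struct stat st;
	DIR *dir;
	long n;
	int fd = open(path, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return 1;

	/* Ask the kernel to prefetch the directory's inodes, 4096 at a
	 * time, resuming from 'off' on each call, until it returns 0 (no
	 * more entries) or an error. */
	do {
		n = syscall(__NR_dirreadahead, fd, &off, 4096);
	} while (n > 0);

	/* The usual readdir() + stat() walk; the stat()s should now mostly
	 * find their inodes already cached by the readahead above. */
	dir = fdopendir(fd);
	if (!dir) {
		close(fd);
		return 1;
	}
	while ((d = readdir(dir)) != NULL)
		fstatat(dirfd(dir), d->d_name, &st, AT_SYMLINK_NOFOLLOW);
	closedir(dir);
	return 0;
}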

Abhi Das (2):
  fs: Add dirreadahead syscall and VFS hooks
  gfs2: GFS2's implementation of the dir_readahead file operation

 arch/x86/syscalls/syscall_32.tbl |   1 +
 arch/x86/syscalls/syscall_64.tbl |   1 +
 fs/gfs2/Makefile                 |   3 +-
 fs/gfs2/dir.c                    |  49 ++++++---
 fs/gfs2/dir.h                    |  15 +++
 fs/gfs2/dir_readahead.c          | 209 +++++++++++++++++++++++++++++++++++++++
 fs/gfs2/file.c                   |   2 +
 fs/gfs2/main.c                   |  10 +-
 fs/gfs2/super.c                  |   1 +
 fs/readdir.c                     |  49 +++++++++
 include/linux/fs.h               |   3 +
 11 files changed, 328 insertions(+), 15 deletions(-)
 create mode 100644 fs/gfs2/dir_readahead.c

-- 
1.8.1.4



* [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks
  2014-07-25 17:37 ` Abhi Das
@ 2014-07-25 17:37   ` Abhi Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhi Das @ 2014-07-25 17:37 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, cluster-devel; +Cc: Abhi Das

This also adds a void *opaque field to struct dir_context that
filesystems can use to temporarily stash private state while the
struct gets passed around inside the filesystem.
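
As a rough sketch of how the new field is meant to be consumed (the
"myfs" names below are made up purely for illustration; GFS2's real
consumer of ctx->opaque is in patch 2 of this series), a filesystem's
->dir_readahead can hang its per-call state off the context and pick it
up again wherever its directory iterator hands the context back:

#include <linux/fs.h>
#include <linux/types.h>

/* Hypothetical per-call readahead state for an imaginary "myfs" */
struct myfs_ra_state {
	u64 *inos;		/* inode numbers collected so far */
	unsigned int count;	/* how many have been collected */
	unsigned int max;	/* cap requested by the caller */
};

/* Called from deep inside myfs's directory iterator for each entry */
static int myfs_collect_ino(struct dir_context *ctx, u64 ino)
{
	struct myfs_ra_state *ra = ctx->opaque;

	if (!ra || ra->count >= ra->max)
		return 1;	/* stop iterating, we have enough */
	ra->inos[ra->count++] = ino;
	return 0;
}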

Signed-off-by: Abhi Das <adas@redhat.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 fs/readdir.c                     | 49 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h               |  3 +++
 4 files changed, 54 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..3e0ef85 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
 353	i386	renameat2		sys_renameat2
+354	i386	dirreadahead		sys_dirreadahead
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..2ec0991 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	dirreadahead		sys_dirreadahead
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/readdir.c b/fs/readdir.c
index 33fd922..d216db7 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -198,6 +198,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd,
 	struct linux_dirent __user * lastdirent;
 	struct getdents_callback buf = {
 		.ctx.actor = filldir,
+		.ctx.opaque = NULL,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -278,6 +279,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	struct linux_dirent64 __user * lastdirent;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
+		.ctx.opaque = NULL,
 		.count = count,
 		.current_dir = dirent
 	};
@@ -304,3 +306,50 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
 	fdput(f);
 	return error;
 }
+
+SYSCALL_DEFINE3(dirreadahead, unsigned int, fd,
+		loff_t __user *, offset, unsigned int, count)
+{
+	struct fd f;
+	struct inode *inode;
+	int error = -ENOTDIR;
+	loff_t off = 0;
+	struct dir_context ctx = {.actor = NULL, .opaque = NULL};
+
+	if (!count)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	inode = f.file->f_path.dentry->d_inode;
+
+	error = -ENOTSUPP;
+	if (!f.file->f_op || !f.file->f_op->dir_readahead)
+		goto out;
+
+	error = security_file_permission(f.file, MAY_READ);
+	if (error)
+		goto out;
+
+	error = -EFAULT;
+	if (__get_user(ctx.pos, offset))
+		goto out;
+
+	error = mutex_lock_killable(&inode->i_mutex);
+	if (error)
+		goto out;
+
+	error = -ENOENT;
+	if (!IS_DEADDIR(inode)) {
+		error = f.file->f_op->dir_readahead(f.file, &ctx, count);
+		if (__put_user(ctx.pos, offset))
+			error = -EFAULT;
+		file_accessed(f.file);
+	}
+	mutex_unlock(&inode->i_mutex);
+out:
+	fdput(f);
+	return error;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 338e6f7..fae4a6e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1438,9 +1438,11 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
  * to have different dirent layouts depending on the binary type.
  */
 typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
+
 struct dir_context {
 	const filldir_t actor;
 	loff_t pos;
+	void *opaque;
 };
 
 struct block_device_operations;
@@ -1463,6 +1465,7 @@ struct file_operations {
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
 	int (*iterate) (struct file *, struct dir_context *);
+	int (*dir_readahead) (struct file *, struct dir_context *, unsigned int);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
-- 
1.8.1.4



* [RFC PATCH 2/2] gfs2: GFS2's implementation of the dir_readahead file operation
  2014-07-25 17:37 ` Abhi Das
@ 2014-07-25 17:37   ` Abhi Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhi Das @ 2014-07-25 17:37 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, cluster-devel; +Cc: Abhi Das

This patch adds support to GFS2 for the dirreadahead system call.

There's a hard limit (128K) on the number of inodes that can be
read ahead at one time. There's also a memory cap on the number of
inode numbers we collect. Inode numbers are collected until the
first of these two caps is hit, and readahead is then issued for
whatever has been collected.

Readahead is done in two stages. In the initial stage, the
directory is read through and the inode numbers of all its entries
are collected. In the second stage, readaheads are performed
asynchronously using workqueues, so the syscall can return to the
user at this stage. Subsequent getdents calls on the directory and
stat calls on the inodes will have the time-consuming lookups
already done for them and will therefore be quick.

Signed-off-by: Abhi Das <adas@redhat.com>
---
 fs/gfs2/Makefile        |   3 +-
 fs/gfs2/dir.c           |  49 +++++++++---
 fs/gfs2/dir.h           |  15 ++++
 fs/gfs2/dir_readahead.c | 209 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/gfs2/file.c          |   2 +
 fs/gfs2/main.c          |  10 ++-
 fs/gfs2/super.c         |   1 +
 7 files changed, 274 insertions(+), 15 deletions(-)
 create mode 100644 fs/gfs2/dir_readahead.c

diff --git a/fs/gfs2/Makefile b/fs/gfs2/Makefile
index 8612820..2765c83 100644
--- a/fs/gfs2/Makefile
+++ b/fs/gfs2/Makefile
@@ -4,7 +4,8 @@ gfs2-y := acl.o bmap.o dir.o xattr.o glock.o \
 	glops.o log.o lops.o main.o meta_io.o \
 	aops.o dentry.o export.o file.o \
 	ops_fstype.o inode.o quota.o \
-	recovery.o rgrp.o super.o sys.o trans.o util.o
+	recovery.o rgrp.o super.o sys.o trans.o \
+	dir_readahead.o util.o
 
 gfs2-$(CONFIG_GFS2_FS_LOCKING_DLM) += lock_dlm.o
 
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 1a349f9..f068763 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1217,6 +1217,20 @@ static int compare_dents(const void *a, const void *b)
 	return ret;
 }
 
+static int gfs2_dirent_dot_or_dotdot(const struct gfs2_dirent *dent)
+{
+	const char *name = (char *)(dent + 1);
+
+	if (be16_to_cpu(dent->de_type) == DT_DIR) {
+		if (be16_to_cpu(dent->de_name_len) == 1 && name[0] == '.')
+			return 1;
+		if (be16_to_cpu(dent->de_name_len) == 2 &&
+		    strncmp(name, "..", 2) == 0)
+			return 1;
+	}
+	return 0;   
+}
+
 /**
  * do_filldir_main - read out directory entries
  * @dip: The GFS2 inode
@@ -1262,8 +1276,12 @@ static int do_filldir_main(struct gfs2_inode *dip, struct dir_context *ctx,
 			ctx->pos = off;
 
 			if (off_next == off) {
-				if (*copied && !run)
+				if (*copied && !run) {
+					struct gfs2_dir_ra *d_ra = ctx->opaque;
+					if (d_ra)
+						set_bit(RA_FL_HASHCOLL, &d_ra->flags);
 					return 1;
+				}
 				run = 1;
 			} else
 				run = 0;
@@ -1273,11 +1291,18 @@ static int do_filldir_main(struct gfs2_inode *dip, struct dir_context *ctx,
 			ctx->pos = off;
 		}
 
-		if (!dir_emit(ctx, (const char *)(dent + 1),
-				be16_to_cpu(dent->de_name_len),
-				be64_to_cpu(dent->de_inum.no_addr),
-				be16_to_cpu(dent->de_type)))
-			return 1;
+		if (ctx->actor) {
+			if (!dir_emit(ctx, (const char *)(dent + 1),
+				      be16_to_cpu(dent->de_name_len),
+				      be64_to_cpu(dent->de_inum.no_addr),
+				      be16_to_cpu(dent->de_type)))
+				return 1;
+		} else { /* we were called by dir_readahead */
+			if (gfs2_dirent_dot_or_dotdot(dent))
+				continue;
+			if (collect_inode_blocks(ctx, be64_to_cpu(dent->de_inum.no_addr)))
+				return 1;
+		}
 
 		*copied = 1;
 	}
@@ -1311,8 +1336,7 @@ static void gfs2_free_sort_buffer(void *ptr)
 }
 
 static int gfs2_dir_read_leaf(struct inode *inode, struct dir_context *ctx,
-			      int *copied, unsigned *depth,
-			      u64 leaf_no)
+			      int *copied, unsigned *depth, u64 leaf_no)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
@@ -1399,14 +1423,14 @@ out:
 }
 
 /**
- * gfs2_dir_readahead - Issue read-ahead requests for leaf blocks.
+ * gfs2_dir_leaf_ra - Issue read-ahead requests for leaf blocks.
  *
  * Note: we can't calculate each index like dir_e_read can because we don't
  * have the leaf, and therefore we don't have the depth, and therefore we
  * don't have the length. So we have to just read enough ahead to make up
  * for the loss of information.
  */
-static void gfs2_dir_readahead(struct inode *inode, unsigned hsize, u32 index,
+static void gfs2_dir_leaf_ra(struct inode *inode, unsigned hsize, u32 index,
 			       struct file_ra_state *f_ra)
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
@@ -1474,11 +1498,10 @@ static int dir_e_read(struct inode *inode, struct dir_context *ctx,
 	if (IS_ERR(lp))
 		return PTR_ERR(lp);
 
-	gfs2_dir_readahead(inode, hsize, index, f_ra);
+	gfs2_dir_leaf_ra(inode, hsize, index, f_ra);
 
 	while (index < hsize) {
-		error = gfs2_dir_read_leaf(inode, ctx,
-					   &copied, &depth,
+		error = gfs2_dir_read_leaf(inode, ctx, &copied, &depth,
 					   be64_to_cpu(lp[index]));
 		if (error)
 			break;
diff --git a/fs/gfs2/dir.h b/fs/gfs2/dir.h
index 126c65d..429eea9 100644
--- a/fs/gfs2/dir.h
+++ b/fs/gfs2/dir.h
@@ -25,6 +25,21 @@ struct gfs2_diradd {
 	struct buffer_head *bh;
 };
 
+extern struct workqueue_struct *gfs2_dir_ra_wq;
+#define RA_MAX_INOS     131072 /*128K */
+#define RA_FL_HASHCOLL  1
+
+struct gfs2_dir_ra {
+	u64          *inos;
+	size_t        size;
+	size_t        count;
+	unsigned int  req;
+	unsigned long flags;
+};
+
+extern int gfs2_dir_readahead(struct file *file, struct dir_context *ctx,
+				      unsigned int count);
+extern int collect_inode_blocks(struct dir_context *ctx, u64 offset);
 extern struct inode *gfs2_dir_search(struct inode *dir,
 				     const struct qstr *filename,
 				     bool fail_on_exist);
diff --git a/fs/gfs2/dir_readahead.c b/fs/gfs2/dir_readahead.c
new file mode 100644
index 0000000..98888ad
--- /dev/null
+++ b/fs/gfs2/dir_readahead.c
@@ -0,0 +1,209 @@
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/buffer_head.h>
+#include <linux/pagemap.h>
+#include <linux/uio.h>
+#include <linux/blkdev.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+#include <linux/gfs2_ondisk.h>
+#include <linux/falloc.h>
+#include <linux/swap.h>
+#include <linux/crc32.h>
+#include <linux/writeback.h>
+#include <asm/uaccess.h>
+#include <linux/dlm.h>
+#include <linux/dlm_plock.h>
+
+#include "gfs2.h"
+#include "incore.h"
+#include "bmap.h"
+#include "dir.h"
+#include "glock.h"
+#include "glops.h"
+#include "inode.h"
+#include "log.h"
+#include "meta_io.h"
+#include "quota.h"
+#include "rgrp.h"
+#include "trans.h"
+#include "util.h"
+
+struct workqueue_struct *gfs2_dir_ra_wq;
+
+static int compare_inos(const void *a, const void *b)
+{
+	u64 ino_a, ino_b;
+
+	ino_a = *(u64 *)a;
+	ino_b = *(u64 *)b;
+
+	if (ino_a > ino_b)
+		return 1;
+	return -1;
+}
+
+static int collect_more(struct gfs2_dir_ra *d_ra)
+{
+	return (d_ra->count < d_ra->req &&
+		(d_ra->count * sizeof(u64)) < d_ra->size);
+}
+
+int collect_inode_blocks(struct dir_context *ctx, u64 ino)
+{
+	struct gfs2_dir_ra *d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+
+	if (!collect_more(d_ra))
+		return 1; /* Collected requested blocks */
+
+	d_ra->inos[d_ra->count++] = ino;
+	return 0;
+}
+
+struct dir_ra_work {
+	struct work_struct work;
+	u64 ino;
+	struct gfs2_sbd *sdp;
+};
+
+static void dir_ra_work_func(struct work_struct *work)
+{
+	struct dir_ra_work *w = container_of(work, struct dir_ra_work, work);
+
+	/* XXX: What to do if sdp disappears by the time we get here? */
+	struct inode *inode = gfs2_lookup_by_inum(w->sdp, w->ino, NULL,
+						  GFS2_BLKST_DINODE);
+	if (IS_ERR(inode)) {
+		fs_err(w->sdp, "can't read in inode at addr:%llu: %ld\n",
+		       w->ino, PTR_ERR(inode));
+	} else {
+		/* only refresh and release the inode if the lookup succeeded */
+		gfs2_inode_refresh(GFS2_I(inode));
+		iput(inode);
+	}
+	kfree(work);
+}
+
+int gfs2_queue_dir_ra(struct dir_context *ctx, struct gfs2_sbd *sdp)
+{
+	int i;
+	struct gfs2_dir_ra *d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+
+	sort(d_ra->inos, d_ra->count, sizeof(u64), compare_inos, NULL);
+
+	for (i=0; i<d_ra->count; i++) {
+		struct dir_ra_work *w;
+
+		w = kmalloc(sizeof(struct dir_ra_work), GFP_NOFS | __GFP_NOWARN);
+		if (!w)
+			break;
+
+		w->ino = d_ra->inos[i];
+		w->sdp = sdp;
+		INIT_WORK(&w->work, dir_ra_work_func);
+		queue_work(gfs2_dir_ra_wq, &w->work);
+	}
+	if (!i)
+		return -ENOMEM;
+	if (i != d_ra->count)
+		ctx->pos = 0; /* Don't know the resume offset for a short RA */
+	return i;
+}
+
+static inline unsigned int compute_ra_bufsize(unsigned int count)
+{
+	unsigned int size = count * (sizeof(u64));
+
+	if (size > KMALLOC_MAX_SIZE)
+		return KMALLOC_MAX_SIZE;
+	if (size < KMALLOC_MIN_SIZE)
+		return KMALLOC_MIN_SIZE;
+
+	return size;
+}
+
+static int init_ra_context(struct gfs2_inode *ip, struct dir_context *ctx,
+			   unsigned int count)
+{
+	unsigned int bufsize;
+	struct gfs2_dir_ra *d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+
+	memset(d_ra, 0, sizeof(struct gfs2_dir_ra));
+	count = count > RA_MAX_INOS ? RA_MAX_INOS : count;
+	count = count > ip->i_entries ? ip->i_entries : count;
+
+	bufsize = compute_ra_bufsize(count);
+	d_ra->inos = kmalloc(bufsize, GFP_NOFS | __GFP_NOWARN);
+	if (!d_ra->inos)
+		return -ENOMEM;
+
+	d_ra->size = bufsize;
+	d_ra->req = count;
+
+	return 0;
+}
+
+static void uninit_ra_context(struct dir_context *ctx)
+{
+	struct gfs2_dir_ra *d_ra;
+
+	if (!ctx || !ctx->opaque)
+		return;
+	d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+	if (d_ra->inos)
+		kfree(d_ra->inos);
+	memset(d_ra, 0, sizeof(struct gfs2_dir_ra));
+}
+/**
+ * gfs2_dir_readahead - GFS2's implementation of readdir readahead
+ * @file  : The directory to be read from
+ * @ctx   : Context contains buffer to collect inode numbers
+ *
+ * Readahead inode disk blocks (and extended attribute blocks if requested)
+ * of every directory entry
+ *
+ * Returns: +ve number: The number of entries for which readahead calls
+ *                      were issued
+ *          -ve values: For error conditions
+ */
+int gfs2_dir_readahead(struct file *file, struct dir_context *ctx, unsigned int count)
+{
+	int error = -EINVAL;
+	struct inode *dir = file->f_mapping->host;
+	struct gfs2_inode *dip = GFS2_I(dir);
+	struct gfs2_holder d_gh;
+	struct gfs2_dir_ra d_ra;
+
+	if (!ctx)
+		goto out;
+
+	ctx->opaque = &d_ra;
+	error = init_ra_context(dip, ctx, count);
+	if (error)
+		goto out;
+
+	gfs2_holder_init(dip->i_gl, LM_ST_SHARED, 0, &d_gh);
+	error = gfs2_glock_nq(&d_gh);
+	if (error) {
+		gfs2_holder_uninit(&d_gh);
+		goto uninit;
+	}
+
+retry:
+	error = gfs2_dir_read(dir, ctx, &file->f_ra);
+	if (test_bit(RA_FL_HASHCOLL, &d_ra.flags)) {
+		clear_bit(RA_FL_HASHCOLL, &d_ra.flags);
+		goto retry;
+	}
+
+	/* Pass the collected inos to the workqueues to be read ahead */ 
+	if (d_ra.count)
+		error = gfs2_queue_dir_ra(ctx, GFS2_SB(dir));
+
+	gfs2_glock_dq_uninit(&d_gh);
+uninit:
+	uninit_ra_context(ctx);
+out:
+	return error;
+}
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 26b3f95..6135bb9 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1075,6 +1075,7 @@ const struct file_operations gfs2_file_fops = {
 
 const struct file_operations gfs2_dir_fops = {
 	.iterate	= gfs2_readdir,
+	.dir_readahead  = gfs2_dir_readahead,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.open		= gfs2_open,
 	.release	= gfs2_release,
@@ -1105,6 +1106,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 
 const struct file_operations gfs2_dir_fops_nolock = {
 	.iterate	= gfs2_readdir,
+	.dir_readahead  = gfs2_dir_readahead,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.open		= gfs2_open,
 	.release	= gfs2_release,
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 82b6ac8..71e8ce5 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -161,9 +161,14 @@ static int __init init_gfs2_fs(void)
 	if (!gfs2_control_wq)
 		goto fail_recovery;
 
+	gfs2_dir_ra_wq = alloc_workqueue("gfs2_dir_ra",
+					 WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
+	if (!gfs2_dir_ra_wq)
+		goto fail_control;
+
 	gfs2_page_pool = mempool_create_page_pool(64, 0);
 	if (!gfs2_page_pool)
-		goto fail_control;
+		goto fail_ra;
 
 	gfs2_register_debugfs();
 
@@ -171,6 +176,8 @@ static int __init init_gfs2_fs(void)
 
 	return 0;
 
+fail_ra:
+	destroy_workqueue(gfs2_dir_ra_wq);
 fail_control:
 	destroy_workqueue(gfs2_control_wq);
 fail_recovery:
@@ -224,6 +231,7 @@ static void __exit exit_gfs2_fs(void)
 	unregister_filesystem(&gfs2meta_fs_type);
 	destroy_workqueue(gfs_recovery_wq);
 	destroy_workqueue(gfs2_control_wq);
+	destroy_workqueue(gfs2_dir_ra_wq);
 	list_lru_destroy(&gfs2_qd_lru);
 
 	rcu_barrier();
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 669e92e..8654636 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -849,6 +849,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 	kthread_stop(sdp->sd_quotad_process);
 	kthread_stop(sdp->sd_logd_process);
 
+	flush_workqueue(gfs2_dir_ra_wq);
 	flush_workqueue(gfs2_delete_workqueue);
 	gfs2_quota_sync(sdp->sd_vfs, 0);
 	gfs2_statfs_sync(sdp->sd_vfs, 0);
-- 
1.8.1.4



* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-25 17:37 ` Abhi Das
@ 2014-07-26  5:27   ` Andreas Dilger
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2014-07-26  5:27 UTC (permalink / raw)
  To: Abhi Das; +Cc: linux-kernel, linux-fsdevel, cluster-devel

Is there a time when this doesn't get called to prefetch entries in
readdir() order?  It isn't clear to me what benefit there is in returning
the entries to userspace instead of just doing the statahead implicitly
in the kernel.

The Lustre client has had what we call "statahead" for a while,
and similar to regular file readahead it detects the sequential access
pattern for readdir() + stat() in readdir() order (taking into account
whether ".*" entries are being processed or not) and starts fetching the
inode attributes asynchronously with a worker thread.

This syscall might be more useful if userspace called readdir() to get
the dirents and then passed the kernel the list of inode numbers
to prefetch before starting on the stat() calls. That way, userspace
could generate an arbitrary list of inodes (e.g. names matching a
regexp) and the kernel doesn't need to guess if every inode is needed. 

As it stands, this syscall doesn't help with anything other than readdir
order (or if the directory is small enough to be handled in one
syscall), which could be handled by the kernel internally already,
and it may fetch a considerable number of extra inodes from
disk if not every inode needs to be touched.

Cheers, Andreas

> On Jul 25, 2014, at 11:37, Abhi Das <adas@redhat.com> wrote:
> 
> This system call takes 3 arguments:
> fd      - file descriptor of the directory to be read ahead
> *offset - offset in the directory from which to resume. This is
>           updated as we move along in the directory
> count   - the maximum number of entries to read ahead
> 
> The syscall reads up to 'count' entries starting at '*offset' and
> caches the inodes corresponding to those entries. It returns a
> negative error code, or a positive number indicating the number of
> inodes it has issued readaheads for. It also updates the '*offset'
> value so that repeated calls to dirreadahead can resume at the right
> location. It returns 0 when there are no more entries left.
> 
> Abhi Das (2):
>  fs: Add dirreadahead syscall and VFS hooks
>  gfs2: GFS2's implementation of the dir_readahead file operation
> 
> arch/x86/syscalls/syscall_32.tbl |   1 +
> arch/x86/syscalls/syscall_64.tbl |   1 +
> fs/gfs2/Makefile                 |   3 +-
> fs/gfs2/dir.c                    |  49 ++++++---
> fs/gfs2/dir.h                    |  15 +++
> fs/gfs2/dir_readahead.c          | 209 +++++++++++++++++++++++++++++++++++++++
> fs/gfs2/file.c                   |   2 +
> fs/gfs2/main.c                   |  10 +-
> fs/gfs2/super.c                  |   1 +
> fs/readdir.c                     |  49 +++++++++
> include/linux/fs.h               |   3 +
> 11 files changed, 328 insertions(+), 15 deletions(-)
> create mode 100644 fs/gfs2/dir_readahead.c
> 
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-26  5:27   ` [Cluster-devel] " Andreas Dilger
@ 2014-07-28 12:52     ` Abhijith Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhijith Das @ 2014-07-28 12:52 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-kernel, linux-fsdevel, cluster-devel



----- Original Message -----
> From: "Andreas Dilger" <adilger@dilger.ca>
> To: "Abhi Das" <adas@redhat.com>
> Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, cluster-devel@redhat.com
> Sent: Saturday, July 26, 2014 12:27:19 AM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> Is there a time when this doesn't get called to prefetch entries in
> readdir() order?  It isn't clear to me what benefit there is of returning
> the entries to userspace instead of just doing the statahead implicitly
> in the kernel?
> 
> The Lustre client has had what we call "statahead" for a while,
> and similar to regular file readahead it detects the sequential access
> pattern for readdir() + stat() in readdir() order (taking into account if
> ".*"
> entries are being processed or not) and starts fetching the inode
> attributes asynchronously with a worker thread.

Does this heuristic work well in practice? In the use case we were trying to
address, a Samba server knows beforehand whether it is going to stat all the
inodes in a directory.

> 
> This syscall might be more useful if userspace called readdir() to get
> the dirents and then passed the kernel the list of inode numbers
> to prefetch before starting on the stat() calls. That way, userspace
> could generate an arbitrary list of inodes (e.g. names matching a
> regexp) and the kernel doesn't need to guess if every inode is needed.

Were you thinking of arbitrary inodes across the filesystem or just a subset
from a directory? Arbitrary inodes may potentially throw up locking issues.
But yeah, as Steve mentioned in a previous email, limiting the inode
readahead in some fashion other than a range in readdir() order is something
that we are thinking of (a list of inodes based on regexps, filenames, etc.).
We just chose to do an offset range of the directory for a quick, early
implementation.

> 
> As it stands, this syscall doesn't help in anything other than readdir
> order (or of the directory is small enough to be handled in one
> syscall), which could be handled by the kernel internally already,
> and it may fetch a considerable number of extra inodes from
> disk if not every inode needs to be touched.

The need for this syscall came up from a specific use case - Samba. I'm told
that Windows clients like to stat every file in a directory as soon as it is
read in, and this has been a slow operation.
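
For reference, the access pattern in question is essentially the classic
readdir()+stat() walk below (simplified, minimal error handling). The idea
is that a dirreadahead() call issued ahead of this loop lets the stat()s
hit cache instead of faulting inodes in from disk one at a time:

#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* Stat every entry of 'path' in readdir() order, the way a Samba-like
 * server does when a Windows client lists a directory. */
static void list_dir(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *d;

	if (!dir)
		return;
	while ((d = readdir(dir)) != NULL) {
		struct stat st;

		/* Without readahead, each uncached inode here costs a
		 * synchronous disk (or DLM) round trip. */
		if (fstatat(dirfd(dir), d->d_name, &st, AT_SYMLINK_NOFOLLOW) == 0)
			printf("%s %lld\n", d->d_name, (long long)st.st_size);
	}
	closedir(dir);
}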

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-28 12:52     ` [Cluster-devel] " Abhijith Das
@ 2014-07-28 21:19       ` Andreas Dilger
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2014-07-28 21:19 UTC (permalink / raw)
  To: Abhijith Das; +Cc: LKML, linux-fsdevel, cluster-devel


On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
>> Is there a time when this doesn't get called to prefetch entries in
>> readdir() order?  It isn't clear to me what benefit there is of returning
>> the entries to userspace instead of just doing the statahead implicitly
>> in the kernel?
>> 
>> The Lustre client has had what we call "statahead" for a while,
>> and similar to regular file readahead it detects the sequential access
>> pattern for readdir() + stat() in readdir() order (taking into account if
>> ".*"
>> entries are being processed or not) and starts fetching the inode
>> attributes asynchronously with a worker thread.
> 
> Does this heuristic work well in practice? In the use case we were trying to
> address, a Samba server is aware beforehand if it is going to stat all the
> inodes in a directory.

Typically this works well for us, because this is done by the Lustre
client, so the statahead is hiding the network latency of the RPCs to
fetch attributes from the server.  I imagine the same could be seen with
GFS2. I don't know if this approach would help very much for local
filesystems because the latency is low.

>> This syscall might be more useful if userspace called readdir() to get
>> the dirents and then passed the kernel the list of inode numbers
>> to prefetch before starting on the stat() calls. That way, userspace
>> could generate an arbitrary list of inodes (e.g. names matching a
>> regexp) and the kernel doesn't need to guess if every inode is needed.
> 
> Were you thinking arbitrary inodes across the filesystem or just a subset
> from a directory? Arbitrary inodes may potentially throw up locking issues.

I was thinking about inodes returned from readdir(), but the syscall
would be much more useful if it could handle arbitrary inodes.  For
example, if directories are small then it may be more efficient to
aggregate inodes from multiple directories for each prefetch syscall.
I can't really think of any locking issues that could exist with
"arbitrary list of inodes" that couldn't be hit by having a directory
with hard links to the same list of inodes, so this is something that
needs to be handled by the filesystem anyway.

Since this would be an advisory syscall (i.e. it doesn't really
return anything and cannot guarantee that all the inodes will be in
memory), if the filesystem is having trouble prefetching the
inodes (e.g. invalid inode number(s) or lock ordering or contention
issues) it could always bail out and leave it to stat() to actually
fetch the inodes into memory when accessed.

There is no way it would be sane to keep inodes locked in the kernel
after prefetch, in case the "stat" never arrives, so the best it can
do is cache the inodes in memory (on the client for a network fs), and
it is possible that cache pressure or lock contention drops them from cache.

There are always going to be race conditions that need to be handled, even
if limited to a single directory (e.g. another client modifies the inode
after calling dirreadahead(), but before calling stat()).

I think there are a lot of benefits that could be had by the generic
syscall, possibly similar to what XFS is doing with the "bulkstat"
interfaces that Dave always mentions.  This would be much more so for
cases where you don't want to stat all of the entries in a directory.
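
(For reference, the XFS interface in question is the XFS_IOC_FSBULKSTAT
ioctl. Very roughly, and from memory of the xfsprogs headers, so the exact
struct and field names should be checked against <xfs/xfs_fs.h>:)

	/* fs_fd is any open fd on the XFS filesystem;
	 * needs <xfs/xfs_fs.h> and <sys/ioctl.h>. */
	struct xfs_bstat buf[64];
	__u64 lastino = 0;
	__s32 ocount = 0;
	struct xfs_fsop_bulkreq req = {
		.lastip  = &lastino,	/* resume cookie, updated by the kernel */
		.icount  = 64,		/* max entries per call */
		.ubuffer = buf,
		.ocount  = &ocount,	/* entries actually returned */
	};

	while (ioctl(fs_fd, XFS_IOC_FSBULKSTAT, &req) == 0 && ocount > 0) {
		/* buf[0..ocount-1] now hold stat data keyed by inode number */
	}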

> But yeah, as Steve mentioned in a previous email,  limiting the inodes
> readahead in some fashion other than a range in readdir() order is
> something that we are thinking of (list of inodes based on regexps,
> filenames etc). We just chose to do an offset range of the directory
> for a quick, early implementation.



>> As it stands, this syscall doesn't help with anything other than readdir
>> order (or if the directory is small enough to be handled in one
>> syscall), which could be handled by the kernel internally already,
>> and it may fetch a considerable number of extra inodes from
>> disk if not every inode needs to be touched.
> 
> The need for this syscall came up from a specific use case - Samba.
> I'm told that Windows clients like to stat every file in a directory
> as soon as it is read in and this has been a slow operation.

Cheers, Andreas







^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-25 17:37 ` Abhi Das
@ 2014-07-29  8:19   ` Michael Kerrisk
  -1 siblings, 0 replies; 49+ messages in thread
From: Michael Kerrisk @ 2014-07-29  8:19 UTC (permalink / raw)
  To: Abhi Das; +Cc: Linux Kernel, Linux-Fsdevel, cluster-devel, Linux API

On Fri, Jul 25, 2014 at 7:37 PM, Abhi Das <adas@redhat.com> wrote:
> This system call takes 3 arguments:
> fd      - file descriptor of the directory being readahead
> *offset - offset in dir from which to resume. This is updated
>           as we move along in the directory
> count   - The max number of entries to readahead
>
> The syscall is supposed to read upto 'count' entries starting at
> '*offset' and cache the inodes corresponding to those entries. It
> returns a negative error code or a positive number indicating
> the number of inodes it has issued readaheads for. It also
> updates the '*offset' value so that repeated calls to dirreadahead
> can resume at the right location. Returns 0 when there are no more
> entries left.

Hello Abhi,

As per Documentation/SubmitChecklist, please CC linux-api on patches
that change the kernel-user-space API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html for more
details.)

Cheers,

Michael


> Abhi Das (2):
>   fs: Add dirreadahead syscall and VFS hooks
>   gfs2: GFS2's implementation of the dir_readahead file operation
>
>  arch/x86/syscalls/syscall_32.tbl |   1 +
>  arch/x86/syscalls/syscall_64.tbl |   1 +
>  fs/gfs2/Makefile                 |   3 +-
>  fs/gfs2/dir.c                    |  49 ++++++---
>  fs/gfs2/dir.h                    |  15 +++
>  fs/gfs2/dir_readahead.c          | 209 +++++++++++++++++++++++++++++++++++++++
>  fs/gfs2/file.c                   |   2 +
>  fs/gfs2/main.c                   |  10 +-
>  fs/gfs2/super.c                  |   1 +
>  fs/readdir.c                     |  49 +++++++++
>  include/linux/fs.h               |   3 +
>  11 files changed, 328 insertions(+), 15 deletions(-)
>  create mode 100644 fs/gfs2/dir_readahead.c
>
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks
@ 2014-07-29  8:21     ` Michael Kerrisk
  0 siblings, 0 replies; 49+ messages in thread
From: Michael Kerrisk @ 2014-07-29  8:21 UTC (permalink / raw)
  To: Abhi Das; +Cc: Linux Kernel, Linux-Fsdevel, cluster-devel, Linux API

[CC+=linux-api]

On Fri, Jul 25, 2014 at 7:37 PM, Abhi Das <adas@redhat.com> wrote:
> Also adds a void *opaque field to struct dir_context that can be
> used by filesystems to temporarily store any context as this
> struct gets passed around in the fs.
>
> Signed-off-by: Abhi Das <adas@redhat.com>
> ---
>  arch/x86/syscalls/syscall_32.tbl |  1 +
>  arch/x86/syscalls/syscall_64.tbl |  1 +
>  fs/readdir.c                     | 49 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/fs.h               |  3 +++
>  4 files changed, 54 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..3e0ef85 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351    i386    sched_setattr           sys_sched_setattr
>  352    i386    sched_getattr           sys_sched_getattr
>  353    i386    renameat2               sys_renameat2
> +354    i386    dirreadahead            sys_dirreadahead
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..2ec0991 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314    common  sched_setattr           sys_sched_setattr
>  315    common  sched_getattr           sys_sched_getattr
>  316    common  renameat2               sys_renameat2
> +317    common  dirreadahead            sys_dirreadahead
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 33fd922..d216db7 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -198,6 +198,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd,
>         struct linux_dirent __user * lastdirent;
>         struct getdents_callback buf = {
>                 .ctx.actor = filldir,
> +               .ctx.opaque = NULL,
>                 .count = count,
>                 .current_dir = dirent
>         };
> @@ -278,6 +279,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
>         struct linux_dirent64 __user * lastdirent;
>         struct getdents_callback64 buf = {
>                 .ctx.actor = filldir64,
> +               .ctx.opaque = NULL,
>                 .count = count,
>                 .current_dir = dirent
>         };
> @@ -304,3 +306,50 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
>         fdput(f);
>         return error;
>  }
> +
> +SYSCALL_DEFINE3(dirreadahead, unsigned int, fd,
> +               loff_t __user *, offset, unsigned int, count)
> +{
> +       struct fd f;
> +       struct inode *inode;
> +       int error = -ENOTDIR;
> +       loff_t off = 0;
> +       struct dir_context ctx = {.actor = NULL, .opaque = NULL};
> +
> +       if (!count)
> +               return -EINVAL;
> +
> +       f = fdget(fd);
> +       if (!f.file)
> +               return -EBADF;
> +
> +       inode = f.file->f_path.dentry->d_inode;
> +
> +       error = -ENOTSUPP;
> +       if (!f.file->f_op || !f.file->f_op->dir_readahead)
> +               goto out;
> +
> +       error = security_file_permission(f.file, MAY_READ);
> +       if (error)
> +               goto out;
> +
> +       error = -EFAULT;
> +       if (__get_user(ctx.pos, offset))
> +               goto out;
> +
> +       error = mutex_lock_killable(&inode->i_mutex);
> +       if (error)
> +               goto out;
> +
> +       error = -ENOENT;
> +       if (!IS_DEADDIR(inode)) {
> +               error = f.file->f_op->dir_readahead(f.file, &ctx, count);
> +               if (__put_user(ctx.pos, offset))
> +                       error = -EFAULT;
> +               file_accessed(f.file);
> +       }
> +       mutex_unlock(&inode->i_mutex);
> +out:
> +       fdput(f);
> +       return error;
> +}
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 338e6f7..fae4a6e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1438,9 +1438,11 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
>   * to have different dirent layouts depending on the binary type.
>   */
>  typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
> +
>  struct dir_context {
>         const filldir_t actor;
>         loff_t pos;
> +       void *opaque;
>  };
>
>  struct block_device_operations;
> @@ -1463,6 +1465,7 @@ struct file_operations {
>         ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
>         ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
>         int (*iterate) (struct file *, struct dir_context *);
> +       int (*dir_readahead) (struct file *, struct dir_context *, unsigned int);
>         unsigned int (*poll) (struct file *, struct poll_table_struct *);
>         long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
>         long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
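
For what it's worth, with no libc wrapper the call above would presumably be
exercised from userspace roughly like this (sketch only; the syscall number
is the x86_64 one from the table hunk in this patch):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_dirreadahead
#define __NR_dirreadahead 317	/* x86_64 number added by this patch */
#endif

/* Issue readahead for up to 'count' entries per call, resuming at 'off',
 * until the directory is exhausted (return value 0) or an error occurs. */
static void dir_readahead_all(int fd, unsigned int count)
{
	long long off = 0;	/* the kernel side takes a loff_t __user * */
	long ret;

	while ((ret = syscall(__NR_dirreadahead, fd, &off, count)) > 0)
		fprintf(stderr, "issued readahead for %ld inodes\n", ret);
	if (ret < 0)
		perror("dirreadahead");
}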



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call
  2014-07-28 21:19       ` [Cluster-devel] " Andreas Dilger
@ 2014-07-29  9:36         ` Steven Whitehouse
  -1 siblings, 0 replies; 49+ messages in thread
From: Steven Whitehouse @ 2014-07-29  9:36 UTC (permalink / raw)
  To: Andreas Dilger, Abhijith Das; +Cc: linux-fsdevel, cluster-devel, LKML

Hi,

On 28/07/14 22:19, Andreas Dilger wrote:
> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
>>> Is there a time when this doesn't get called to prefetch entries in
>>> readdir() order?  It isn't clear to me what benefit there is of returning
>>> the entries to userspace instead of just doing the statahead implicitly
>>> in the kernel?
>>>
>>> The Lustre client has had what we call "statahead" for a while,
>>> and similar to regular file readahead it detects the sequential access
>>> pattern for readdir() + stat() in readdir() order (taking into account if
>>> ".*"
>>> entries are being processed or not) and starts fetching the inode
>>> attributes asynchronously with a worker thread.
>> Does this heuristic work well in practice? In the use case we were trying to
>> address, a Samba server is aware beforehand if it is going to stat all the
>> inodes in a directory.
> Typically this works well for us, because this is done by the Lustre
> client, so the statahead is hiding the network latency of the RPCs to
> fetch attributes from the server.  I imagine the same could be seen with
> GFS2. I don't know if this approach would help very much for local
> filesystems because the latency is low.
Well GFS2 is more or less a local fs in the case of no contention. There
are issues that arise in the contention case, which is why it is very
important to ensure that there is as close to a 100% chance as possible of
using the inodes which are pre-fetched. That's why we've not yet tried
out algorithms that try to guess when the inodes will be required, since
one wrong guess can be quite costly in terms of lock bouncing between
nodes. However, it does seem that the suggestions made in this thread
about using ->lookup in the directory as one way to gather information
may be a good way to go, since it will only be called for uncached
inodes, and in addition it would be equally useful to any possible
operation which may follow the lookup.

>>> This syscall might be more useful if userspace called readdir() to get
>>> the dirents and then passed the kernel the list of inode numbers
>>> to prefetch before starting on the stat() calls. That way, userspace
>>> could generate an arbitrary list of inodes (e.g. names matching a
>>> regexp) and the kernel doesn't need to guess if every inode is needed.
>> Were you thinking arbitrary inodes across the filesystem or just a subset
>> from a directory? Arbitrary inodes may potentially throw up locking issues.
> I was thinking about inodes returned from readdir(), but the syscall
> would be much more useful if it could handle arbitrary inodes.  For
> example, if directories are small then it may be more efficient to
> aggregate inodes from multiple directories for each prefetch syscall.
> I can't really think of any locking issues that could exist with
> "arbitrary list of inodes" that couldn't be hit by having a directory
> with hard links to the same list of inodes, so this is something that
> needs to be handled by the filesystem anyway.
>
> Since this would be an advisory syscall (i.e. it doesn't really
> return anything and cannot guarantee that all the inodes will be in
> memory), then if the filesystem is having trouble prefetching the
> inodes (e.g. invalid inode number(s) or lock ordering or contention
> issues) it could always bail out and leave it to stat() to actually
> fetch the inodes into memory when accessed.
>
> There is no way it would be sane to keep inodes locked in the kernel
> after prefetch, in case the "stat" never arrives, so the best it can
> do is cache the inodes in memory (on the client for network fs), and
> it is possible cache pressure or lock contention drops them from cache.
>
> There are always going to be race conditions even if limited to a
> single directory (e.g. another client modifies the inode after calling
> dirreadahead(), but before calling stat()) that need to be handled.
>
> I think there are a lot of benefits that could be had by the generic
> syscall, possibly similar to what XFS is doing with the "bulkstat"
> interfaces that Dave always mentions.  This would be much more so for
> cases where you don't want to stat all of the entries in a directory.
>
Yes, it probably would be possible to do a kind of bulk lookup like
that, although it would also be more complicated locking-wise if we
allowed arbitrary inode numbers, rather than doing it on a per-directory
basis. I can see arguments for both approaches, though.

Steve.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-25 17:37 ` Abhi Das
@ 2014-07-31  3:18   ` NeilBrown
  -1 siblings, 0 replies; 49+ messages in thread
From: NeilBrown @ 2014-07-31  3:18 UTC (permalink / raw)
  To: Abhi Das; +Cc: linux-kernel, linux-fsdevel, cluster-devel


On Fri, 25 Jul 2014 12:37:29 -0500 Abhi Das <adas@redhat.com> wrote:

> This system call takes 3 arguments:
> fd      - file descriptor of the directory being readahead
> *offset - offset in dir from which to resume. This is updated
>           as we move along in the directory
> count   - The max number of entries to readahead
> 
> The syscall is supposed to read upto 'count' entries starting at
> '*offset' and cache the inodes corresponding to those entries. It
> returns a negative error code or a positive number indicating
> the number of inodes it has issued readaheads for. It also
> updates the '*offset' value so that repeated calls to dirreadahead
> can resume at the right location. Returns 0 when there are no more
> entries left.

Hi Abhi,

 I like the idea of enhanced read-ahead on a directory.
 It isn't clear to me why you have included these particular fields in the
 interface though.

 - Why have an 'offset'?  Why not just use the current offset of the
   directory 'fd'?
 - Why have a count?  How would a program choose what count to give?

 Maybe you imagine using 'getdents' first to get a list of names, then
 selectively calling 'dirreadahead' on the offsets of the names you are
 interested in?  That would be racy, as names can be added and removed,
 which might change offsets.  So maybe you have another reason?

 I would like to suggest an alternate interface (I love playing the API
 game....).

 1/ Add a flag to 'fstatat'  AT_EXPECT_MORE.
    If the pathname does not contain a '/', then the 'dirfd' is marked
    to indicate that stat information for all names returned by getdents will
    be wanted.  The filesystem can choose to optimise that however it sees
    fit.

 2/ Add a flag to 'fstatat'  AT_NONBLOCK.
    This tells the filesystem that you want this information, so if it can
    return it immediately it should, and if not it should start pulling it
    into cache.  Possibly this should be two flags: AT_NONBLOCK just avoids
    any IO, and AT_ASYNC instigates IO even if NONBLOCK is set.

 Then an "ls -l" could use AT_EXPECT_MORE and then just stat each name.
 An "ls -l *.c", might avoid AT_EXPECT_MORE, but would use AT_NONBLOCK
 against all names, then try again with all the names that returned
 EWOULDBLOCK the first time.
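
 In rough pseudo-code (AT_EXPECT_MORE and AT_NONBLOCK are hypothetical flags,
 and print_entry()/remember_for_retry() are placeholder helpers), the two
 cases might look like:

	/* Plain "ls -l": mark the dirfd so the fs prefetches attributes
	 * for everything getdents() will return (hypothetical flag). */
	fstatat(dirfd(dir), ".", &st, AT_EXPECT_MORE);

	/* "ls -l *.c": non-blocking first pass that kicks off IO for
	 * cold inodes, then a retry pass for the EWOULDBLOCK ones. */
	while ((d = readdir(dir)) != NULL) {
		if (fnmatch("*.c", d->d_name, 0))
			continue;
		if (fstatat(dirfd(dir), d->d_name, &st, AT_NONBLOCK) == 0)
			print_entry(d->d_name, &st);
		else if (errno == EWOULDBLOCK)
			remember_for_retry(d->d_name);
	}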


 I would really like to see the 'xstat' syscall too, but there is no point
 having both "xstat" and "fxstat".  Follow the model of "fstatat" and provide
 just "fxstatat" which can do both.
 With fxstatat, AT_EXPECT_MORE would tell the dirfd exactly which attributes
 would be wanted so it can fetch only that which is desired.

 I'm not very keen on the xgetdents idea of including name information and
 stat information in one syscall - I would prefer getdents and xstat be
 kept separate.  Of course, if a genuine performance cost of the separation
 can be demonstrated, I could well change my mind.

 It does, however, have the advantage that the kernel doesn't need to worry
 about how long read-ahead data needs to be kept, and the application doesn't
 need to worry about how soon to retry an fstatat which failed with
 EWOULDBLOCK.

Thanks for raising this issue again.  I hope it gets fixed one day...

NeilBrown


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks
@ 2014-07-31  3:31       ` Dave Chinner
  0 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2014-07-31  3:31 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Abhi Das, Linux Kernel, Linux-Fsdevel, cluster-devel, Linux API

On Tue, Jul 29, 2014 at 10:21:50AM +0200, Michael Kerrisk wrote:
> [CC+=linux-api]
> 
> On Fri, Jul 25, 2014 at 7:37 PM, Abhi Das <adas@redhat.com> wrote:
> > Also adds a void *opaque field to struct dir_context that can be
> > used by filesystems to temporarily store any context as this
> > struct gets passed around in the fs.

So the prototype is:

int dir_readahead(int fd, off64_t offset, unsigned int count);

Why do we need a new syscall for this?

$ man 2 readahead
....
ssize_t readahead(int fd, off64_t offset, size_t count);
....
	EINVAL fd does not refer to a file type to which readahead() can be applied.
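
For reference, this is roughly how the existing readahead(2) call is used on
a regular file today; the question raised above is whether the same call
shape could simply be taught to accept a directory fd rather than adding a
new syscall. Only standard glibc interfaces are used here.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

static int prefetch(const char *path, off64_t offset, size_t count)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return -1;
	}
	if (readahead(fd, offset, count) < 0)
		perror("readahead");	/* EINVAL for unsupported file types */
	close(fd);
	return 0;
}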


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-28 21:19       ` [Cluster-devel] " Andreas Dilger
@ 2014-07-31  4:49         ` Dave Chinner
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2014-07-31  4:49 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Abhijith Das, LKML, linux-fsdevel, cluster-devel

On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
> > On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
> >> Is there a time when this doesn't get called to prefetch entries in
> >> readdir() order?  It isn't clear to me what benefit there is of returning
> >> the entries to userspace instead of just doing the statahead implicitly
> >> in the kernel?
> >> 
> >> The Lustre client has had what we call "statahead" for a while,
> >> and similar to regular file readahead it detects the sequential access
> >> pattern for readdir() + stat() in readdir() order (taking into account if
> >> ".*"
> >> entries are being processed or not) and starts fetching the inode
> >> attributes asynchronously with a worker thread.
> > 
> > Does this heuristic work well in practice? In the use case we were trying to
> > address, a Samba server is aware beforehand if it is going to stat all the
> > inodes in a directory.
> 
> Typically this works well for us, because this is done by the Lustre
> client, so the statahead is hiding the network latency of the RPCs to
> fetch attributes from the server.  I imagine the same could be seen with
> GFS2. I don't know if this approach would help very much for local
> filesystems because the latency is low.
> 
> >> This syscall might be more useful if userspace called readdir() to get
> >> the dirents and then passed the kernel the list of inode numbers
> >> to prefetch before starting on the stat() calls. That way, userspace
> >> could generate an arbitrary list of inodes (e.g. names matching a
> >> regexp) and the kernel doesn't need to guess if every inode is needed.
> > 
> > Were you thinking arbitrary inodes across the filesystem or just a subset
> > from a directory? Arbitrary inodes may potentially throw up locking issues.
> 
> I was thinking about inodes returned from readdir(), but the syscall
> would be much more useful if it could handle arbitrary inodes.

I'm not sure we can do that. The only way to safely identify a
specific inode in the filesystem from userspace is via a filehandle.
Plain inode numbers are susceptible to TOCTOU race conditions that
the kernel cannot resolve. Also, lookup by inode number bypasses
directory access permissions, so is not something we would expose
to arbitrary unprivileged users.
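
As an aside, the filehandle route mentioned here looks roughly like this in
userspace with the existing name_to_handle_at()/open_by_handle_at() calls;
note that open_by_handle_at() requires CAP_DAC_READ_SEARCH, which is exactly
the privilege point being made. A sketch, not a complete program.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef MAX_HANDLE_SZ
#define MAX_HANDLE_SZ 128	/* assumed upper bound if the header lacks it */
#endif

static struct file_handle *handle_for(int dirfd, const char *name, int *mount_id)
{
	struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);

	if (!fh)
		return NULL;
	fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(dirfd, name, fh, mount_id, 0) < 0) {
		perror("name_to_handle_at");
		free(fh);
		return NULL;
	}
	/* later: open_by_handle_at(mount_fd, fh, O_RDONLY); expect ESTALE
	 * if the inode has been unlinked and reused in the meantime */
	return fh;
}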

> There are always going to be race conditions even if limited to a
> single directory (e.g. another client modifies the inode after calling
> dirreadahead(), but before calling stat()) that need to be handled.

Unlink/realloc to a different directory with different access
permissions is the big issue.

> I think there are a lot of benefits that could be had by the generic
> syscall, possibly similar to what XFS is doing with the "bulkstat"
> interfaces that Dave always mentions.  This would be much more so for
> cases where you don't want to stat all of the entries in a directory.

Bulkstat is not really suited to this - it gets its speed
specifically by avoiding directory traversal to find inodes. Hence
it is a root-only operation because it bypasses directory based
access restrictions and hence is really only useful for
full-filesystem traversal operations like backups, defragmentation,
etc.

Bulkstat output also contains enough information to construct a
valid file handle in userspace and so access to inodes found via
bulkstat can be gained via the XFS open-by-handle interfaces. Again,
this bypasses permissions checking and hence is a root-only
operation. It does, however, avoid TOCTOU races because the open-by-handle
will fail if the inode is unlinked and reallocated between the
bulkstat call and the open-by-handle as the generation number in the
handle will no longer match that of the inode.

Cheers,

Dave.



-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-31  4:49         ` [Cluster-devel] " Dave Chinner
@ 2014-07-31 11:19           ` Andreas Dilger
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2014-07-31 11:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Abhijith Das, LKML, linux-fsdevel, cluster-devel

On Jul 31, 2014, at 6:49, Dave Chinner <david@fromorbit.com> wrote:
> 
>> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
>>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
>>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
>>>> Is there a time when this doesn't get called to prefetch entries in
>>>> readdir() order?  It isn't clear to me what benefit there is of returning
>>>> the entries to userspace instead of just doing the statahead implicitly
>>>> in the kernel?
>>>> 
>>>> The Lustre client has had what we call "statahead" for a while,
>>>> and similar to regular file readahead it detects the sequential access
>>>> pattern for readdir() + stat() in readdir() order (taking into account if
>>>> ".*"
>>>> entries are being processed or not) and starts fetching the inode
>>>> attributes asynchronously with a worker thread.
>>> 
>>> Does this heuristic work well in practice? In the use case we were trying to
>>> address, a Samba server is aware beforehand if it is going to stat all the
>>> inodes in a directory.
>> 
>> Typically this works well for us, because this is done by the Lustre
>> client, so the statahead is hiding the network latency of the RPCs to
>> fetch attributes from the server.  I imagine the same could be seen with
>> GFS2. I don't know if this approach would help very much for local
>> filesystems because the latency is low.
>> 
>>>> This syscall might be more useful if userspace called readdir() to get
>>>> the dirents and then passed the kernel the list of inode numbers
>>>> to prefetch before starting on the stat() calls. That way, userspace
>>>> could generate an arbitrary list of inodes (e.g. names matching a
>>>> regexp) and the kernel doesn't need to guess if every inode is needed.
>>> 
>>> Were you thinking arbitrary inodes across the filesystem or just a subset
>>> from a directory? Arbitrary inodes may potentially throw up locking issues.
>> 
>> I was thinking about inodes returned from readdir(), but the syscall
>> would be much more useful if it could handle arbitrary inodes.
> 
> I'm not sure we can do that. The only way to safely identify a
> specific inode in the filesystem from userspace is via a filehandle.
> Plain inode numbers are susceptible to TOCTOU race conditions that
> the kernel cannot resolve. Also, lookup by inode number bypasses
> directory access permissions, so is not something we would expose
> to arbitrary unprivileged users.

None of these issues are relevant in the API that I'm thinking about. 
The syscall just passes the list of inode numbers to be prefetched
into kernel memory, and then stat() is used to actually get the data into
userspace (or whatever other operation is to be done on them),
so there is no danger if the wrong inode is prefetched.  If the inode
number is bad the filesystem can just ignore it. 
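
Nothing like this exists in the kernel today; the following is only a sketch
of the interface being described, with inode_readahead() as an invented
prototype. Inode numbers are gathered from readdir()/getdents() and handed
to the kernel in bulk, and the real data is still fetched later with
ordinary stat().

#include <dirent.h>
#include <stdint.h>
#include <stddef.h>

/* invented prototype, for discussion only */
extern int inode_readahead(int dirfd, const uint64_t *inums, size_t count);

static void prefetch_dir_inodes(DIR *dir, uint64_t *buf, size_t max)
{
	struct dirent *de;
	size_t n = 0;

	while (n < max && (de = readdir(dir)) != NULL)
		buf[n++] = de->d_ino;	/* how far these can be trusted is the debate */
	if (n)
		inode_readahead(dirfd(dir), buf, n);
}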

>> There are always going to be race conditions even if limited to a
>> single directory (e.g. another client modifies the inode after calling
>> dirreadahead(), but before calling stat()) that need to be handled.
> 
> Unlink/realloc to a different directory with different access
> permissions is the big issue.
> 
>> I think there are a lot of benefits that could be had by the generic
>> syscall, possibly similar to what XFS is doing with the "bulkstat"
>> interfaces that Dave always mentions.  This would be much more so for
>> cases where you don't want to stat all of the entries in a directory.
> 
> Bulkstat is not really suited to this - it gets its speed
> specifically by avoiding directory traversal to find inodes. Hence
> it is a root-only operation because it bypasses directory based
> access restrictions and hence is really only useful for
> full-filesystem traversal operations like backups, defragmentation,
> etc.

Maybe it was a mistake to mention bulkstat in this discussion. The
stat() information would not actually be returned to userspace in my
proposal, just the inodes would be loaded into kernel memory in bulk so
that the later stat() (or unlink() or whatever) can go quickly.

Cheers, Andreas

> Bulkstat output also contains enough information to construct a
> valid file handle in userspace and so access to inodes found via
> bulkstat can be gained via the XFS open-by-handle interfaces. Again,
> this bypasses permissions checking and hence is a root-only
> operation. It does, however, avoid TOCTOU races because the open-by-handle
> will fail if the inode is unlinked and reallocated between the
> bulkstat call and the open-by-handle as the generation number in the
> handle will no longer match that of the inode.
> 
> Cheers,
> 
> Dave.
> 
> 
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-31 11:19           ` [Cluster-devel] " Andreas Dilger
@ 2014-07-31 23:53             ` Dave Chinner
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2014-07-31 23:53 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Abhijith Das, LKML, linux-fsdevel, cluster-devel

On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> On Jul 31, 2014, at 6:49, Dave Chinner <david@fromorbit.com> wrote:
> > 
> >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
> >>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
> >>>> Is there a time when this doesn't get called to prefetch entries in
> >>>> readdir() order?  It isn't clear to me what benefit there is of returning
> >>>> the entries to userspace instead of just doing the statahead implicitly
> >>>> in the kernel?
> >>>> 
> >>>> The Lustre client has had what we call "statahead" for a while,
> >>>> and similar to regular file readahead it detects the sequential access
> >>>> pattern for readdir() + stat() in readdir() order (taking into account if
> >>>> ".*"
> >>>> entries are being processed or not) and starts fetching the inode
> >>>> attributes asynchronously with a worker thread.
> >>> 
> >>> Does this heuristic work well in practice? In the use case we were trying to
> >>> address, a Samba server is aware beforehand if it is going to stat all the
> >>> inodes in a directory.
> >> 
> >> Typically this works well for us, because this is done by the Lustre
> >> client, so the statahead is hiding the network latency of the RPCs to
> >> fetch attributes from the server.  I imagine the same could be seen with
> >> GFS2. I don't know if this approach would help very much for local
> >> filesystems because the latency is low.
> >> 
> >>>> This syscall might be more useful if userspace called readdir() to get
> >>>> the dirents and then passed the kernel the list of inode numbers
> >>>> to prefetch before starting on the stat() calls. That way, userspace
> >>>> could generate an arbitrary list of inodes (e.g. names matching a
> >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> >>> 
> >>> Were you thinking arbitrary inodes across the filesystem or just a subset
> >>> from a directory? Arbitrary inodes may potentially throw up locking issues.
> >> 
> >> I was thinking about inodes returned from readdir(), but the syscall
> >> would be much more useful if it could handle arbitrary inodes.
> > 
> > I'm not sure we can do that. The only way to safely identify a
> > specific inode in the filesystem from userspace is via a filehandle.
> > Plain inode numbers are susceptible to TOCTOU race conditions that
> > the kernel cannot resolve. Also, lookup by inode number bypasses
> > directory access permissions, so is not something we would expose
> > to arbitrary unprivileged users.
> 
> None of these issues are relevant in the API that I'm thinking about. 
> The syscall just passes the list of inode numbers to be prefetched
> into kernel memory, and then stat() is used to actually get the data into
> userspace (or whatever other operation is to be done on them),
> so there is no danger if the wrong inode is prefetched.  If the inode
> number is bad the filesystem can just ignore it. 

Which means the filesystem has to treat the inode number as
potentially hostile. i.e. it can not be trusted to be correct and so
must take slow paths to validate the inode numbers. This adds
*significant* overhead to the readahead path for some filesystems:
readahead is only a win if it is low cost.

For example, on XFS every untrusted inode number lookup requires an
inode btree lookup to validate the inode is actually valid on disk
and that it is allocated and has references. That lookup serialises
against inode allocation/freeing as well as other lookups. In
comparison, when using a trusted inode number from a directory
lookup within the kernel, we only need to do a couple of shift and
mask operations to convert it to a disk address and we are good to
go.
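
Schematically (these are not the real XFS macros, just an illustration of
the shift-and-mask point), the trusted-number case is plain arithmetic:

#include <stdint.h>

struct ino_geometry {
	unsigned int agino_bits;	/* bits of the inode number within an AG */
	unsigned int inopblock_bits;	/* log2(inodes per block) */
};

static inline uint32_t ino_to_agno(const struct ino_geometry *g, uint64_t ino)
{
	return (uint32_t)(ino >> g->agino_bits);		/* which allocation group */
}

static inline uint64_t ino_to_agbno(const struct ino_geometry *g, uint64_t ino)
{
	uint64_t agino = ino & ((1ULL << g->agino_bits) - 1);	/* inode within the AG */

	return agino >> g->inopblock_bits;			/* block within the AG */
}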

i.e. the difference is at least 5 orders of magnitude higher CPU usage
for an "inode number readahead" syscall versus a "directory
readahead" syscall, it has significant serialisation issues and it
can stall other modification/lookups going on at the same time.
That's *horrible behaviour* for a speculative readahead operation,
but because the inode numbers are untrusted, we can't avoid it.

So, again, it's way more overhead than userspace just calling
stat() asynchronously on many files at once as readdir/getdents
returns dirents from the kernel to speed up cache population.

That's my main issue with this patchset - it's implementing
something in kernelspace that can *easily* be done generically in
userspace without introducing all sorts of nasty corner cases that
we have to handle in the kernel. We only add functionality to the
kernel if there's a compelling reason to do it in kernelspace, and
right now I just don't see any numbers that justify adding
readdir+stat() readahead or inode number based cache population in
kernelspace.

Before we add *any* syscall for directory readahead, we need
comparison numbers against doing the dumb multithreaded
userspace readahead of stat() calls. If userspace can do this as
fast as the kernel can....
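
A rough cut of what that userspace comparison might look like, assuming the
name list has already been filled from readdir()/getdents(); only the
stat-issuing side is shown, and everything used here is standard POSIX plus
C11 atomics.

#include <pthread.h>
#include <stdatomic.h>
#include <sys/stat.h>
#include <fcntl.h>

struct ra_work {
	int dirfd;
	char **names;		/* entries gathered from readdir()/getdents() */
	int count;
	atomic_int next;	/* shared cursor */
};

static void *stat_worker(void *arg)
{
	struct ra_work *w = arg;
	struct stat st;
	int i;

	while ((i = atomic_fetch_add(&w->next, 1)) < w->count)
		fstatat(w->dirfd, w->names[i], &st, AT_SYMLINK_NOFOLLOW);
	return NULL;		/* results discarded: we only want cache population */
}

static void readahead_stats(struct ra_work *w, int nthreads)
{
	pthread_t tid[16];

	if (nthreads > 16)
		nthreads = 16;
	for (int t = 0; t < nthreads; t++)
		pthread_create(&tid[t], NULL, stat_worker, w);
	for (int t = 0; t < nthreads; t++)
		pthread_join(tid[t], NULL);
}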

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-31 23:53             ` [Cluster-devel] " Dave Chinner
@ 2014-08-01  2:11               ` Abhijith Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhijith Das @ 2014-08-01  2:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Andreas Dilger, LKML, linux-fsdevel, cluster-devel



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> To: "Andreas Dilger" <adilger@dilger.ca>
> Cc: "Abhijith Das" <adas@redhat.com>, "LKML" <linux-kernel@vger.kernel.org>, "linux-fsdevel"
> <linux-fsdevel@vger.kernel.org>, cluster-devel@redhat.com
> Sent: Thursday, July 31, 2014 6:53:06 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> > On Jul 31, 2014, at 6:49, Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
> > >>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
> > >>>> Is there a time when this doesn't get called to prefetch entries in
> > >>>> readdir() order?  It isn't clear to me what benefit there is of
> > >>>> returning
> > >>>> the entries to userspace instead of just doing the statahead
> > >>>> implicitly
> > >>>> in the kernel?
> > >>>> 
> > >>>> The Lustre client has had what we call "statahead" for a while,
> > >>>> and similar to regular file readahead it detects the sequential access
> > >>>> pattern for readdir() + stat() in readdir() order (taking into account
> > >>>> if
> > >>>> ".*"
> > >>>> entries are being processed or not) and starts fetching the inode
> > >>>> attributes asynchronously with a worker thread.
> > >>> 
> > >>> Does this heuristic work well in practice? In the use case we were
> > >>> trying to
> > >>> address, a Samba server is aware beforehand if it is going to stat all
> > >>> the
> > >>> inodes in a directory.
> > >> 
> > >> Typically this works well for us, because this is done by the Lustre
> > >> client, so the statahead is hiding the network latency of the RPCs to
> > >> fetch attributes from the server.  I imagine the same could be seen with
> > >> GFS2. I don't know if this approach would help very much for local
> > >> filesystems because the latency is low.
> > >> 
> > >>>> This syscall might be more useful if userspace called readdir() to get
> > >>>> the dirents and then passed the kernel the list of inode numbers
> > >>>> to prefetch before starting on the stat() calls. That way, userspace
> > >>>> could generate an arbitrary list of inodes (e.g. names matching a
> > >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> > >>> 
> > >>> Were you thinking arbitrary inodes across the filesystem or just a
> > >>> subset
> > >>> from a directory? Arbitrary inodes may potentially throw up locking
> > >>> issues.
> > >> 
> > >> I was thinking about inodes returned from readdir(), but the syscall
> > >> would be much more useful if it could handle arbitrary inodes.
> > > 
> > > I'm not sure we can do that. The only way to safely identify a
> > > specific inode in the filesystem from userspace is via a filehandle.
> > > Plain inode numbers are susceptible to TOCTOU race conditions that
> > > the kernel cannot resolve. Also, lookup by inode number bypasses
> > > directory access permissions, so is not something we would expose
> > > to arbitrary unprivileged users.
> > 
> > None of these issues are relevant in the API that I'm thinking about.
> > The syscall just passes the list of inode numbers to be prefetched
> > into kernel memory, and then stat() is used to actually get the data into
> > userspace (or whatever other operation is to be done on them),
> > so there is no danger if the wrong inode is prefetched.  If the inode
> > number is bad the filesystem can just ignore it.
> 
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
> 
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that it is allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
> 
> i.e. the difference is at least 5 orders of magnitude higher CPU usage
> for an "inode number readahead" syscall versus a "directory
> readahead" syscall, it has significant serialisation issues and it
> can stall other modification/lookups going on at the same time.
> That's *horrible behaviour* for a speculative readahead operation,
> but because the inode numbers are untrusted, we can't avoid it.
> 
> So, again, it's way more overhead than userspace just calling
> stat() asynchronously on many files at once as readdir/getdents
> returns dirents from the kernel to speed up cache population.
> 
> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the kernel if
> there's a
> compelling reason to do it in kernelspace, and right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.
> 
> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....
> 

This was next on my to-do list to see how an all-userspace solution compares
with doing this in the kernel. I'll post some results as soon as I can find
some time to play with this stuff again.

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-31  3:18   ` [Cluster-devel] " NeilBrown
@ 2014-08-01  2:21     ` Abhijith Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhijith Das @ 2014-08-01  2:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-kernel, linux-fsdevel, cluster-devel



----- Original Message -----
> From: "NeilBrown" <neilb@suse.de>
> To: "Abhi Das" <adas@redhat.com>
> Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, cluster-devel@redhat.com
> Sent: Wednesday, July 30, 2014 10:18:05 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> On Fri, 25 Jul 2014 12:37:29 -0500 Abhi Das <adas@redhat.com> wrote:
> 
> > This system call takes 3 arguments:
> > fd      - file descriptor of the directory being readahead
> > *offset - offset in dir from which to resume. This is updated
> >           as we move along in the directory
> > count   - The max number of entries to readahead
> > 
> > The syscall is supposed to read upto 'count' entries starting at
> > '*offset' and cache the inodes corresponding to those entries. It
> > returns a negative error code or a positive number indicating
> > the number of inodes it has issued readaheads for. It also
> > updates the '*offset' value so that repeated calls to dirreadahead
> > can resume at the right location. Returns 0 when there are no more
> > entries left.
> 
> Hi Abhi,
> 
>  I like the idea of enhanced read-ahead on a directory.
>  It isn't clear to me why you have included these particular fields in the
>  interface though.
> 
>  - why have an 'offset'?  Why not just use the current offset of the
>    directory 'fd'?

The idea was that we didn't want a syscall like readahead mucking with the
file pointer as the same fd might be used to do getdents().

>  - Why have a count?  How would a program choose what count to give?

If a program knows that it's only going to work on a subset of files at a time,
it can use the count value to read ahead only a small number of inodes at once
instead of reading ahead the entire directory.

That said, this interface is not set in stone and we are exploring ways to inform
the kernel of the inodes we are interested in reading ahead.
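
A sketch of that chunked usage, relying on the return convention quoted
earlier in the thread (positive = readaheads issued, 0 = end of directory,
negative = error) and on the kernel advancing *offset. __NR_dirreadahead is
a placeholder, since no syscall number has been allocated; the wrapper is
purely illustrative.

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_dirreadahead
#define __NR_dirreadahead -1	/* placeholder: no number has been allocated */
#endif

static long dirreadahead(int fd, uint64_t *offset, unsigned int count)
{
	return syscall(__NR_dirreadahead, fd, offset, count);
}

static void walk_in_batches(int dirfd, unsigned int batch)
{
	uint64_t off = 0;
	long n;

	while ((n = dirreadahead(dirfd, &off, batch)) > 0) {
		/* getdents()/stat() the next 'n' entries here, then loop */
	}
}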

> 
>  Maybe you imagine using 'getdents' first to get a list of names, then
>  selectively calling 'dirreadahead'  on the offsets of the names you are
>  interested it?  That would be racy as names can be added and removed which
>  might change offsets.  So maybe you have another reason?
> 
>  I would like to suggest an alternate interface (I love playing the API
>  game....).
> 
>  1/ Add a flag to 'fstatat'  AT_EXPECT_MORE.
>     If the pathname does not contain a '/', then the 'dirfd' is marked
>     to indicate that stat information for all names returned by getdents will
>     be wanted.  The filesystem can choose to optimise that however it sees
>     fit.
> 
>  2/ Add a flag to 'fstatat'  AT_NONBLOCK.
>     This tells the filesystem that you want this information, so if it can
>     return it immediately it should, and if not it should start pulling it
>     into cache.  Possibly this should be two flags: AT_NONBLOCK just avoids
>     any IO, and AT_ASYNC instigates IO even if NONBLOCK is set.
> 
>  Then an "ls -l" could use AT_EXPECT_MORE and then just stat each name.
>  An "ls -l *.c", might avoid AT_EXPECT_MORE, but would use AT_NONBLOCK
>  against all names, then try again with all the names that returned
>  EWOULDBLOCK the first time.
> 
> 
>  I would really like to see the 'xstat' syscall too, but there is no point
>  having both "xstat" and "fxstat".  Follow the model of "fstatat" and provide
>  just "fxstatat" which can do both.
>  With fxstatat, AT_EXPECT_MORE would tell the dirfd exactly which attributes
>  would be wanted so it can fetch only that which is desired.
> 
>  I'm not very keen on the xgetdents idea of including name information and
>  stat information into the one syscall - I would prefer getdents and xstat be
>  kept separate.  Of course if a genuine performance cost of the separation can
>  be demonstrated, I could well change my mind.
> 
>  It does, however, have the advantage that the kernel doesn't need to worry
>  about how long read-ahead data needs to be kept, and the application doesn't
>  need to worry about how soon to retry an fstatat which failed with
>  EWOULDBLOCK.
> 
> Thanks for raising this issue again.  I hope it gets fixed one day...
> 
> NeilBrown
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-31 23:53             ` [Cluster-devel] " Dave Chinner
@ 2014-08-01  5:54               ` Andreas Dilger
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2014-08-01  5:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Abhijith Das, LKML, linux-fsdevel, cluster-devel

On Aug 1, 2014, at 1:53, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
>> None of these issues are relevant in the API that I'm thinking about. 
>> The syscall just passes the list of inode numbers to be prefetched
>> into kernel memory, and then stat() is used to actually get the data into
>> userspace (or whatever other operation is to be done on them),
>> so there is no danger if the wrong inode is prefetched.  If the inode
>> number is bad the filesystem can just ignore it.
> 
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
> 
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that it is allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
> 
> i.e. the difference is at least 5 orders of magnitude higher CPU usage
> for an "inode number readahead" syscall versus a "directory
> readahead" syscall, it has significant serialisation issues and it
> can stall other modification/lookups going on at the same time.
> That's *horrible behaviour* for a speculative readahead operation,
> but because the inode numbers are untrusted, we can't avoid it.

For ext4 this is virtually free. The same needs to happen for any
inode number from NFS so it can't be that bad. 

Also, since this API would be prefetching inodes in bulk it
could presumably optimize this to some extent. 
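
Spelled out, the calling pattern under discussion is roughly the sketch below.
inode_prefetch() is a purely hypothetical stand-in for the proposed syscall
(no such call exists); it is only here to make the readdir() + bulk prefetch +
stat() sequence concrete.

#include <dirent.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical stand-in for the proposed syscall: a real implementation
 * would hand the inode numbers to the filesystem, which is free to
 * ignore any it cannot cheaply validate. */
static int inode_prefetch(int dirfd, const uint64_t *inums, unsigned int count)
{
	(void)dirfd; (void)inums; (void)count;
	return 0;
}

static void prefetch_then_stat(int dirfd, DIR *dir)
{
	uint64_t inums[1024];
	unsigned int n = 0;
	struct dirent *de;
	struct stat st;

	/* First sweep: collect inode numbers from readdir(). */
	while ((de = readdir(dir)) && n < 1024)
		inums[n++] = de->d_ino;

	/* Advisory bulk prefetch of those inodes (hypothetical call). */
	inode_prefetch(dirfd, inums, n);

	/* Second sweep: the stat() calls that actually want the data. */
	rewinddir(dir);
	while ((de = readdir(dir)))
		fstatat(dirfd, de->d_name, &st, AT_SYMLINK_NOFOLLOW);
}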

> So, again, it's way more overhead than userspace just calling
> stat() asynchronously on many files at once as readdir/getdents
> returns dirents from the kernel to speed up cache population.

To me this seems like saying it is just as fast to submit hundreds of
256-byte random reads for a file as it is for large linear reads with
readahead.  Yes, it is possible for the kernel to optimize the random
read workload to some extent, but not as easily as getting reads
in order in the first place. 

> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the kernel if there's a
> compelling reason to do it in kernelspace, and right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.

The original patch showed there was definitely a significant win with
the prefetch case over the single threaded readdir+stat. There is
also something to be said for keeping complexity out of applications.
Sure it is possible for apps to get good performance from AIO,
but very few do so because of complexity. 

> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....

I'd be interested to see this also, but my prediction is that this will not
deliver the kind of improvements you are expecting. 

Cheers, Andreas

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-28 21:19       ` [Cluster-devel] " Andreas Dilger
                         ` (2 preceding siblings ...)
  (?)
@ 2014-08-01  9:14       ` Thomas Martitz
  -1 siblings, 0 replies; 49+ messages in thread
From: Thomas Martitz @ 2014-08-01  9:14 UTC (permalink / raw)
  To: linux-fsdevel

Andreas Dilger <adilger <at> dilger.ca> writes:

> Since this would be an advisory syscall (i.e. it doesn't really
> return anything and cannot guarantee that all the inodes will be in
> memory), then if the filesystem is having trouble prefetching the
> inodes (e.g. invalid inode number(s) or lock ordering or contention
> issues) it could always bail out and leave it to stat() to actually
> fetch the inodes into memory when accessed.
> 

The first sentence made me wonder whether the existing fadvise64() syscall
could be reused (with the dir fd) in place of the new dirreadahead()?

If not, dirreadahead should have a flags param
(https://lwn.net/Articles/585415/)
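
For what it's worth, the fadvise route would amount to something like the
sketch below (posix_fadvise() being the usual userspace entry point for
fadvise64). Whether POSIX_FADV_WILLNEED does anything useful on a directory
fd is exactly the open question: at best it could hint the directory's own
blocks into cache, not the inodes of the entries, which is what
dirreadahead() is after.

#include <fcntl.h>

/* Hint that the whole directory will be needed soon.  A len of 0 means
 * "to the end of the file"; the return value is 0 or an error number,
 * and a 0 may well be a success that did nothing for a directory fd. */
static int dir_hint(int dirfd)
{
	return posix_fadvise(dirfd, 0, 0, POSIX_FADV_WILLNEED);
}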

Best regards.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-08-01  5:54               ` [Cluster-devel] " Andreas Dilger
@ 2014-08-06  2:01                 ` Dave Chinner
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Chinner @ 2014-08-06  2:01 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Abhijith Das, LKML, linux-fsdevel, cluster-devel

On Fri, Aug 01, 2014 at 07:54:56AM +0200, Andreas Dilger wrote:
> On Aug 1, 2014, at 1:53, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> >> None of these issues are relevant in the API that I'm thinking about. 
> >> The syscall just passes the list of inode numbers to be prefetched
> >> into kernel memory, and then stat() is used to actually get the data into
> >> userspace (or whatever other operation is to be done on them),
> >> so there is no danger if the wrong inode is prefetched.  If the inode
> >> number is bad the filesystem can just ignore it.
> > 
> > Which means the filesystem has to treat the inode number as
> > potentially hostile. i.e. it can not be trusted to be correct and so
> > must take slow paths to validate the inode numbers. This adds
> > *significant* overhead to the readahead path for some filesystems:
> > readahead is only a win if it is low cost.
> > 
> > For example, on XFS every untrusted inode number lookup requires an
> > inode btree lookup to validate the inode is actually valid on disk
> > and that it is allocated and has references. That lookup serialises
> > against inode allocation/freeing as well as other lookups. In
> > comparison, when using a trusted inode number from a directory
> > lookup within the kernel, we only need to do a couple of shift and
> > mask operations to convert it to a disk address and we are good to
> > go.
> > 
> > i.e. the difference is at least 5 orders of magnitude higher CPU usage
> > for an "inode number readahead" syscall versus a "directory
> > readahead" syscall, it has significant serialisation issues and it
> > can stall other modification/lookups going on at the same time.
> > That's *horrible behaviour* for a speculative readahead operation,
> > but because the inode numbers are untrusted, we can't avoid it.
> 
> For ext4 this is virtually free.

Yes, due to its old-school fixed inode table design. But we aren't
designing functionality exclusively for ext4 - many filesystems do
dynamic inode allocation - and so we need to take the overhead of
inode number validation into account when discussing the
effectiveness of a generic readahead mechanism.

> The same needs to happen for any
> inode number from NFS so it can't be that bad. 

NFS is RPC rate limited, not inode number validation rate limited -
the inode number lookup during file handle translation is not a
limiting factor. e.g. we might be able to do 50,000 NFS getattr RPCs
per second on a modern CPU when we hit the inode cache and 10% of
that CPU load will be file handle translation.

However, a local filesystem might be able to do 500,000 stat() calls
on cached inodes with that same CPU. Now let's add readahead that
uses 10% of a CPU per 50,000 inode numbers.  Ignoring the overhead of
everything else but filehandle translation, we run out of CPU at
500,000 inode number lookups per second. So we burn up the entire
CPU just doing inode number validation and cache lookups, without
actually doing any real work, and adding inode number based readahead
just slowed this workload down by 50%...

Readahead is not always a win....

> Also, since this API would be prefetching inodes in bulk it
> could presumably optimize this to some extent. 

So can userspace.

> > So, again, it's way more overhead than userspace just calling
> > stat() asynchronously on many files at once as readdir/getdents
> > returns dirents from the kernel to speed up cache population.
> 
> To me this seems like saying it is just as fast to submit hundreds of
> 256-byte random reads for a file as it is for large linear reads with
> readahead.  Yes, it is possible for the kernel to optimize the random
> read workload to some extent, but not as easily as getting reads
> in order in the first place. 

Strawman.

For ext4, optimised inode read ordering is as simple as dispatching
the IO in ascending inode number order. Same as for many other
filesystems that use direct indexed, fixed location tables for
inodes. And it's the same for XFS, where the inode number is a
direct encoding of its location in the address space of the
underlying block device.

Seriously, Chris Mason proved years ago how much we can gain
from simple userspace optimisations with his ACP tool.

https://oss.oracle.com/~mason/acp/

This doesn't even use threading - just orders all the stat() calls
and readahead() based on the inode number or a FIBMAP call - and it
shows how substantial the improvements can be from simple userspace
changes.
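
The core of that ordering trick is tiny. A minimal sketch of the idea (not
acp itself, error handling omitted): read the dirents, sort them by inode
number, then stat in that order so the inode reads hit the disk roughly
sequentially.

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sort dirents by inode number so the stat() calls walk the inode
 * locations in roughly ascending disk order. */
static int by_ino(const struct dirent **a, const struct dirent **b)
{
	return ((*a)->d_ino > (*b)->d_ino) - ((*a)->d_ino < (*b)->d_ino);
}

static void stat_in_inode_order(const char *path)
{
	struct dirent **names;
	struct stat st;
	int dirfd = open(path, O_RDONLY | O_DIRECTORY);
	int n = scandir(path, &names, NULL, by_ino);

	for (int i = 0; i < n; i++) {
		fstatat(dirfd, names[i]->d_name, &st, AT_SYMLINK_NOFOLLOW);
		free(names[i]);
	}
	if (n >= 0)
		free(names);
	close(dirfd);
}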

> > That's my main issue with this patchset - it's implementing
> > something in kernelspace that can *easily* be done generically in
> > userspace without introducing all sorts of nasty corner cases that
> > we have to handle in the kernel. We only add functionality to the kernel if there's a
> > compelling reason to do it in kernelspace, and right now I just
> > don't see any numbers that justify adding readdir+stat() readahead
> > or inode number based cache population in kernelspace.
> 
> The original patch showed there was definitely a significant win with
> the prefetch case over the single threaded readdir+stat.

Sure, but so is acp, hence showing what userspace optimisation gains
us before we even consider implementations using threading and/or
async stat() calls.

> There is
> also something to be said for keeping complexity out of applications.

Yes, that's why I've been saying it should be in glibc hidden behind
ftw() and friends so no application has to be changed to benefit
from the readahead on traversal workloads.
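
That is, applications already funnel these traversals through interfaces like
nftw() and fts(); ordinary usage is shown below, and a libc that sorted and
prefetched each directory's entries internally before invoking the callback
would speed this up without the caller ever knowing.

#define _XOPEN_SOURCE 700
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Plain nftw() consumer: prints size and path, like a crude 'ls -lR'. */
static int show(const char *path, const struct stat *sb, int type,
		struct FTW *ftwbuf)
{
	(void)type; (void)ftwbuf;
	printf("%10lld %s\n", (long long)sb->st_size, path);
	return 0;		/* keep walking */
}

int main(int argc, char **argv)
{
	return nftw(argc > 1 ? argv[1] : ".", show, 64, FTW_PHYS);
}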

> Sure it is possible for apps to get good performance from AIO,
> but very few do so because of complexity. 

True, but it's also true that few use AIO because they don't need
the few extra percent of performance that AIO can extract from the
system.

> > Before we add *any* syscall for directory readahead, we need
> > comparison numbers against doing the dumb multithreaded
> > userspace readahead of stat() calls. If userspace can do this as
> > fast as the kernel can....
> 
> I'd be interested to see this also, but my prediction is that this will not
> deliver the kind of improvements you are expecting. 

Do you really think I'd be saying we need numbers if I didn't have
a good idea of what the numbers are going to say? ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-07-31 23:53             ` [Cluster-devel] " Dave Chinner
@ 2014-10-21  5:21               ` Abhijith Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhijith Das @ 2014-10-21  5:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, LKML, linux-fsdevel, cluster-devel, Steven Whitehouse



----- Original Message -----
> From: "Dave Chinner" <david@fromorbit.com>
> To: "Andreas Dilger" <adilger@dilger.ca>
> Cc: "Abhijith Das" <adas@redhat.com>, "LKML" <linux-kernel@vger.kernel.org>, "linux-fsdevel"
> <linux-fsdevel@vger.kernel.org>, cluster-devel@redhat.com
> Sent: Thursday, July 31, 2014 6:53:06 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> > On Jul 31, 2014, at 6:49, Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <adas@redhat.com> wrote:
> > >>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <adilger@dilger.ca> wrote:
> > >>>> Is there a time when this doesn't get called to prefetch entries in
> > >>>> readdir() order?  It isn't clear to me what benefit there is of
> > >>>> returning
> > >>>> the entries to userspace instead of just doing the statahead
> > >>>> implicitly
> > >>>> in the kernel?
> > >>>> 
> > >>>> The Lustre client has had what we call "statahead" for a while,
> > >>>> and similar to regular file readahead it detects the sequential access
> > >>>> pattern for readdir() + stat() in readdir() order (taking into account
> > >>>> if
> > >>>> ".*"
> > >>>> entries are being processed or not) and starts fetching the inode
> > >>>> attributes asynchronously with a worker thread.
> > >>> 
> > >>> Does this heuristic work well in practice? In the use case we were
> > >>> trying to
> > >>> address, a Samba server is aware beforehand if it is going to stat all
> > >>> the
> > >>> inodes in a directory.
> > >> 
> > >> Typically this works well for us, because this is done by the Lustre
> > >> client, so the statahead is hiding the network latency of the RPCs to
> > >> fetch attributes from the server.  I imagine the same could be seen with
> > >> GFS2. I don't know if this approach would help very much for local
> > >> filesystems because the latency is low.
> > >> 
> > >>>> This syscall might be more useful if userspace called readdir() to get
> > >>>> the dirents and then passed the kernel the list of inode numbers
> > >>>> to prefetch before starting on the stat() calls. That way, userspace
> > >>>> could generate an arbitrary list of inodes (e.g. names matching a
> > >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> > >>> 
> > >>> Were you thinking arbitrary inodes across the filesystem or just a
> > >>> subset
> > >>> from a directory? Arbitrary inodes may potentially throw up locking
> > >>> issues.
> > >> 
> > >> I was thinking about inodes returned from readdir(), but the syscall
> > >> would be much more useful if it could handle arbitrary inodes.
> > > 
> > > I'm not sure we can do that. The only way to safely identify a
> > > specific inode in the filesystem from userspace is via a filehandle.
> > > Plain inode numbers are susceptible to TOCTOU race conditions that
> > > the kernel cannot resolve. Also, lookup by inode number bypasses
> > > directory access permissions, so is not something we would expose
> > > to arbitrary unprivileged users.
> > 
> > None of these issues are relevant in the API that I'm thinking about.
> > The syscall just passes the list of inode numbers to be prefetched
> > into kernel memory, and then stat() is used to actually get the data into
> > userspace (or whatever other operation is to be done on them),
> > so there is no danger if the wrong inode is prefetched.  If the inode
> > number is bad the filesystem can just ignore it.
> 
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
> 
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that it is allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
> 
> i.e. the difference is at least 5 orders of magnitude higher CPU usage
> for an "inode number readahead" syscall versus a "directory
> readahead" syscall, it has significant serialisation issues and it
> can stall other modification/lookups going on at the same time.
> That's *horrible behaviour* for a speculative readahead operation,
> but because the inode numbers are untrusted, we can't avoid it.
> 
> So, again, it's way more overhead than userspace just calling
> stat() asynchronously on many files at once as readdir/getdents
> returns dirents from the kernel to speed up cache population.
> 
> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the kernel if
> there's a
> compelling reason to do it in kernelspace, and right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.
> 
> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....
> 

Hi Dave/all,

I finally got around to playing with the multithreaded userspace readahead
idea and the results are quite promising. I tried to mimic what my kernel
readahead patch did with this userspace program (userspace_ra.c)
Source code here: https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0

Each thread has an associated buffer into which a chunk of directory
entries are read in using getdents(). Each thread then sorts the entries in
inode number order (for GFS2, this is also their disk block order) and proceeds
to cache in the inodes in that order by issuing open(2) syscalls against them.
In my tests, I backgrounded this program and issued an 'ls -l' on the dir
in question. I did the same following the kernel dirreadahead syscall as well.
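
Stripped down to a single thread and one buffer (and with error handling
dropped), the per-thread work described above amounts to roughly the sketch
below; struct linux_dirent64 is the record layout getdents64(2) returns,
defined here because it is not exported by the libc headers.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {			/* as returned by getdents64(2) */
	uint64_t	d_ino;
	int64_t		d_off;
	unsigned short	d_reclen;
	unsigned char	d_type;
	char		d_name[];
};

struct ent { uint64_t ino; char name[256]; };

static int by_ino(const void *a, const void *b)
{
	const struct ent *x = a, *y = b;
	return (x->ino > y->ino) - (x->ino < y->ino);
}

/* One pass: fill a buffer with dirents, sort them by inode number, then
 * open()/close() each entry so its inode is pulled into cache in
 * (roughly) disk order. */
static void prime_chunk(int dirfd, char *buf, size_t bufsz,
			struct ent *ents, size_t max)
{
	long nread = syscall(SYS_getdents64, dirfd, buf, bufsz);
	size_t n = 0;

	for (long pos = 0; pos < nread && n < max; n++) {
		struct linux_dirent64 *d = (void *)(buf + pos);

		ents[n].ino = d->d_ino;
		snprintf(ents[n].name, sizeof(ents[n].name), "%s", d->d_name);
		pos += d->d_reclen;
	}
	qsort(ents, n, sizeof(*ents), by_ino);

	for (size_t i = 0; i < n; i++) {
		int fd = openat(dirfd, ents[i].name, O_RDONLY | O_NOFOLLOW);
		if (fd >= 0)
			close(fd);
	}
}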

I did not manage to test out too many parameter combinations for both
userspace_ra and SYS_dirreadahead because the test matrix got pretty big and
time consuming. However, I did notice that without sorting, userspace_ra did
not perform as well in some of my tests. I haven't investigated that yet,
so the numbers shown here are all with sorting enabled.

For a directory with 100000 files,
a) simple 'ls -l' took 14m11s
b) SYS_dirreadahead + 'ls -l' took 3m9s, and
c) userspace_ra (1M buffer/thread, 32 threads) took 1m42s

https://www.dropbox.com/s/85na3hmo3qrtib1/ra_vs_u_ra_vs_ls.jpg?dl=0 is a graph
that contains a few more data points. In the graph, along with data for 'ls -l'
and SYS_dirreadahead, there are six data series for userspace_ra for each
directory size (10K, 100K and 200K files). i.e. u_ra:XXX,YYY, where XXX is one
of (64K, 1M) buffer size and YYY is one of (4, 16, 32) threads.

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-10-21  5:21               ` [Cluster-devel] " Abhijith Das
@ 2014-11-10  3:41                 ` Abhijith Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhijith Das @ 2014-11-10  3:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, LKML, linux-fsdevel, cluster-devel, Steven Whitehouse


> > 
> 
> Hi Dave/all,
> 
> I finally got around to playing with the multithreaded userspace readahead
> idea and the results are quite promising. I tried to mimic what my kernel
> readahead patch did with this userspace program (userspace_ra.c)
> Source code here:
> https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0
> 
> Each thread has an associated buffer into which a chunk of directory
> entries are read in using getdents(). Each thread then sorts the entries in
> inode number order (for GFS2, this is also their disk block order) and
> proceeds
> to cache in the inodes in that order by issuing open(2) syscalls against
> them.
> In my tests, I backgrounded this program and issued an 'ls -l' on the dir
> in question. I did the same following the kernel dirreadahead syscall as
> well.
> 
> I did not manage to test out too many parameter combinations for both
> userspace_ra and SYS_dirreadahead because the test matrix got pretty big and
> time consuming. However, I did notice that without sorting, userspace_ra did
> not perform as well in some of my tests. I haven't investigated that yet,
> so the numbers shown here are all with sorting enabled.
> 
> For a directory with 100000 files,
> a) simple 'ls -l' took 14m11s
> b) SYS_dirreadahead + 'ls -l' took 3m9s, and
> c) userspace_ra (1M buffer/thread, 32 threads) took 1m42s
> 
> https://www.dropbox.com/s/85na3hmo3qrtib1/ra_vs_u_ra_vs_ls.jpg?dl=0 is a
> graph
> that contains a few more data points. In the graph, along with data for 'ls
> -l'
> and SYS_dirreadahead, there are six data series for userspace_ra for each
> directory size (10K, 100K and 200K files). i.e. u_ra:XXX,YYY, where XXX is
> one
> of (64K, 1M) buffer size and YYY is one of (4, 16, 32) threads.
> 

Hi,

Here are some more numbers for larger directories and it seems like userspace
readahead scales well and is still a good option.

I've chosen the best-performing runs for kernel readahead and userspace readahead. I
have data for runs with different parameters (buffer size, number of threads, etc)
that I can provide, if anybody's interested.

The numbers here are total elapsed times, in seconds, for the readahead plus
'ls -l' operations to complete.

							#files in testdir
						50k	100k	200k	500k	1m
------------------------------------------------------------------------------------
Readdir 'ls -l'					11	849	1873	5024	10365
Kernel readahead + 'ls -l' (best case)		7	214	814	2330	4900
Userspace MT readahead + 'ls -l' (best case)	12	99	239	1351	4761

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-11-10  3:41                 ` [Cluster-devel] " Abhijith Das
@ 2014-11-10 22:23                   ` Andreas Dilger
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Dilger @ 2014-11-10 22:23 UTC (permalink / raw)
  To: Abhijith Das
  Cc: Dave Chinner, LKML, linux-fsdevel, cluster-devel, Steven Whitehouse

On Nov 9, 2014, at 8:41 PM, Abhijith Das <adas@redhat.com> wrote:
>> Hi Dave/all,
>> 
>> I finally got around to playing with the multithreaded userspace readahead
>> idea and the results are quite promising. I tried to mimic what my kernel
>> readahead patch did with this userspace program (userspace_ra.c)
>> Source code here:
>> https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0
>> 
>> Each thread has an associated buffer into which a chunk of directory
>> entries are read in using getdents(). Each thread then sorts the
>> entries in inode number order (for GFS2, this is also their disk
>> block order) and proceeds to cache in the inodes in that order by
>> issuing open(2) syscalls against them.  In my tests, I backgrounded
>> this program and issued an 'ls -l' on the dir in question. I did the
>> same following the kernel dirreadahead syscall as well.
>> 
>> I did not manage to test out too many parameter combinations for both
>> userspace_ra and SYS_dirreadahead because the test matrix got pretty
>> big and time consuming. However, I did notice that without sorting,
>> userspace_ra did not perform as well in some of my tests. I haven't
>> investigated that, so numbers shown here are all with sorting enabled.

One concern is for filesystems where inode order does not necessarily
match the on-disk order.  I believe that filesystems like ext4 and XFS
have matching inode/disk order, but tree-based COW filesystems like
Btrfs do not necessarily preserve this order, so sorting in userspace
will not help and may in fact hurt readahead compared to readdir order.

What filesystem(s) have you tested this besides GFS?

Cheers, Andreas

>> For a directory with 100000 files,
>> a) simple 'ls -l' took 14m11s
>> b) SYS_dirreadahead + 'ls -l' took 3m9s, and
>> c) userspace_ra (1M buffer/thread, 32 threads) took 1m42s
>> 
>> https://www.dropbox.com/s/85na3hmo3qrtib1/ra_vs_u_ra_vs_ls.jpg?dl=0 is a
>> graph
>> that contains a few more data points. In the graph, along with data for 'ls
>> -l'
>> and SYS_dirreadahead, there are six data series for userspace_ra for each
>> directory size (10K, 100K and 200K files). i.e. u_ra:XXX,YYY, where XXX is
>> one
>> of (64K, 1M) buffer size and YYY is one of (4, 16, 32) threads.
>> 
> 
> Hi,
> 
> Here are some more numbers for larger directories and it seems like
> userspace readahead scales well and is still a good option.
> 
> I've chosen the best-performing runs for kernel readahead and userspace
> readahead. I have data for runs with different parameters (buffer size,
> number of threads, etc) that I can provide, if anybody's interested.
> 
> The numbers here are total elapsed times for the readahead plus 'ls -l'
> operations to complete.
> 
> 							#files in testdir
> 						50k	100k	200k	500k	1m
> ------------------------------------------------------------------------------------
> Readdir 'ls -l'					11	849	1873	5024	10365
> Kernel readahead + 'ls -l' (best case)		7	214	814	2330	4900
> Userspace MT readahead + 'ls -l' (best case)	12	99	239	1351	4761
> 
> Cheers!
> --Abhi


Cheers, Andreas

^ permalink raw reply	[flat|nested] 49+ messages in thread


* Re: [RFC PATCH 0/2] dirreadahead system call
  2014-11-10 22:23                   ` [Cluster-devel] " Andreas Dilger
@ 2014-11-10 22:47                     ` Abhijith Das
  -1 siblings, 0 replies; 49+ messages in thread
From: Abhijith Das @ 2014-11-10 22:47 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Dave Chinner, LKML, linux-fsdevel, cluster-devel, Steven Whitehouse



----- Original Message -----
> From: "Andreas Dilger" <adilger@dilger.ca>
> To: "Abhijith Das" <adas@redhat.com>
> Cc: "Dave Chinner" <david@fromorbit.com>, "LKML" <linux-kernel@vger.kernel.org>, "linux-fsdevel"
> <linux-fsdevel@vger.kernel.org>, cluster-devel@redhat.com, "Steven Whitehouse" <swhiteho@redhat.com>
> Sent: Monday, November 10, 2014 4:23:25 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
> 
> On Nov 9, 2014, at 8:41 PM, Abhijith Das <adas@redhat.com> wrote:
> >> Hi Dave/all,
> >> 
> >> I finally got around to playing with the multithreaded userspace readahead
> >> idea and the results are quite promising. I tried to mimic what my kernel
> >> readahead patch did with this userspace program (userspace_ra.c)
> >> Source code here:
> >> https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0
> >> 
> >> Each thread has an associated buffer into which a chunk of directory
> >> entries are read in using getdents(). Each thread then sorts the
> >> entries in inode number order (for GFS2, this is also their disk
> >> block order) and proceeds to cache in the inodes in that order by
> >> issuing open(2) syscalls against them.  In my tests, I backgrounded
> >> this program and issued an 'ls -l' on the dir in question. I did the
> >> same following the kernel dirreadahead syscall as well.
> >> 
> >> I did not manage to test out too many parameter combinations for both
> >> userspace_ra and SYS_dirreadahead because the test matrix got pretty
> >> big and time consuming. However, I did notice that without sorting,
> >> userspace_ra did not perform as well in some of my tests. I haven't
> >> investigated that, so numbers shown here are all with sorting enabled.
> 
> One concern is for filesystems where inode order does not necessarily
> match the on-disk order.  I believe that filesystems like ext4 and XFS
> have matching inode/disk order, but tree-based COW filesystems like
> Btrfs do not necessarily preserve this order, so sorting in userspace
> will not help and may in fact hurt readahead compared to readdir order.

Yes. I have seen in my tests that, with sorting turned off, the multi-threaded
user program still performs better than plain readdir + 'ls -l', but not
as well as the kernel readahead. I'd imagine that if this ends up in glibc
or something, the sorting can be turned on and off based on whether the
filesystem's inode numbers and disk block order match.

> 
> What filesystem(s) have you tested this besides GFS?

I did some brief runs on ext4 in addition to GFS2, with similar results,
but I had to repurpose my machines, so I haven't had a chance to try out
other fses. Hopefully, I'll get around to it soon.

> 
> Cheers, Andreas
> 

Cheers!
--Abhi

^ permalink raw reply	[flat|nested] 49+ messages in thread


end of thread, other threads:[~2014-11-10 22:47 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-25 17:37 [RFC PATCH 0/2] dirreadahead system call Abhi Das
2014-07-25 17:37 ` [Cluster-devel] " Abhi Das
2014-07-25 17:37 ` Abhi Das
2014-07-25 17:37 ` [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks Abhi Das
2014-07-25 17:37   ` [Cluster-devel] " Abhi Das
2014-07-29  8:21   ` Michael Kerrisk
2014-07-29  8:21     ` [Cluster-devel] " Michael Kerrisk
2014-07-29  8:21     ` Michael Kerrisk
2014-07-31  3:31     ` Dave Chinner
2014-07-31  3:31       ` [Cluster-devel] " Dave Chinner
2014-07-31  3:31       ` Dave Chinner
2014-07-25 17:37 ` [RFC PATCH 2/2] gfs2: GFS2's implementation of the dir_readahead file operation Abhi Das
2014-07-25 17:37   ` [Cluster-devel] " Abhi Das
2014-07-25 17:37   ` Abhi Das
2014-07-26  5:27 ` [RFC PATCH 0/2] dirreadahead system call Andreas Dilger
2014-07-26  5:27   ` [Cluster-devel] " Andreas Dilger
2014-07-28 12:52   ` Abhijith Das
2014-07-28 12:52     ` [Cluster-devel] " Abhijith Das
2014-07-28 21:19     ` Andreas Dilger
2014-07-28 21:19       ` [Cluster-devel] " Andreas Dilger
2014-07-29  9:36       ` Steven Whitehouse
2014-07-29  9:36         ` Steven Whitehouse
2014-07-31  4:49       ` Dave Chinner
2014-07-31  4:49         ` [Cluster-devel] " Dave Chinner
2014-07-31 11:19         ` Andreas Dilger
2014-07-31 11:19           ` [Cluster-devel] " Andreas Dilger
2014-07-31 23:53           ` Dave Chinner
2014-07-31 23:53             ` [Cluster-devel] " Dave Chinner
2014-08-01  2:11             ` Abhijith Das
2014-08-01  2:11               ` [Cluster-devel] " Abhijith Das
2014-08-01  5:54             ` Andreas Dilger
2014-08-01  5:54               ` [Cluster-devel] " Andreas Dilger
2014-08-06  2:01               ` Dave Chinner
2014-08-06  2:01                 ` [Cluster-devel] " Dave Chinner
2014-10-21  5:21             ` Abhijith Das
2014-10-21  5:21               ` [Cluster-devel] " Abhijith Das
2014-11-10  3:41               ` Abhijith Das
2014-11-10  3:41                 ` [Cluster-devel] " Abhijith Das
2014-11-10 22:23                 ` Andreas Dilger
2014-11-10 22:23                   ` [Cluster-devel] " Andreas Dilger
2014-11-10 22:47                   ` Abhijith Das
2014-11-10 22:47                     ` [Cluster-devel] " Abhijith Das
2014-08-01  9:14       ` Thomas Martitz
2014-07-29  8:19 ` Michael Kerrisk
2014-07-29  8:19   ` [Cluster-devel] " Michael Kerrisk
2014-07-31  3:18 ` NeilBrown
2014-07-31  3:18   ` [Cluster-devel] " NeilBrown
2014-08-01  2:21   ` Abhijith Das
2014-08-01  2:21     ` [Cluster-devel] " Abhijith Das
