[RFC PATCH 0/7] Non-blocking buffered fs read (page cache only)

* [RFC PATCH 0/7] Non-blocking buffered fs read (page cache only)
@ 2014-09-15 20:20 ` Milosz Tanski
  0 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 20:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

This patchset introduces the ability to perform a non-blocking read from
regular files in buffered IO mode. This works only for filesystems that keep
their data in the page cache.

It does this by introducing the new syscalls readv2/writev2 and
preadv2/pwritev2. These new syscalls behave like the network sendmsg and
recvmsg syscalls in that they accept an extra flags argument (O_NONBLOCK).
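
For comparison, here is roughly what a per-call non-blocking flag looks like on
the socket side today. This is only a sketch of the existing recv() behaviour
with MSG_DONTWAIT, not code from this series; try_recv is a made-up helper
name used for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Receive without blocking by passing MSG_DONTWAIT for this one call,
 * instead of setting O_NONBLOCK on the descriptor itself.
 *
 * Returns the number of bytes received, 0 on EOF, or -1 with errno set
 * (EAGAIN or EWOULDBLOCK when nothing is queued yet).
 */
static ssize_t try_recv(int sock, void *buf, size_t len)
{
	assert(buf != NULL);
	return recv(sock, buf, len, MSG_DONTWAIT);
}
```

On an empty socket the call fails right away with EAGAIN/EWOULDBLOCK instead
of sleeping; once the peer has written something, the same call returns the
data. The descriptor's own blocking mode is left untouched, so other callers
of the same socket are not affected.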

It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
to perform buffered IO operations. Work is submitted from another thread
that performs network IO and epoll, or from other threads that perform CPU
work. This leads to increased processing latency, especially when the data is
already in the page cache.

With the new interface, applications will be able to fetch the data in
their network / CPU-bound thread(s) and defer to a threadpool only if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" requests by avoiding unnecessary queuing and having to swap out current
tasks in IO-bound worker threads.
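
The fast-path / slow-path split described above can be sketched in userspace
as follows. Note the assumptions: this does not use the O_NONBLOCK flag or the
readv2 syscall proposed in this RFC. It uses glibc's preadv2() with RWF_NOWAIT,
the per-call "don't wait" interface available in later mainline kernels, and
falls back to a plain pread() when that is unavailable. The names
try_cached_read, blocking_read and demo_read_hello are hypothetical.

```c
#define _GNU_SOURCE		/* for preadv2() and RWF_NOWAIT in glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

enum cached_read_status {
	CACHED_READ_DONE,	/* *out holds the byte count (0 means EOF) */
	CACHED_READ_DEFER,	/* not served without waiting: queue to the pool */
	CACHED_READ_ERROR	/* hard failure, errno is set */
};

/*
 * Fast path, meant for the network/CPU thread: ask for the data only if
 * it can be returned without waiting. A short count is possible.
 */
static enum cached_read_status
try_cached_read(int fd, void *buf, size_t len, off_t off, ssize_t *out)
{
	assert(buf != NULL && out != NULL);
#ifdef RWF_NOWAIT
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	*out = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
	if (*out >= 0)
		return CACHED_READ_DONE;
	switch (errno) {
	case EAGAIN:		/* data not immediately available */
	case EOPNOTSUPP:	/* flag not supported for this file or kernel */
	case ENOSYS:		/* no preadv2() at all */
	case EINVAL:		/* assumed: older kernels may reject the flag */
		return CACHED_READ_DEFER;
	default:
		return CACHED_READ_ERROR;
	}
#else
	/* Headers too old to know RWF_NOWAIT: always take the slow path. */
	(void)fd; (void)len; (void)off;
	*out = -1;
	return CACHED_READ_DEFER;
#endif
}

/* Slow path: the call a pool worker would make; may sleep on disk IO. */
static ssize_t blocking_read(int fd, void *buf, size_t len, off_t off)
{
	return pread(fd, buf, len, off);
}

/* Demo: write five bytes to a temp file and read them back. 0 on success. */
static int demo_read_hello(void)
{
	char path[] = "/tmp/nbread-XXXXXX";
	char buf[5];
	ssize_t n = -1;
	int ok, fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (write(fd, "hello", 5) != 5) {
		close(fd);
		return -1;
	}
	switch (try_cached_read(fd, buf, sizeof(buf), 0, &n)) {
	case CACHED_READ_DONE:
		break;
	case CACHED_READ_DEFER:	/* a real server would queue this instead */
		n = blocking_read(fd, buf, sizeof(buf), 0);
		break;
	case CACHED_READ_ERROR:
		n = -1;
		break;
	}
	ok = (n == 5 && memcmp(buf, "hello", 5) == 0);
	close(fd);
	return ok ? 0 : -1;
}
```

In a real server the CACHED_READ_DEFER branch would hand the request to the
threadpool rather than calling blocking_read() inline; it is called directly
here only to keep the sketch single-threaded.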

I have co-developed these changes with Christoph Hellwig; a whole lot of his
fixes went into the first patch in the series (they were squashed with his
approval).

I am going to post the perf report in a reply to this RFC.

Christoph Hellwig (3):
  documentation updates
  move flags enforcement to vfs_preadv/vfs_pwritev
  check for O_NONBLOCK in all read_iter instances

Milosz Tanski (4):
  Prepare for adding a new readv/writev with user flags.
  Define new syscalls readv2,preadv2,writev2,pwritev2
  Export new vector IO (with flags) to userland
  O_NONBLOCK flag for readv2/preadv2

 Documentation/filesystems/Locking |    4 +-
 Documentation/filesystems/vfs.txt |    4 +-
 arch/x86/syscalls/syscall_32.tbl  |    4 +
 arch/x86/syscalls/syscall_64.tbl  |    4 +
 drivers/target/target_core_file.c |    6 +-
 fs/afs/internal.h                 |    2 +-
 fs/afs/write.c                    |    4 +-
 fs/aio.c                          |    4 +-
 fs/block_dev.c                    |    9 ++-
 fs/btrfs/file.c                   |    2 +-
 fs/ceph/file.c                    |   10 ++-
 fs/cifs/cifsfs.c                  |    9 ++-
 fs/cifs/cifsfs.h                  |   12 ++-
 fs/cifs/file.c                    |   30 +++++---
 fs/ecryptfs/file.c                |    4 +-
 fs/ext4/file.c                    |    4 +-
 fs/fuse/file.c                    |   10 ++-
 fs/gfs2/file.c                    |    5 +-
 fs/nfs/file.c                     |   13 ++--
 fs/nfs/internal.h                 |    4 +-
 fs/nfsd/vfs.c                     |    4 +-
 fs/ocfs2/file.c                   |   13 +++-
 fs/pipe.c                         |    7 +-
 fs/read_write.c                   |  146 +++++++++++++++++++++++++++++++------
 fs/splice.c                       |    4 +-
 fs/ubifs/file.c                   |    5 +-
 fs/udf/file.c                     |    5 +-
 fs/xfs/xfs_file.c                 |   12 ++-
 include/linux/fs.h                |   16 ++--
 include/linux/syscalls.h          |   12 +++
 include/uapi/asm-generic/unistd.h |   10 ++-
 mm/filemap.c                      |   34 +++++++--
 mm/shmem.c                        |    6 +-
 33 files changed, 306 insertions(+), 112 deletions(-)

-- 
1.7.9.5

* [PATCH 1/7] Prepare for adding a new readv/writev with user flags.
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:20   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 20:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Plumb the flags argument through the vfs code so it can be passed down to the
__generic_file_(read/write)_iter functions that do the actual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 drivers/target/target_core_file.c |    6 +++---
 fs/afs/internal.h                 |    2 +-
 fs/afs/write.c                    |    4 ++--
 fs/aio.c                          |    4 ++--
 fs/block_dev.c                    |    9 +++++----
 fs/btrfs/file.c                   |    2 +-
 fs/ceph/file.c                    |    8 +++++---
 fs/cifs/cifsfs.c                  |    9 +++++----
 fs/cifs/cifsfs.h                  |   12 ++++++++----
 fs/cifs/file.c                    |   24 ++++++++++++------------
 fs/ecryptfs/file.c                |    4 ++--
 fs/ext4/file.c                    |    4 ++--
 fs/fuse/file.c                    |   10 ++++++----
 fs/gfs2/file.c                    |    5 +++--
 fs/nfs/file.c                     |    8 ++++----
 fs/nfs/internal.h                 |    4 ++--
 fs/nfsd/vfs.c                     |    4 ++--
 fs/ocfs2/file.c                   |    7 ++++---
 fs/pipe.c                         |    4 ++--
 fs/read_write.c                   |   35 +++++++++++++++++++----------------
 fs/splice.c                       |    4 ++--
 fs/ubifs/file.c                   |    5 +++--
 fs/udf/file.c                     |    5 +++--
 fs/xfs/xfs_file.c                 |    8 +++++---
 include/linux/fs.h                |   16 ++++++++--------
 mm/filemap.c                      |   13 +++++++------
 mm/shmem.c                        |    2 +-
 27 files changed, 119 insertions(+), 99 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 71d5982..50bf5cd 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -747,7 +747,7 @@ extern int afs_write_end(struct file *file, struct address_space *mapping,
 extern int afs_writepage(struct page *, struct writeback_control *);
 extern int afs_writepages(struct address_space *, struct writeback_control *);
 extern void afs_pages_written_back(struct afs_vnode *, struct afs_call *);
-extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *);
+extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *, int flags);
 extern int afs_writeback_all(struct afs_vnode *);
 extern int afs_fsync(struct file *, loff_t, loff_t, int);
 
diff --git a/fs/afs/write.c b/fs/afs/write.c
index ab6adfd..7330860 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -625,7 +625,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 /*
  * write to an AFS file
  */
-ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from)
+ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct afs_vnode *vnode = AFS_FS_I(file_inode(iocb->ki_filp));
 	ssize_t result;
@@ -643,7 +643,7 @@ ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	if (!count)
 		return 0;
 
-	result = generic_file_write_iter(iocb, from);
+	result = generic_file_write_iter(iocb, from, flags);
 	if (IS_ERR_VALUE(result)) {
 		_leave(" = %zd", result);
 		return result;
diff --git a/fs/aio.c b/fs/aio.c
index 7337500..c17227f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1309,7 +1309,7 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
 
 typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
 			    unsigned long, loff_t);
-typedef ssize_t (rw_iter_op)(struct kiocb *, struct iov_iter *);
+typedef ssize_t (rw_iter_op)(struct kiocb *, struct iov_iter *, int flags);
 
 static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb,
 				     int rw, char __user *buf,
@@ -1421,7 +1421,7 @@ rw_common:
 
 		if (iter_op) {
 			iov_iter_init(&iter, rw, iovec, nr_segs, req->ki_nbytes);
-			ret = iter_op(req, &iter);
+			ret = iter_op(req, &iter, 0);
 		} else {
 			ret = rw_op(req, iovec, nr_segs, req->ki_pos);
 		}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..04b203b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1572,14 +1572,14 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
  * Does not take i_mutex for the write and thus is not for general purpose
  * use.
  */
-ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct blk_plug plug;
 	ssize_t ret;
 
 	blk_start_plug(&plug);
-	ret = __generic_file_write_iter(iocb, from);
+	ret = __generic_file_write_iter(iocb, from, flags);
 	if (ret > 0) {
 		ssize_t err;
 		err = generic_write_sync(file, iocb->ki_pos - ret, ret);
@@ -1591,7 +1591,8 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
 
-static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to,
+			        int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *bd_inode = file->f_mapping->host;
@@ -1603,7 +1604,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	size -= pos;
 	iov_iter_truncate(to, size);
-	return generic_file_read_iter(iocb, to);
+	return generic_file_read_iter(iocb, to, flags);
 }
 
 /*
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ff1cc03..ad72a21 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1726,7 +1726,7 @@ static void update_time_for_write(struct inode *inode)
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+				    struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 2eb02f8..4776257 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -781,7 +781,8 @@ out:
  *
  * Hmm, the sync read case isn't actually async... should it be?
  */
-static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to,
+			      int flags)
 {
 	struct file *filp = iocb->ki_filp;
 	struct ceph_file_info *fi = filp->private_data;
@@ -819,7 +820,7 @@ again:
 		     inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
 		     ceph_cap_string(got));
 
-		ret = generic_file_read_iter(iocb, to);
+		ret = generic_file_read_iter(iocb, to, flags);
 	}
 	dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
 	     inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
@@ -860,7 +861,8 @@ again:
  *
  * If we are near ENOSPC, write synchronously.
  */
-static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from,
+			       int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct ceph_file_info *fi = file->private_data;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 889b984..079ad6c 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -739,7 +739,7 @@ out_nls:
 }
 
 static ssize_t
-cifs_loose_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+cifs_loose_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
 {
 	ssize_t rc;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -748,10 +748,11 @@ cifs_loose_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	if (rc)
 		return rc;
 
-	return generic_file_read_iter(iocb, iter);
+	return generic_file_read_iter(iocb, iter, flags);
 }
 
-static ssize_t cifs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t cifs_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct cifsInodeInfo *cinode = CIFS_I(inode);
@@ -762,7 +763,7 @@ static ssize_t cifs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (written)
 		return written;
 
-	written = generic_file_write_iter(iocb, from);
+	written = generic_file_write_iter(iocb, from, flags);
 
 	if (CIFS_CACHE_WRITE(CIFS_I(inode)))
 		goto out;
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index b0fafa4..6c44582 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -95,10 +95,14 @@ extern const struct file_operations cifs_file_strict_nobrl_ops;
 extern int cifs_open(struct inode *inode, struct file *file);
 extern int cifs_close(struct inode *inode, struct file *file);
 extern int cifs_closedir(struct inode *inode, struct file *file);
-extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to);
-extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
-extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
-extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to,
+		int flags);
+extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to,
+		int flags);
+extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from,
+		int flags);
+extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from,
+		int flags);
 extern int cifs_lock(struct file *, int, struct file_lock *);
 extern int cifs_fsync(struct file *, loff_t, loff_t, int);
 extern int cifs_strict_fsync(struct file *, loff_t, loff_t, int);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 7c018a1..58aecd7 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2658,7 +2658,7 @@ restart_loop:
 	return total_written ? total_written : (ssize_t)rc;
 }
 
-ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
+ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	ssize_t written;
 	struct inode *inode;
@@ -2682,7 +2682,7 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 }
 
 static ssize_t
-cifs_writev(struct kiocb *iocb, struct iov_iter *from)
+cifs_writev(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct cifsFileInfo *cfile = (struct cifsFileInfo *)file->private_data;
@@ -2703,7 +2703,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
 	if (!cifs_find_lock_conflict(cfile, lock_pos, iov_iter_count(from),
 				     server->vals->exclusive_lock_type, NULL,
 				     CIFS_WRITE_OP)) {
-		rc = __generic_file_write_iter(iocb, from);
+		rc = __generic_file_write_iter(iocb, from, flags);
 		mutex_unlock(&inode->i_mutex);
 
 		if (rc > 0) {
@@ -2721,7 +2721,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
 }
 
 ssize_t
-cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
+cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct cifsInodeInfo *cinode = CIFS_I(inode);
@@ -2739,10 +2739,10 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
 		if (cap_unix(tcon->ses) &&
 		(CIFS_UNIX_FCNTL_CAP & le64_to_cpu(tcon->fsUnixInfo.Capability))
 		  && ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0)) {
-			written = generic_file_write_iter(iocb, from);
+			written = generic_file_write_iter(iocb, from, flags);
 			goto out;
 		}
-		written = cifs_writev(iocb, from);
+		written = cifs_writev(iocb, from, flags);
 		goto out;
 	}
 	/*
@@ -2751,7 +2751,7 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
 	 * affected pages because it may cause a error with mandatory locks on
 	 * these pages but not on the region from pos to ppos+len-1.
 	 */
-	written = cifs_user_writev(iocb, from);
+	written = cifs_user_writev(iocb, from, flags);
 	if (written > 0 && CIFS_CACHE_READ(cinode)) {
 		/*
 		 * Windows 7 server can delay breaking level2 oplock if a write
@@ -2992,7 +2992,7 @@ error:
 	return rc;
 }
 
-ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
+ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t rc;
@@ -3097,7 +3097,7 @@ again:
 }
 
 ssize_t
-cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
+cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct cifsInodeInfo *cinode = CIFS_I(inode);
@@ -3116,12 +3116,12 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	 * pos+len-1.
 	 */
 	if (!CIFS_CACHE_READ(cinode))
-		return cifs_user_readv(iocb, to);
+		return cifs_user_readv(iocb, to, flags);
 
 	if (cap_unix(tcon->ses) &&
 	    (CIFS_UNIX_FCNTL_CAP & le64_to_cpu(tcon->fsUnixInfo.Capability)) &&
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
-		return generic_file_read_iter(iocb, to);
+		return generic_file_read_iter(iocb, to, flags);
 
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
@@ -3131,7 +3131,7 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
 				     tcon->ses->server->vals->shared_lock_type,
 				     NULL, CIFS_READ_OP))
-		rc = generic_file_read_iter(iocb, to);
+		rc = generic_file_read_iter(iocb, to, flags);
 	up_read(&cinode->lock_sem);
 	return rc;
 }
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index db0fad3..bec0a0e 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -45,13 +45,13 @@
  * The function to be used for directory reads is ecryptfs_read.
  */
 static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
-				struct iov_iter *to)
+				struct iov_iter *to, int flags)
 {
 	ssize_t rc;
 	struct path *path;
 	struct file *file = iocb->ki_filp;
 
-	rc = generic_file_read_iter(iocb, to);
+	rc = generic_file_read_iter(iocb, to, flags);
 	/*
 	 * Even though this is a async interface, we need to wait
 	 * for IO to finish to update atime
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..565de78 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -89,7 +89,7 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
 }
 
 static ssize_t
-ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -172,7 +172,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		}
 	}
 
-	ret = __generic_file_write_iter(iocb, from);
+	ret = __generic_file_write_iter(iocb, from, flags);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 912061a..fdf1711 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -933,7 +933,8 @@ out:
 	return err;
 }
 
-static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to,
+				   int flags)
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
@@ -951,7 +952,7 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return err;
 	}
 
-	return generic_file_read_iter(iocb, to);
+	return generic_file_read_iter(iocb, to, flags);
 }
 
 static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff,
@@ -1180,7 +1181,8 @@ static ssize_t fuse_perform_write(struct file *file,
 	return res > 0 ? res : err;
 }
 
-static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+				    int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
@@ -1198,7 +1200,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err)
 			return err;
 
-		return generic_file_write_iter(iocb, from);
+		return generic_file_write_iter(iocb, from, flags);
 	}
 
 	mutex_lock(&inode->i_mutex);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 26b3f95..6146ffe 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -697,7 +697,8 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
  *
  */
 
-static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file_inode(file));
@@ -718,7 +719,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		gfs2_glock_dq_uninit(&gh);
 	}
 
-	return generic_file_write_iter(iocb, from);
+	return generic_file_write_iter(iocb, from, flags);
 }
 
 static int fallocate_chunk(struct inode *inode, loff_t offset, loff_t len,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80..4072f3a 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -165,7 +165,7 @@ nfs_file_flush(struct file *file, fl_owner_t id)
 EXPORT_SYMBOL_GPL(nfs_file_flush);
 
 ssize_t
-nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
+nfs_file_read(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t result;
@@ -179,7 +179,7 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 
 	result = nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
 	if (!result) {
-		result = generic_file_read_iter(iocb, to);
+		result = generic_file_read_iter(iocb, to, flags);
 		if (result > 0)
 			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
 	}
@@ -634,7 +634,7 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode)
 	return 0;
 }
 
-ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
+ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -669,7 +669,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	if (!count)
 		goto out;
 
-	result = generic_file_write_iter(iocb, from);
+	result = generic_file_write_iter(iocb, from, flags);
 	if (result > 0)
 		written = result;
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9056622..646ad13c 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -337,11 +337,11 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
 int nfs_file_fsync_commit(struct file *, loff_t, loff_t, int);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 int nfs_file_flush(struct file *, fl_owner_t);
-ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
+ssize_t nfs_file_read(struct kiocb *, struct iov_iter *, int);
 ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
 			     size_t, unsigned int);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
-ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
+ssize_t nfs_file_write(struct kiocb *, struct iov_iter *, int);
 int nfs_file_release(struct inode *, struct file *);
 int nfs_lock(struct file *, int, struct file_lock *);
 int nfs_flock(struct file *, int, struct file_lock *);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f501a9b..db7a31d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -855,7 +855,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -943,7 +943,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2930e23..418c8a3 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2234,7 +2234,7 @@ out:
 }
 
 static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+				    struct iov_iter *from, int flags)
 {
 	int ret, direct_io, appending, rw_level, have_alloc_sem  = 0;
 	int can_do_direct, has_refcount = 0;
@@ -2461,7 +2461,8 @@ bail:
 }
 
 static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
-				   struct iov_iter *to)
+				   struct iov_iter *to,
+				   int flags)
 {
 	int ret = 0, rw_level = -1, have_alloc_sem = 0, lock_level = 0;
 	struct file *filp = iocb->ki_filp;
@@ -2516,7 +2517,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 	}
 	ocfs2_inode_unlock(inode, lock_level);
 
-	ret = generic_file_read_iter(iocb, to);
+	ret = generic_file_read_iter(iocb, to, flags);
 	trace_generic_file_aio_read_ret(ret);
 
 	/* buffered aio wouldn't have proper lock coverage today */
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..d2510ab 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -227,7 +227,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = {
 };
 
 static ssize_t
-pipe_read(struct kiocb *iocb, struct iov_iter *to)
+pipe_read(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	size_t total_len = iov_iter_count(to);
 	struct file *filp = iocb->ki_filp;
@@ -336,7 +336,7 @@ static inline int is_packetized(struct file *file)
 }
 
 static ssize_t
-pipe_write(struct kiocb *iocb, struct iov_iter *from)
+pipe_write(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
diff --git a/fs/read_write.c b/fs/read_write.c
index 009d854..4747247 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -25,7 +25,7 @@
 typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *);
 typedef ssize_t (*iov_fn_t)(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
-typedef ssize_t (*iter_fn_t)(struct kiocb *, struct iov_iter *);
+typedef ssize_t (*iter_fn_t)(struct kiocb *, struct iov_iter *, int);
 
 const struct file_operations generic_ro_fops = {
 	.llseek		= generic_file_llseek,
@@ -403,7 +403,7 @@ ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *p
 	kiocb.ki_nbytes = len;
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
-	ret = filp->f_op->read_iter(&kiocb, &iter);
+	ret = filp->f_op->read_iter(&kiocb, &iter, 0);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
@@ -475,7 +475,7 @@ ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, lo
 	kiocb.ki_nbytes = len;
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
-	ret = filp->f_op->write_iter(&kiocb, &iter);
+	ret = filp->f_op->write_iter(&kiocb, &iter, 0);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
@@ -651,7 +651,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -662,7 +663,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	kiocb.ki_nbytes = len;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
-	ret = fn(&kiocb, &iter);
+	ret = fn(&kiocb, &iter, flags);
 	if (ret == -EIOCBQUEUED)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
@@ -798,7 +799,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -832,7 +834,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -855,27 +857,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -888,7 +890,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -908,7 +910,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -940,7 +942,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -964,7 +966,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1012,7 +1014,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1111,6 +1113,7 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 	return __compat_sys_preadv64(fd, vec, vlen, pos);
 }
 
+
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
 			    unsigned long vlen, loff_t *pos)
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..a466a86 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
@@ -1018,7 +1018,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		kiocb.ki_nbytes = sd.total_len - left;
 
 		/* now, send it */
-		ret = out->f_op->write_iter(&kiocb, &from);
+		ret = out->f_op->write_iter(&kiocb, &from, 0);
 		if (-EIOCBQUEUED == ret)
 			ret = wait_on_sync_kiocb(&kiocb);
 
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index b5b593c..51100c4 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1397,13 +1397,14 @@ static int update_mctime(struct inode *inode)
 	return 0;
 }
 
-static ssize_t ubifs_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t ubifs_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	int err = update_mctime(file_inode(iocb->ki_filp));
 	if (err)
 		return err;
 
-	return generic_file_write_iter(iocb, from);
+	return generic_file_write_iter(iocb, from, flags);
 }
 
 static int ubifs_set_page_dirty(struct page *page)
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 86c6743..e903b96 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -116,7 +116,8 @@ const struct address_space_operations udf_adinicb_aops = {
 	.direct_IO	= udf_adinicb_direct_IO,
 };
 
-static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	ssize_t retval;
 	struct file *file = iocb->ki_filp;
@@ -152,7 +153,7 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	} else
 		up_write(&iinfo->i_data_sem);
 
-	retval = __generic_file_write_iter(iocb, from);
+	retval = __generic_file_write_iter(iocb, from, flags);
 	mutex_unlock(&inode->i_mutex);
 
 	if (retval > 0) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de5368c..68847008 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -232,7 +232,8 @@ xfs_file_fsync(
 STATIC ssize_t
 xfs_file_read_iter(
 	struct kiocb		*iocb,
-	struct iov_iter		*to)
+	struct iov_iter		*to,
+	int			flags)
 {
 	struct file		*file = iocb->ki_filp;
 	struct inode		*inode = file->f_mapping->host;
@@ -313,7 +314,7 @@ xfs_file_read_iter(
 
 	trace_xfs_file_read(ip, size, pos, ioflags);
 
-	ret = generic_file_read_iter(iocb, to);
+	ret = generic_file_read_iter(iocb, to, flags);
 	if (ret > 0)
 		XFS_STATS_ADD(xs_read_bytes, ret);
 
@@ -743,7 +744,8 @@ out:
 STATIC ssize_t
 xfs_file_write_iter(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			flags)
 {
 	struct file		*file = iocb->ki_filp;
 	struct address_space	*mapping = file->f_mapping;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..62ea9f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,8 +1486,8 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
-	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
-	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, int);
+	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, int);
 	int (*iterate) (struct file *, struct dir_context *);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1556,9 +1556,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
@@ -2444,9 +2444,9 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr,
 		unsigned long size, pgoff_t pgoff);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
-extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
-extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
-extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
+extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *, int flags);
+extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *, int flags);
+extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *, int flags);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *, loff_t);
 extern ssize_t generic_perform_write(struct file *, struct iov_iter *, loff_t);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
@@ -2455,7 +2455,7 @@ extern ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 extern ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 
 /* fs/block_dev.c */
-extern ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
 			int datasync);
 extern void block_sync_page(struct page *page);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..c95edbf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1456,7 +1456,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1683,7 +1683,7 @@ out:
  * that can use the page cache directly.
  */
 ssize_t
-generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t retval = 0;
@@ -1726,7 +1726,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, flags);
 out:
 	return retval;
 }
@@ -2549,7 +2549,8 @@ EXPORT_SYMBOL(generic_perform_write);
  * A caller has to handle it. This is mainly due to the fact that we want to
  * avoid syncing under i_mutex.
  */
-ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+				  int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2645,14 +2646,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
  * filesystems. It takes care of syncing the file in case of O_SYNC file
  * and acquires i_mutex as needed.
  */
-ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_write_iter(iocb, from);
+	ret = __generic_file_write_iter(iocb, from, flags);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 0e5fb22..24c73bce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1519,7 +1519,7 @@ shmem_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
-- 
1.7.9.5



* [PATCH 1/7] Prepare for adding a new readv/writev with user flags.
@ 2014-09-15 20:20   ` Milosz Tanski
From: Milosz Tanski @ 2014-09-15 20:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Plumb the flags argument through the VFS code so that it can be passed down to
the __generic_file_(read/write)_iter functions that do the actual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 drivers/target/target_core_file.c |    6 +++---
 fs/afs/internal.h                 |    2 +-
 fs/afs/write.c                    |    4 ++--
 fs/aio.c                          |    4 ++--
 fs/block_dev.c                    |    9 +++++----
 fs/btrfs/file.c                   |    2 +-
 fs/ceph/file.c                    |    8 +++++---
 fs/cifs/cifsfs.c                  |    9 +++++----
 fs/cifs/cifsfs.h                  |   12 ++++++++----
 fs/cifs/file.c                    |   24 ++++++++++++------------
 fs/ecryptfs/file.c                |    4 ++--
 fs/ext4/file.c                    |    4 ++--
 fs/fuse/file.c                    |   10 ++++++----
 fs/gfs2/file.c                    |    5 +++--
 fs/nfs/file.c                     |    8 ++++----
 fs/nfs/internal.h                 |    4 ++--
 fs/nfsd/vfs.c                     |    4 ++--
 fs/ocfs2/file.c                   |    7 ++++---
 fs/pipe.c                         |    4 ++--
 fs/read_write.c                   |   35 +++++++++++++++++++----------------
 fs/splice.c                       |    4 ++--
 fs/ubifs/file.c                   |    5 +++--
 fs/udf/file.c                     |    5 +++--
 fs/xfs/xfs_file.c                 |    8 +++++---
 include/linux/fs.h                |   16 ++++++++--------
 mm/filemap.c                      |   13 +++++++------
 mm/shmem.c                        |    2 +-
 27 files changed, 119 insertions(+), 99 deletions(-)
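
Note (illustration only, not part of the patch): the plumbing described in the
changelog can be modelled in userspace roughly as below. This is a minimal
sketch under stated assumptions. Every demo_* name and DEMO_FLAG is
hypothetical and does not exist in the kernel; the real callbacks take a
struct kiocb and a struct iov_iter rather than a plain buffer. The sketch only
shows the shape of the change: the callback gains a trailing int, each wrapper
forwards it unchanged, and pre-existing entry points pass 0.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical flag value, used only to show that the value arrives intact. */
#define DEMO_FLAG 0x1

struct demo_file;

/* Stand-in for file_operations: read_iter gains a trailing flags argument. */
struct demo_file_ops {
	ssize_t (*read_iter)(struct demo_file *f, char *buf, size_t len,
			     int flags);
};

struct demo_file {
	const struct demo_file_ops *ops;
	const char *data;	/* pretend page-cache contents */
	int last_flags;		/* records what reached the lowest layer */
};

/* Lowest layer: plays the role of the generic helper that does the work. */
static ssize_t demo_generic_read_iter(struct demo_file *f, char *buf,
				      size_t len, int flags)
{
	size_t n = strlen(f->data);

	if (n > len)
		n = len;
	memcpy(buf, f->data, n);
	f->last_flags = flags;
	return (ssize_t)n;
}

/* A filesystem's read_iter: does its own work, then forwards flags as-is. */
static ssize_t demo_fs_read_iter(struct demo_file *f, char *buf, size_t len,
				 int flags)
{
	return demo_generic_read_iter(f, buf, len, flags);
}

static const struct demo_file_ops demo_fs_ops = {
	.read_iter = demo_fs_read_iter,
};

/* VFS-level helper: the caller chooses the flags. */
static ssize_t demo_vfs_read(struct demo_file *f, char *buf, size_t len,
			     int flags)
{
	return f->ops->read_iter(f, buf, len, flags);
}

/* Pre-existing entry point: passes 0, so its behaviour does not change. */
static ssize_t demo_legacy_read(struct demo_file *f, char *buf, size_t len)
{
	return demo_vfs_read(f, buf, len, 0);
}
```

A caller that goes through demo_legacy_read() always hands 0 to the lowest
layer, while a caller of demo_vfs_read() sees its flags value arrive there
untouched. That is the property this patch is meant to establish before any
new syscall starts passing a non-zero value.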

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 71d5982..50bf5cd 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -747,7 +747,7 @@ extern int afs_write_end(struct file *file, struct address_space *mapping,
 extern int afs_writepage(struct page *, struct writeback_control *);
 extern int afs_writepages(struct address_space *, struct writeback_control *);
 extern void afs_pages_written_back(struct afs_vnode *, struct afs_call *);
-extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *);
+extern ssize_t afs_file_write(struct kiocb *, struct iov_iter *, int flags);
 extern int afs_writeback_all(struct afs_vnode *);
 extern int afs_fsync(struct file *, loff_t, loff_t, int);
 
diff --git a/fs/afs/write.c b/fs/afs/write.c
index ab6adfd..7330860 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -625,7 +625,7 @@ void afs_pages_written_back(struct afs_vnode *vnode, struct afs_call *call)
 /*
  * write to an AFS file
  */
-ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from)
+ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct afs_vnode *vnode = AFS_FS_I(file_inode(iocb->ki_filp));
 	ssize_t result;
@@ -643,7 +643,7 @@ ssize_t afs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	if (!count)
 		return 0;
 
-	result = generic_file_write_iter(iocb, from);
+	result = generic_file_write_iter(iocb, from, flags);
 	if (IS_ERR_VALUE(result)) {
 		_leave(" = %zd", result);
 		return result;
diff --git a/fs/aio.c b/fs/aio.c
index 7337500..c17227f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1309,7 +1309,7 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
 
 typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
 			    unsigned long, loff_t);
-typedef ssize_t (rw_iter_op)(struct kiocb *, struct iov_iter *);
+typedef ssize_t (rw_iter_op)(struct kiocb *, struct iov_iter *, int flags);
 
 static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb,
 				     int rw, char __user *buf,
@@ -1421,7 +1421,7 @@ rw_common:
 
 		if (iter_op) {
 			iov_iter_init(&iter, rw, iovec, nr_segs, req->ki_nbytes);
-			ret = iter_op(req, &iter);
+			ret = iter_op(req, &iter, 0);
 		} else {
 			ret = rw_op(req, iovec, nr_segs, req->ki_pos);
 		}
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..04b203b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1572,14 +1572,14 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
  * Does not take i_mutex for the write and thus is not for general purpose
  * use.
  */
-ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct blk_plug plug;
 	ssize_t ret;
 
 	blk_start_plug(&plug);
-	ret = __generic_file_write_iter(iocb, from);
+	ret = __generic_file_write_iter(iocb, from, flags);
 	if (ret > 0) {
 		ssize_t err;
 		err = generic_write_sync(file, iocb->ki_pos - ret, ret);
@@ -1591,7 +1591,8 @@ ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL_GPL(blkdev_write_iter);
 
-static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to,
+			        int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *bd_inode = file->f_mapping->host;
@@ -1603,7 +1604,7 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	size -= pos;
 	iov_iter_truncate(to, size);
-	return generic_file_read_iter(iocb, to);
+	return generic_file_read_iter(iocb, to, flags);
 }
 
 /*
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ff1cc03..ad72a21 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1726,7 +1726,7 @@ static void update_time_for_write(struct inode *inode)
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+				    struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 2eb02f8..4776257 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -781,7 +781,8 @@ out:
  *
  * Hmm, the sync read case isn't actually async... should it be?
  */
-static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to,
+			      int flags)
 {
 	struct file *filp = iocb->ki_filp;
 	struct ceph_file_info *fi = filp->private_data;
@@ -819,7 +820,7 @@ again:
 		     inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
 		     ceph_cap_string(got));
 
-		ret = generic_file_read_iter(iocb, to);
+		ret = generic_file_read_iter(iocb, to, flags);
 	}
 	dout("aio_read %p %llx.%llx dropping cap refs on %s = %d\n",
 	     inode, ceph_vinop(inode), ceph_cap_string(got), (int)ret);
@@ -860,7 +861,8 @@ again:
  *
  * If we are near ENOSPC, write synchronously.
  */
-static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from,
+			       int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct ceph_file_info *fi = file->private_data;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 889b984..079ad6c 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -739,7 +739,7 @@ out_nls:
 }
 
 static ssize_t
-cifs_loose_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+cifs_loose_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
 {
 	ssize_t rc;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -748,10 +748,11 @@ cifs_loose_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	if (rc)
 		return rc;
 
-	return generic_file_read_iter(iocb, iter);
+	return generic_file_read_iter(iocb, iter, flags);
 }
 
-static ssize_t cifs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t cifs_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct cifsInodeInfo *cinode = CIFS_I(inode);
@@ -762,7 +763,7 @@ static ssize_t cifs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (written)
 		return written;
 
-	written = generic_file_write_iter(iocb, from);
+	written = generic_file_write_iter(iocb, from, flags);
 
 	if (CIFS_CACHE_WRITE(CIFS_I(inode)))
 		goto out;
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index b0fafa4..6c44582 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -95,10 +95,14 @@ extern const struct file_operations cifs_file_strict_nobrl_ops;
 extern int cifs_open(struct inode *inode, struct file *file);
 extern int cifs_close(struct inode *inode, struct file *file);
 extern int cifs_closedir(struct inode *inode, struct file *file);
-extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to);
-extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
-extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
-extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to,
+		int flags);
+extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to,
+		int flags);
+extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from,
+		int flags);
+extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from,
+		int flags);
 extern int cifs_lock(struct file *, int, struct file_lock *);
 extern int cifs_fsync(struct file *, loff_t, loff_t, int);
 extern int cifs_strict_fsync(struct file *, loff_t, loff_t, int);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 7c018a1..58aecd7 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2658,7 +2658,7 @@ restart_loop:
 	return total_written ? total_written : (ssize_t)rc;
 }
 
-ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
+ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	ssize_t written;
 	struct inode *inode;
@@ -2682,7 +2682,7 @@ ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from)
 }
 
 static ssize_t
-cifs_writev(struct kiocb *iocb, struct iov_iter *from)
+cifs_writev(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct cifsFileInfo *cfile = (struct cifsFileInfo *)file->private_data;
@@ -2703,7 +2703,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
 	if (!cifs_find_lock_conflict(cfile, lock_pos, iov_iter_count(from),
 				     server->vals->exclusive_lock_type, NULL,
 				     CIFS_WRITE_OP)) {
-		rc = __generic_file_write_iter(iocb, from);
+		rc = __generic_file_write_iter(iocb, from, flags);
 		mutex_unlock(&inode->i_mutex);
 
 		if (rc > 0) {
@@ -2721,7 +2721,7 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
 }
 
 ssize_t
-cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
+cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct cifsInodeInfo *cinode = CIFS_I(inode);
@@ -2739,10 +2739,10 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
 		if (cap_unix(tcon->ses) &&
 		(CIFS_UNIX_FCNTL_CAP & le64_to_cpu(tcon->fsUnixInfo.Capability))
 		  && ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0)) {
-			written = generic_file_write_iter(iocb, from);
+			written = generic_file_write_iter(iocb, from, flags);
 			goto out;
 		}
-		written = cifs_writev(iocb, from);
+		written = cifs_writev(iocb, from, flags);
 		goto out;
 	}
 	/*
@@ -2751,7 +2751,7 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
 	 * affected pages because it may cause a error with mandatory locks on
 	 * these pages but not on the region from pos to ppos+len-1.
 	 */
-	written = cifs_user_writev(iocb, from);
+	written = cifs_user_writev(iocb, from, flags);
 	if (written > 0 && CIFS_CACHE_READ(cinode)) {
 		/*
 		 * Windows 7 server can delay breaking level2 oplock if a write
@@ -2992,7 +2992,7 @@ error:
 	return rc;
 }
 
-ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
+ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t rc;
@@ -3097,7 +3097,7 @@ again:
 }
 
 ssize_t
-cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
+cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct cifsInodeInfo *cinode = CIFS_I(inode);
@@ -3116,12 +3116,12 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	 * pos+len-1.
 	 */
 	if (!CIFS_CACHE_READ(cinode))
-		return cifs_user_readv(iocb, to);
+		return cifs_user_readv(iocb, to, flags);
 
 	if (cap_unix(tcon->ses) &&
 	    (CIFS_UNIX_FCNTL_CAP & le64_to_cpu(tcon->fsUnixInfo.Capability)) &&
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
-		return generic_file_read_iter(iocb, to);
+		return generic_file_read_iter(iocb, to, flags);
 
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
@@ -3131,7 +3131,7 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
 				     tcon->ses->server->vals->shared_lock_type,
 				     NULL, CIFS_READ_OP))
-		rc = generic_file_read_iter(iocb, to);
+		rc = generic_file_read_iter(iocb, to, flags);
 	up_read(&cinode->lock_sem);
 	return rc;
 }
diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index db0fad3..bec0a0e 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -45,13 +45,13 @@
  * The function to be used for directory reads is ecryptfs_read.
  */
 static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
-				struct iov_iter *to)
+				struct iov_iter *to, int flags)
 {
 	ssize_t rc;
 	struct path *path;
 	struct file *file = iocb->ki_filp;
 
-	rc = generic_file_read_iter(iocb, to);
+	rc = generic_file_read_iter(iocb, to, flags);
 	/*
 	 * Even though this is a async interface, we need to wait
 	 * for IO to finish to update atime
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index aca7b24..565de78 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -89,7 +89,7 @@ ext4_unaligned_aio(struct inode *inode, struct iov_iter *from, loff_t pos)
 }
 
 static ssize_t
-ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(iocb->ki_filp);
@@ -172,7 +172,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		}
 	}
 
-	ret = __generic_file_write_iter(iocb, from);
+	ret = __generic_file_write_iter(iocb, from, flags);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 912061a..fdf1711 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -933,7 +933,8 @@ out:
 	return err;
 }
 
-static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to,
+				   int flags)
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	struct fuse_conn *fc = get_fuse_conn(inode);
@@ -951,7 +952,7 @@ static ssize_t fuse_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return err;
 	}
 
-	return generic_file_read_iter(iocb, to);
+	return generic_file_read_iter(iocb, to, flags);
 }
 
 static void fuse_write_fill(struct fuse_req *req, struct fuse_file *ff,
@@ -1180,7 +1181,8 @@ static ssize_t fuse_perform_write(struct file *file,
 	return res > 0 ? res : err;
 }
 
-static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+				    int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space *mapping = file->f_mapping;
@@ -1198,7 +1200,7 @@ static ssize_t fuse_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (err)
 			return err;
 
-		return generic_file_write_iter(iocb, from);
+		return generic_file_write_iter(iocb, from, flags);
 	}
 
 	mutex_lock(&inode->i_mutex);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 26b3f95..6146ffe 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -697,7 +697,8 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
  *
  */
 
-static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct gfs2_inode *ip = GFS2_I(file_inode(file));
@@ -718,7 +719,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		gfs2_glock_dq_uninit(&gh);
 	}
 
-	return generic_file_write_iter(iocb, from);
+	return generic_file_write_iter(iocb, from, flags);
 }
 
 static int fallocate_chunk(struct inode *inode, loff_t offset, loff_t len,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80..4072f3a 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -165,7 +165,7 @@ nfs_file_flush(struct file *file, fl_owner_t id)
 EXPORT_SYMBOL_GPL(nfs_file_flush);
 
 ssize_t
-nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
+nfs_file_read(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t result;
@@ -179,7 +179,7 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 
 	result = nfs_revalidate_mapping(inode, iocb->ki_filp->f_mapping);
 	if (!result) {
-		result = generic_file_read_iter(iocb, to);
+		result = generic_file_read_iter(iocb, to, flags);
 		if (result > 0)
 			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
 	}
@@ -634,7 +634,7 @@ static int nfs_need_sync_write(struct file *filp, struct inode *inode)
 	return 0;
 }
 
-ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
+ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
@@ -669,7 +669,7 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
 	if (!count)
 		goto out;
 
-	result = generic_file_write_iter(iocb, from);
+	result = generic_file_write_iter(iocb, from, flags);
 	if (result > 0)
 		written = result;
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9056622..646ad13c 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -337,11 +337,11 @@ int nfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *)
 int nfs_file_fsync_commit(struct file *, loff_t, loff_t, int);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 int nfs_file_flush(struct file *, fl_owner_t);
-ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
+ssize_t nfs_file_read(struct kiocb *, struct iov_iter *, int);
 ssize_t nfs_file_splice_read(struct file *, loff_t *, struct pipe_inode_info *,
 			     size_t, unsigned int);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
-ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
+ssize_t nfs_file_write(struct kiocb *, struct iov_iter *, int);
 int nfs_file_release(struct inode *, struct file *);
 int nfs_lock(struct file *, int, struct file_lock *);
 int nfs_flock(struct file *, int, struct file_lock *);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f501a9b..db7a31d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -855,7 +855,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -943,7 +943,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2930e23..418c8a3 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2234,7 +2234,7 @@ out:
 }
 
 static ssize_t ocfs2_file_write_iter(struct kiocb *iocb,
-				    struct iov_iter *from)
+				    struct iov_iter *from, int flags)
 {
 	int ret, direct_io, appending, rw_level, have_alloc_sem  = 0;
 	int can_do_direct, has_refcount = 0;
@@ -2461,7 +2461,8 @@ bail:
 }
 
 static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
-				   struct iov_iter *to)
+				   struct iov_iter *to,
+				   int flags)
 {
 	int ret = 0, rw_level = -1, have_alloc_sem = 0, lock_level = 0;
 	struct file *filp = iocb->ki_filp;
@@ -2516,7 +2517,7 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 	}
 	ocfs2_inode_unlock(inode, lock_level);
 
-	ret = generic_file_read_iter(iocb, to);
+	ret = generic_file_read_iter(iocb, to, flags);
 	trace_generic_file_aio_read_ret(ret);
 
 	/* buffered aio wouldn't have proper lock coverage today */
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..d2510ab 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -227,7 +227,7 @@ static const struct pipe_buf_operations packet_pipe_buf_ops = {
 };
 
 static ssize_t
-pipe_read(struct kiocb *iocb, struct iov_iter *to)
+pipe_read(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	size_t total_len = iov_iter_count(to);
 	struct file *filp = iocb->ki_filp;
@@ -336,7 +336,7 @@ static inline int is_packetized(struct file *file)
 }
 
 static ssize_t
-pipe_write(struct kiocb *iocb, struct iov_iter *from)
+pipe_write(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *filp = iocb->ki_filp;
 	struct pipe_inode_info *pipe = filp->private_data;
diff --git a/fs/read_write.c b/fs/read_write.c
index 009d854..4747247 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -25,7 +25,7 @@
 typedef ssize_t (*io_fn_t)(struct file *, char __user *, size_t, loff_t *);
 typedef ssize_t (*iov_fn_t)(struct kiocb *, const struct iovec *,
 		unsigned long, loff_t);
-typedef ssize_t (*iter_fn_t)(struct kiocb *, struct iov_iter *);
+typedef ssize_t (*iter_fn_t)(struct kiocb *, struct iov_iter *, int);
 
 const struct file_operations generic_ro_fops = {
 	.llseek		= generic_file_llseek,
@@ -403,7 +403,7 @@ ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *p
 	kiocb.ki_nbytes = len;
 	iov_iter_init(&iter, READ, &iov, 1, len);
 
-	ret = filp->f_op->read_iter(&kiocb, &iter);
+	ret = filp->f_op->read_iter(&kiocb, &iter, 0);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
@@ -475,7 +475,7 @@ ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, lo
 	kiocb.ki_nbytes = len;
 	iov_iter_init(&iter, WRITE, &iov, 1, len);
 
-	ret = filp->f_op->write_iter(&kiocb, &iter);
+	ret = filp->f_op->write_iter(&kiocb, &iter, 0);
 	if (-EIOCBQUEUED == ret)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
@@ -651,7 +651,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -662,7 +663,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	kiocb.ki_nbytes = len;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
-	ret = fn(&kiocb, &iter);
+	ret = fn(&kiocb, &iter, flags);
 	if (ret == -EIOCBQUEUED)
 		ret = wait_on_sync_kiocb(&kiocb);
 	*ppos = kiocb.ki_pos;
@@ -798,7 +799,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -832,7 +834,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -855,27 +857,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -888,7 +890,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -908,7 +910,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -940,7 +942,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -964,7 +966,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1012,7 +1014,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1111,6 +1113,7 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 	return __compat_sys_preadv64(fd, vec, vlen, pos);
 }
 
+
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
 			    unsigned long vlen, loff_t *pos)
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..a466a86 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
@@ -1018,7 +1018,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		kiocb.ki_nbytes = sd.total_len - left;
 
 		/* now, send it */
-		ret = out->f_op->write_iter(&kiocb, &from);
+		ret = out->f_op->write_iter(&kiocb, &from, 0);
 		if (-EIOCBQUEUED == ret)
 			ret = wait_on_sync_kiocb(&kiocb);
 
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index b5b593c..51100c4 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1397,13 +1397,14 @@ static int update_mctime(struct inode *inode)
 	return 0;
 }
 
-static ssize_t ubifs_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t ubifs_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	int err = update_mctime(file_inode(iocb->ki_filp));
 	if (err)
 		return err;
 
-	return generic_file_write_iter(iocb, from);
+	return generic_file_write_iter(iocb, from, flags);
 }
 
 static int ubifs_set_page_dirty(struct page *page)
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 86c6743..e903b96 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -116,7 +116,8 @@ const struct address_space_operations udf_adinicb_aops = {
 	.direct_IO	= udf_adinicb_direct_IO,
 };
 
-static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+		int flags)
 {
 	ssize_t retval;
 	struct file *file = iocb->ki_filp;
@@ -152,7 +153,7 @@ static ssize_t udf_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	} else
 		up_write(&iinfo->i_data_sem);
 
-	retval = __generic_file_write_iter(iocb, from);
+	retval = __generic_file_write_iter(iocb, from, flags);
 	mutex_unlock(&inode->i_mutex);
 
 	if (retval > 0) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de5368c..68847008 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -232,7 +232,8 @@ xfs_file_fsync(
 STATIC ssize_t
 xfs_file_read_iter(
 	struct kiocb		*iocb,
-	struct iov_iter		*to)
+	struct iov_iter		*to,
+	int			flags)
 {
 	struct file		*file = iocb->ki_filp;
 	struct inode		*inode = file->f_mapping->host;
@@ -313,7 +314,7 @@ xfs_file_read_iter(
 
 	trace_xfs_file_read(ip, size, pos, ioflags);
 
-	ret = generic_file_read_iter(iocb, to);
+	ret = generic_file_read_iter(iocb, to, flags);
 	if (ret > 0)
 		XFS_STATS_ADD(xs_read_bytes, ret);
 
@@ -743,7 +744,8 @@ out:
 STATIC ssize_t
 xfs_file_write_iter(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			flags)
 {
 	struct file		*file = iocb->ki_filp;
 	struct address_space	*mapping = file->f_mapping;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..62ea9f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1486,8 +1486,8 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
-	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
-	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, int);
+	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, int);
 	int (*iterate) (struct file *, struct dir_context *);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
@@ -1556,9 +1556,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
@@ -2444,9 +2444,9 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_remap_pages(struct vm_area_struct *, unsigned long addr,
 		unsigned long size, pgoff_t pgoff);
 int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk);
-extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
-extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
-extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
+extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *, int flags);
+extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *, int flags);
+extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *, int flags);
 extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *, loff_t);
 extern ssize_t generic_perform_write(struct file *, struct iov_iter *, loff_t);
 extern ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos);
@@ -2455,7 +2455,7 @@ extern ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 extern ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos);
 
 /* fs/block_dev.c */
-extern ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
 			int datasync);
 extern void block_sync_page(struct page *page);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..c95edbf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1456,7 +1456,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1683,7 +1683,7 @@ out:
  * that can use the page cache directly.
  */
 ssize_t
-generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t retval = 0;
@@ -1726,7 +1726,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, flags);
 out:
 	return retval;
 }
@@ -2549,7 +2549,8 @@ EXPORT_SYMBOL(generic_perform_write);
  * A caller has to handle it. This is mainly due to the fact that we want to
  * avoid syncing under i_mutex.
  */
-ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from,
+				  int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct address_space * mapping = file->f_mapping;
@@ -2645,14 +2646,14 @@ EXPORT_SYMBOL(__generic_file_write_iter);
  * filesystems. It takes care of syncing the file in case of O_SYNC file
  * and acquires i_mutex as needed.
  */
-ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ssize_t generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 	ssize_t ret;
 
 	mutex_lock(&inode->i_mutex);
-	ret = __generic_file_write_iter(iocb, from);
+	ret = __generic_file_write_iter(iocb, from, flags);
 	mutex_unlock(&inode->i_mutex);
 
 	if (ret > 0) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 0e5fb22..24c73bce 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1519,7 +1519,7 @@ shmem_write_end(struct file *file, struct address_space *mapping,
 	return copied;
 }
 
-static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
+static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to, int flags)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 167+ messages in thread
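
The patch above is mostly mechanical: ->read_iter/->write_iter gain an int
flags argument, existing callers pass 0, and only the new entry points
forward a caller-supplied value. A minimal user-space sketch of that pattern
follows. Every name in it (demo_ops, demo_read_iter, demo_readv, demo_readv2,
last_flags_seen) is hypothetical and stands in for the kernel structures; it
is not code from the series.

```c
#include <stddef.h>

/* Hypothetical stand-in for the read_iter slot in struct file_operations. */
struct demo_ops {
	long (*read_iter)(void *buf, size_t len, int flags);
};

/* Records the flags value the "filesystem" method was called with. */
static int last_flags_seen = -1;

/* Hypothetical filesystem method: pretends to read len bytes. */
static long demo_read_iter(void *buf, size_t len, int flags)
{
	(void)buf;
	last_flags_seen = flags;
	return (long)len;
}

/* Legacy entry point: has no flags argument, so it always passes 0. */
static long demo_readv(const struct demo_ops *ops, void *buf, size_t len)
{
	return ops->read_iter(buf, len, 0);
}

/* New entry point: forwards the caller's flags unchanged. */
static long demo_readv2(const struct demo_ops *ops, void *buf, size_t len,
			int flags)
{
	return ops->read_iter(buf, len, flags);
}
```

Passing 0 from the old paths keeps their behaviour unchanged, which is why
the conversion can be done tree-wide before any flag is actually defined.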

* [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:20   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 20:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Add new syscalls that take an extra flags argument. For now only flags == 0
is accepted; any other value fails with -EINVAL.
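
A minimal user-space sketch of the flags check, assuming a hypothetical
RWV2_SUPPORTED_FLAGS mask and check_rw_flags() helper (neither is in the
patch, which open-codes the test as "flags & ~0"):

```c
#include <errno.h>

/*
 * Hypothetical mask of accepted flag bits. It is 0 here to mirror this
 * patch, where no flag is defined yet.
 */
#define RWV2_SUPPORTED_FLAGS 0

/* Return 0 if every set bit in flags is supported, -EINVAL otherwise. */
static int check_rw_flags(int flags)
{
	if (flags & ~RWV2_SUPPORTED_FLAGS)
		return -EINVAL;
	return 0;
}
```

With the mask at 0 the test reduces to "flags != 0". Rejecting unknown bits
from the start is what lets a later patch define a flag without changing the
meaning of calls that already work.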

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c                   |  100 +++++++++++++++++++++++++++++++++++++
 include/linux/syscalls.h          |   12 +++++
 include/uapi/asm-generic/unistd.h |   10 +++-
 3 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 4747247..6c5030a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -902,6 +902,29 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE4(readv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, int, flags)
+{
+	struct fd f;
+	ssize_t ret = -EBADF;
+
+	if (flags & ~0)
+		return -EINVAL;
+	f = fdget_pos(fd);
+	if (f.file) {
+		loff_t pos = file_pos_read(f.file);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
+		if (ret >= 0)
+			file_pos_write(f.file, pos);
+		fdput_pos(f);
+	}
+
+	if (ret > 0)
+		add_rchar(current, ret);
+	inc_syscr(current);
+	return ret;
+}
+
 SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen)
 {
@@ -922,6 +945,29 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE4(writev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, int, flags)
+{
+	struct fd f;
+	ssize_t ret = -EBADF;
+
+	if (flags & ~0)
+		return -EINVAL;
+	f = fdget_pos(fd);
+	if (f.file) {
+		loff_t pos = file_pos_read(f.file);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
+		if (ret >= 0)
+			file_pos_write(f.file, pos);
+		fdput_pos(f);
+	}
+
+	if (ret > 0)
+		add_wchar(current, ret);
+	inc_syscw(current);
+	return ret;
+}
+
 static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 {
 #define HALF_LONG_BITS (BITS_PER_LONG / 2)
@@ -952,6 +998,33 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+	struct fd f;
+	ssize_t ret = -EBADF;
+
+	if (flags & ~0)
+		return -EINVAL;
+	if (pos < 0)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (f.file) {
+		ret = -ESPIPE;
+		if (f.file->f_mode & FMODE_PREAD)
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
+		fdput(f);
+	}
+
+	if (ret > 0)
+		add_rchar(current, ret);
+	inc_syscr(current);
+	return ret;
+}
+
 SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
 {
@@ -976,6 +1049,33 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+	struct fd f;
+	ssize_t ret = -EBADF;
+
+	if (flags & ~0)
+		return -EINVAL;
+	if (pos < 0)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (f.file) {
+		ret = -ESPIPE;
+		if (f.file->f_mode & FMODE_PWRITE)
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
+		fdput(f);
+	}
+
+	if (ret > 0)
+		add_wchar(current, ret);
+	inc_syscw(current);
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..0c49ae4 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -559,19 +559,31 @@ asmlinkage long sys_readahead(int fd, loff_t offset, size_t count);
 asmlinkage long sys_readv(unsigned long fd,
 			  const struct iovec __user *vec,
 			  unsigned long vlen);
+asmlinkage long sys_readv2(unsigned long fd,
+			  const struct iovec __user *vec,
+			  unsigned long vlen, int flags);
 asmlinkage long sys_write(unsigned int fd, const char __user *buf,
 			  size_t count);
 asmlinkage long sys_writev(unsigned long fd,
 			   const struct iovec __user *vec,
 			   unsigned long vlen);
+asmlinkage long sys_writev2(unsigned long fd,
+			    const struct iovec __user *vec,
+			    unsigned long vlen, int flags);
 asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
 			    size_t count, loff_t pos);
 asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc..75ad687 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,14 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_readv2 280
+__SYSCALL(__NR_readv2, sys_readv2)
+#define __NR_writev2 281
+__SYSCALL(__NR_writev2, sys_writev2)
+#define __NR_preadv2 282
+__SYSCALL(__NR_preadv2, sys_preadv2)
+#define __NR_pwritev2 283
+__SYSCALL(__NR_pwritev2, sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -707,7 +715,7 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 284
 
 /*
  * All syscalls below here should go away really,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 167+ messages in thread
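
preadv2 and pwritev2 in the patch above take the file offset as two unsigned
long halves (pos_l, pos_h) and rebuild it with pos_from_hilo(). A user-space
sketch of that helper follows, assuming int64_t in place of loff_t and a
HALF_LONG_BITS derived from sizeof(unsigned long); the kernel version lives
in fs/read_write.c and uses BITS_PER_LONG.

```c
#include <limits.h>
#include <stdint.h>

/* Half the width of unsigned long, in bits (16 or 32 on common ABIs). */
#define HALF_LONG_BITS (sizeof(unsigned long) * CHAR_BIT / 2)

/*
 * Combine the high and low halves of a file offset. The shift is done in
 * two half-width steps because a single shift by the full width of the
 * type would be undefined behaviour when unsigned long is 64 bits wide.
 * On such ABIs the high half is shifted out entirely and only low remains.
 */
static int64_t pos_from_hilo(unsigned long high, unsigned long low)
{
	uint64_t hi = ((uint64_t)high << HALF_LONG_BITS) << HALF_LONG_BITS;

	return (int64_t)(hi | low);
}
```

On a 64-bit ABI the whole offset therefore travels in pos_l, while a 32-bit
ABI splits it across both arguments. The "pos < 0" check in the syscalls
then rejects offsets whose top bit is set.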

* [PATCH 3/7] Export new vector IO (with flags) to userland
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:20   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 20:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

This wires up the new syscalls on x86_64 and x86 only. Other architectures
will be added later.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 arch/x86/syscalls/syscall_32.tbl |    4 ++++
 arch/x86/syscalls/syscall_64.tbl |    4 ++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b781..ed85dca 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,7 @@
 354	i386	seccomp			sys_seccomp
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
+357	i386	readv2			sys_readv2
+358	i386	writev2			sys_writev2
+359	i386	preadv2			sys_preadv2
+360	i386	pwritev2		sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922..76d9f60 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,10 @@
 318	common	getrandom		sys_getrandom
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
+321	64	readv2			sys_readv2
+322	64	writev2			sys_writev2
+323	64	preadv2			sys_preadv2
+324	64	pwritev2		sys_pwritev2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 4/7] O_NONBLOCK flag for readv2/preadv2
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:21   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Filesystems that use generic_file_read_iter will now be allowed to perform
non-blocking reads. This will only read data if it's in the page cache and if
there is no page error (causing a re-read).

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c |    4 ++--
 mm/filemap.c    |   21 +++++++++++++++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 6c5030a..87a6034 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -908,7 +908,7 @@ SYSCALL_DEFINE4(readv2, unsigned long, fd, const struct iovec __user *, vec,
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
-	if (flags & ~0)
+	if (flags & ~O_NONBLOCK)
 		return -EINVAL;
 
 	if (f.file) {
@@ -1006,7 +1006,7 @@ SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
 	struct fd f;
 	ssize_t ret = -EBADF;
 
-	if (flags & ~0)
+	if (flags & ~O_NONBLOCK)
 		return -EINVAL;
 	if (pos < 0)
 		return -EINVAL;
diff --git a/mm/filemap.c b/mm/filemap.c
index c95edbf..5b72572 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1483,7 +1483,10 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
+
 		if (!page) {
+			if (flags & O_NONBLOCK)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1575,6 +1578,11 @@ page_ok:
 		continue;
 
 page_not_up_to_date:
+		if (flags & O_NONBLOCK) {
+			page_cache_release(page);
+			goto would_block;
+		}
+
 		/* Get exclusive access to the page ... */
 		error = lock_page_killable(page);
 		if (unlikely(error))
@@ -1594,6 +1602,12 @@ page_not_up_to_date_locked:
 			goto page_ok;
 		}
 
+		if (flags & O_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto would_block;
+		}
+
 readpage:
 		/*
 		 * A previous I/O error may have been due to temporary
@@ -1662,6 +1676,10 @@ no_cached_page:
 			goto out;
 		}
 		goto readpage;
+
+would_block:
+		error = -EAGAIN;
+		goto out;
 	}
 
 out:
@@ -1697,6 +1715,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
 		size_t count = iov_iter_count(iter);
 		loff_t size;
 
+		if (flags & O_NONBLOCK)
+			return -EAGAIN;
+
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
-- 
1.7.9.5
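
The control flow added to mm/filemap.c above can be modelled in user space.
What follows is a simplified sketch, not kernel code: toy_page, toy_read(),
toy_readpage() and TOY_NONBLOCK are hypothetical names made up for
illustration. It assumes that bytes already copied take precedence over the
error, so a non-blocking read that reaches an uncached page midway returns a
short read rather than -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define TOY_PAGE_SIZE 4
#define TOY_NONBLOCK  1	/* hypothetical stand-in for the O_NONBLOCK flag */

enum toy_state { TOY_ABSENT, TOY_NOT_UPTODATE, TOY_UPTODATE };

struct toy_page {
	enum toy_state state;
	char data[TOY_PAGE_SIZE];
};

/* Pretend to do blocking I/O: fill the page and mark it up to date. */
static void toy_readpage(struct toy_page *page, size_t index)
{
	memset(page->data, 'a' + (int)index, TOY_PAGE_SIZE);
	page->state = TOY_UPTODATE;
}

/* Copy up to len bytes from the toy cache, never "blocking" if asked not to. */
static ssize_t toy_read(struct toy_page *pages, size_t npages,
			char *buf, size_t len, int flags)
{
	ssize_t written = 0;
	int error = 0;
	size_t index;

	assert(buf != NULL);

	for (index = 0; index < npages && (size_t)written < len; index++) {
		struct toy_page *page = &pages[index];
		size_t n = len - (size_t)written;

		if (page->state != TOY_UPTODATE) {
			if (flags & TOY_NONBLOCK) {
				error = -EAGAIN;	/* plays the role of would_block */
				break;
			}
			toy_readpage(page, index);
		}
		if (n > TOY_PAGE_SIZE)
			n = TOY_PAGE_SIZE;
		memcpy(buf + written, page->data, n);
		written += (ssize_t)n;
	}
	/* Data already copied wins over the error. */
	return written ? written : error;
}
```

In this model a caller that gets a short count or -EAGAIN would retry the
remainder without TOY_NONBLOCK, which corresponds to handing the request to a
worker thread.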


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 5/7] documentation updates
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:21   ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-15 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Acked-by: Milosz Tanski <milosz@adfin.com>
---
 Documentation/filesystems/Locking |    4 ++--
 Documentation/filesystems/vfs.txt |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index f1997e9..08f8fdd 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -434,8 +434,8 @@ prototypes:
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
-	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
-	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, int);
+	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, int);
 	int (*iterate) (struct file *, struct dir_context *);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 61d65cc..c86f5f8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -806,8 +806,8 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
 	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
-	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
-	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *, int);
+	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *, int);
 	int (*iterate) (struct file *, struct dir_context *);
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 6/7] move flags enforcement to vfs_preadv/vfs_pwritev
  2014-09-15 20:20 ` Milosz Tanski
                   ` (5 preceding siblings ...)
  (?)
@ 2014-09-15 20:21 ` Christoph Hellwig
  2014-09-15 21:15     ` Christoph Hellwig
  -1 siblings, 1 reply; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-15 20:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Acked-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c |   14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 87a6034..233ea98 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -863,6 +863,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~O_NONBLOCK)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -876,6 +878,8 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
@@ -908,9 +912,6 @@ SYSCALL_DEFINE4(readv2, unsigned long, fd, const struct iovec __user *, vec,
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
-	if (flags & ~O_NONBLOCK)
-		return -EINVAL;
-
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
 		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
@@ -951,9 +952,6 @@ SYSCALL_DEFINE4(writev2, unsigned long, fd, const struct iovec __user *, vec,
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
-	if (flags & ~0)
-		return -EINVAL;
-
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
 		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
@@ -1006,8 +1004,6 @@ SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
 	struct fd f;
 	ssize_t ret = -EBADF;
 
-	if (flags & ~O_NONBLOCK)
-		return -EINVAL;
 	if (pos < 0)
 		return -EINVAL;
 
@@ -1057,8 +1053,6 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 	struct fd f;
 	ssize_t ret = -EBADF;
 
-	if (flags & ~0)
-		return -EINVAL;
 	if (pos < 0)
 		return -EINVAL;
 
-- 
1.7.9.5
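
The flag check that this patch moves into vfs_readv/vfs_writev is a plain
"reject any bit outside the allowed mask" test: reads accept only O_NONBLOCK,
writes accept no flags at all. A minimal user-space sketch of that idiom
follows; check_rw_flags(), check_read_flags() and check_write_flags() are
hypothetical helper names, not functions from the patch.

```c
#include <assert.h>	/* used by the usage lines below, not by the helpers */
#include <errno.h>
#include <fcntl.h>

/* Return 0 if every bit set in flags is also set in allowed, else -EINVAL. */
static int check_rw_flags(int flags, int allowed)
{
	if (flags & ~allowed)
		return -EINVAL;
	return 0;
}

/* Reads may carry O_NONBLOCK only. */
static int check_read_flags(int flags)
{
	return check_rw_flags(flags, O_NONBLOCK);
}

/* Writes take no flags yet; "flags & ~0" is simply "flags != 0". */
static int check_write_flags(int flags)
{
	return check_rw_flags(flags, 0);
}
```

With these helpers, check_read_flags(O_NONBLOCK) succeeds while
check_write_flags(O_NONBLOCK) and check_read_flags(O_NONBLOCK | O_APPEND)
both fail with -EINVAL.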


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:22   ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-15 20:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Acked-by: Milosz Tanski <milosz@adfin.com>
---
 fs/ceph/file.c    |    2 ++
 fs/cifs/file.c    |    6 ++++++
 fs/nfs/file.c     |    5 ++++-
 fs/ocfs2/file.c   |    6 ++++++
 fs/pipe.c         |    3 ++-
 fs/read_write.c   |   17 +++++++++++------
 fs/xfs/xfs_file.c |    4 ++++
 mm/shmem.c        |    4 ++++
 8 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 4776257..b62e3a5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -808,6 +808,8 @@ again:
 	if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
 	    (iocb->ki_filp->f_flags & O_DIRECT) ||
 	    (fi->flags & CEPH_F_SYNC)) {
+		if (flags & O_NONBLOCK)
+			return -EAGAIN;
 
 		dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
 		     inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 58aecd7..cda6b9e 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to, int flags)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (flags & O_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to, int flags)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to, flags);
 
+	if (flags & O_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 4072f3a..116bed2 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -170,8 +170,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to, int flags)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t result;
 
-	if (iocb->ki_filp->f_flags & O_DIRECT)
+	if (iocb->ki_filp->f_flags & O_DIRECT) {
+		if (flags & O_NONBLOCK)
+			return -EAGAIN;
 		return nfs_file_direct_read(iocb, to, iocb->ki_pos, true);
+	}
 
 	dprintk("NFS: read(%pD2, %zu@%lu)\n",
 		iocb->ki_filp,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 418c8a3..a8f7def 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2474,6 +2474,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (flags & O_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index d2510ab..74d6749 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to, int flags)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (flags & O_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index 233ea98..4a5f82c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -832,14 +832,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & O_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 68847008..0252a31 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -247,6 +247,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (flags & O_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/mm/shmem.c b/mm/shmem.c
index 24c73bce..f5ad85f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to, int
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (flags & O_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 20:27   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 20:27 UTC (permalink / raw)
  To: LKML
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

As promised, here is some performance data. I ended up copying the
posix AIO engine and hacking it up to support the preadv2 syscall to
perform a "fast read" in the submit thread. Below are my observations,
followed by test data on a local filesystem (ext4) for two different
test cases (the second one being more of a realistic case). I also
tried this with a remote filesystem (Ceph), where I was able to get a
much better latency improvement.
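
A "fast read" of this kind can be sketched in user space as follows. The
O_NONBLOCK flag and syscall numbers in this RFC are not a stable ABI, so the
sketch assumes a later mainline kernel and glibc instead, where (to my
understanding) preadv2() and the RWF_NOWAIT flag are available. The names
fast_read_or_fallback() and make_demo_file() are hypothetical, and the
blocking pread() stands in for handing the request to the thread pool.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Try a page-cache-only read first. On a miss, or when the kernel or libc
 * lacks support, fall back to a blocking pread(). *was_fast reports which
 * path produced the data.
 */
static ssize_t fast_read_or_fallback(int fd, void *buf, size_t len,
				     off_t off, bool *was_fast)
{
	assert(was_fast != NULL);
	*was_fast = false;
#ifdef RWF_NOWAIT
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

	if (n >= 0) {
		*was_fast = true;
		return n;	/* may be short if only part was cached */
	}
	if (errno != EAGAIN && errno != EOPNOTSUPP &&
	    errno != ENOSYS && errno != EINVAL)
		return -1;	/* a real error; errno is set */
#endif
	/* A real server would queue the request to its worker pool here. */
	return pread(fd, buf, len, off);
}

/* Demo helper: create an unlinked temp file holding text, return its fd. */
static int make_demo_file(const char *text)
{
	char path[] = "/tmp/fastread-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the file lives on until fd is closed */
	if (write(fd, text, strlen(text)) != (ssize_t)strlen(text)) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The caller gets the same data on either path; was_fast only tells it whether
the read was served without leaving the submitting thread.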

- I tested two workloads. One is a primarily cached workload and the
other simulates a more complex workload that tries to mimic what we
would see in our db nodes.
- In the mostly cached case the bandwidth doesn't increase, but the
request latency is much better. Here the bottleneck on total bandwidth
is probably the single submission thread.
- In the second case we generally see the same thing. Bandwidth is
more or less the same, and request latency is much better for random
reads of cached data and for sequential reads (due to the kernel's
readahead detection). Request latency for random uncached data is
worse (since we do two syscalls).
- Posix AIO probably suffers due to synchronization; it could be
improved by a lockless mpmc queue and an aggressive spin before a
sleeping wait.
- I can probably improve the uncached latency to be within the margin
of error if I add miss detection to the submission code (don't try a
fast read for a while if a high percentage of those fail).
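
The miss-detection idea in the last bullet could look roughly like this. It
is a hypothetical sketch, not code from the engine used for these numbers:
struct miss_tracker, the window size, the hit threshold and the back-off
length are all made-up illustrative values.

```c
#include <assert.h>
#include <stdbool.h>

#define MT_WINDOW   64	/* fast-read attempts per measurement window */
#define MT_MIN_HITS 16	/* fewer hits than this in a window: back off */
#define MT_BACKOFF  256	/* requests that bypass the fast path afterwards */

struct miss_tracker {
	unsigned attempts;	/* fast reads tried in the current window */
	unsigned hits;		/* of those, how many the page cache served */
	unsigned skip;		/* remaining requests that skip the fast path */
};

/* Should the submitter attempt a fast read for the next request? */
static bool mt_should_try(struct miss_tracker *mt)
{
	if (mt->skip > 0) {
		mt->skip--;
		return false;
	}
	return true;
}

/* Record the outcome of one fast-read attempt. */
static void mt_record(struct miss_tracker *mt, bool hit)
{
	mt->attempts++;
	if (hit)
		mt->hits++;
	if (mt->attempts < MT_WINDOW)
		return;
	/* Window complete: back off if too few attempts hit the cache. */
	if (mt->hits < MT_MIN_HITS)
		mt->skip = MT_BACKOFF;
	mt->attempts = 0;
	mt->hits = 0;
}
```

The submitter would call mt_should_try() before each request and mt_record()
after each attempted fast read, so an uncached workload pays for the extra
syscall only during the sampling windows.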

There is a lot of room for improvement, but even in its crude state it
helps similar apps (those built around a threaded IO worker pool).

Simple in-memory workload (mostly cached), 16kb blocks:

posix_aio:

bw (KB  /s): min=    5, max=29125, per=100.00%, avg=17662.31, stdev=4735.36
lat (usec) : 100=0.17%, 250=0.02%, 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.08%, 10=0.54%, 20=2.97%, 50=40.26%
lat (msec) : 100=49.41%, 250=6.31%, 500=0.21%
READ: io=5171.4MB, aggrb=17649KB/s, minb=17649KB/s, maxb=17649KB/s, mint=300030msec, maxt=300030msec

posix_aio w/ fast_read:

bw (KB  /s): min=   15, max=38624, per=100.00%, avg=17977.23, stdev=6043.56
lat (usec) : 2=84.33%, 4=0.01%, 10=0.01%, 20=0.01%
lat (msec) : 50=0.01%, 100=0.01%, 250=0.48%, 500=14.45%, 750=0.67%
lat (msec) : 1000=0.05%
READ: io=5235.4MB, aggrb=17849KB/s, minb=17849KB/s, maxb=17849KB/s, mint=300341msec, maxt=300341msec

Complex workload (simulating our DB access pattern), 16kb blocks:

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

posix_aio:

f1:
    bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10, stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s, mint=600001msec, maxt=600113msec

posix_aio w/ fast_read:

f1:
    bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50, stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s, mint=600020msec, maxt=600178msec

On Mon, Sep 15, 2014 at 4:20 PM, Milosz Tanski <milosz@adfin.com> wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This only works for those filesystems
> that have data in the page cache.
>
> It does this by introducing new syscalls readv2/writev2 and
> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
> syscalls that accept an extra flag argument (O_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large
> threadpool to perform buffered IO operations. They submit the work from
> another thread that performs network IO and epoll or other threads that
> perform CPU work. This leads to increased latency for processing, esp. in
> the case of data that's already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" requests by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.
>
> I have co-developed these changes with Christoph Hellwig, a whole lot of his
> fixes went into the first patch in the series (were squashed with his
> approval).
>
> I am going to post the perf report in a reply-to to this RFC.
>
> Christoph Hellwig (3):
>   documentation updates
>   move flags enforcement to vfs_preadv/vfs_pwritev
>   check for O_NONBLOCK in all read_iter instances
>
> Milosz Tanski (4):
>   Prepare for adding a new readv/writev with user flags.
>   Define new syscalls readv2,preadv2,writev2,pwritev2
>   Export new vector IO (with flags) to userland
>   O_NONBLOCK flag for readv2/preadv2
>
>  Documentation/filesystems/Locking |    4 +-
>  Documentation/filesystems/vfs.txt |    4 +-
>  arch/x86/syscalls/syscall_32.tbl  |    4 +
>  arch/x86/syscalls/syscall_64.tbl  |    4 +
>  drivers/target/target_core_file.c |    6 +-
>  fs/afs/internal.h                 |    2 +-
>  fs/afs/write.c                    |    4 +-
>  fs/aio.c                          |    4 +-
>  fs/block_dev.c                    |    9 ++-
>  fs/btrfs/file.c                   |    2 +-
>  fs/ceph/file.c                    |   10 ++-
>  fs/cifs/cifsfs.c                  |    9 ++-
>  fs/cifs/cifsfs.h                  |   12 ++-
>  fs/cifs/file.c                    |   30 +++++---
>  fs/ecryptfs/file.c                |    4 +-
>  fs/ext4/file.c                    |    4 +-
>  fs/fuse/file.c                    |   10 ++-
>  fs/gfs2/file.c                    |    5 +-
>  fs/nfs/file.c                     |   13 ++--
>  fs/nfs/internal.h                 |    4 +-
>  fs/nfsd/vfs.c                     |    4 +-
>  fs/ocfs2/file.c                   |   13 +++-
>  fs/pipe.c                         |    7 +-
>  fs/read_write.c                   |  146 +++++++++++++++++++++++++++++++------
>  fs/splice.c                       |    4 +-
>  fs/ubifs/file.c                   |    5 +-
>  fs/udf/file.c                     |    5 +-
>  fs/xfs/xfs_file.c                 |   12 ++-
>  include/linux/fs.h                |   16 ++--
>  include/linux/syscalls.h          |   12 +++
>  include/uapi/asm-generic/unistd.h |   10 ++-
>  mm/filemap.c                      |   34 +++++++--
>  mm/shmem.c                        |    6 +-
>  33 files changed, 306 insertions(+), 112 deletions(-)
>
> --
> 1.7.9.5
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 1/7] Prepare for adding a new readv/writev with user flags.
  2014-09-15 20:20   ` Milosz Tanski
  (?)
@ 2014-09-15 20:28   ` Al Viro
  2014-09-15 21:15       ` Christoph Hellwig
  -1 siblings, 1 reply; 167+ messages in thread
From: Al Viro @ 2014-09-15 20:28 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer

On Mon, Sep 15, 2014 at 04:20:17PM -0400, Milosz Tanski wrote:
> Plumbing the flags argument through the vfs code so it can be passed down to
> the __generic_file_(read/write)_iter functions that do the actual work.

NAK.  Put these flags into iocb, it'll be less noisy that way.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 1/7] Prepare for adding a new readv/writev with user flags.
  2014-09-15 20:28   ` Al Viro
@ 2014-09-15 21:15       ` Christoph Hellwig
  0 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-15 21:15 UTC (permalink / raw)
  To: Al Viro
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer

On Mon, Sep 15, 2014 at 09:28:28PM +0100, Al Viro wrote:
> On Mon, Sep 15, 2014 at 04:20:17PM -0400, Milosz Tanski wrote:
> > Plumbing the flags argument through the vfs code so it can be passed down to
> > the __generic_file_(read/write)_iter functions that do the actual work.
> 
> NAK.  Put these flags into iocb, it'll be less noisy that way.

Fine with me.  My initial prototype had it in the iov_iter type field
which is another possibility.

But if we get rid of the explicit flags field and make it less visible,
I'd really like to add a features field to struct file_operations where
instances can advertise that they support it (and other things like
actual AIO support in the future), so that we won't have a situation
like with AIO, where we can submit I/O but it might still block
anyway because lots of operations don't support it.
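
One way to picture the features-field idea is the small model below. It is
not kernel code: FopFeature, FileOps and check_supported are hypothetical
names made up for illustration, and a real field would live in struct
file_operations in C. The only point is that a caller could reject an
unsupported request up front rather than submit it and block anyway.

```python
import errno
from enum import IntFlag


class FopFeature(IntFlag):
    """Hypothetical capability bits a file_operations instance could advertise."""
    NONBLOCK_READ = 1 << 0   # read path honours the non-blocking request
    AIO = 1 << 1             # real asynchronous I/O, not a blocking fallback


class FileOps:
    """Toy stand-in for struct file_operations with an added features field."""

    def __init__(self, name, features=FopFeature(0)):
        self.name = name
        self.features = features


def check_supported(fops, wanted):
    """Return 0 if every requested feature is advertised, else -EOPNOTSUPP."""
    if (fops.features & wanted) == wanted:
        return 0
    return -errno.EOPNOTSUPP


# Two illustrative instances: one advertises non-blocking reads, one nothing.
cached_fs = FileOps("cached_fs", FopFeature.NONBLOCK_READ)
legacy_fs = FileOps("legacy_fs")
```

With this shape, check_supported(legacy_fs, FopFeature.NONBLOCK_READ) fails
before any I/O is attempted, which is the property asked for above.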

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 6/7] move flags enforcement to vfs_preadv/vfs_pwritev
  2014-09-15 20:21 ` [PATCH 6/7] move flags enforcement to vfs_preadv/vfs_pwritev Christoph Hellwig
@ 2014-09-15 21:15     ` Christoph Hellwig
  0 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-15 21:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer

This should simply be folded into patch 2.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 20:20 ` Milosz Tanski
                   ` (8 preceding siblings ...)
  (?)
@ 2014-09-15 21:33 ` Andreas Dilger
  2014-09-15 22:13     ` Milosz Tanski
  2014-09-15 22:36     ` Elliott, Robert (Server Storage)
  -1 siblings, 2 replies; 167+ messages in thread
From: Andreas Dilger @ 2014-09-15 21:33 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer

On Sep 15, 2014, at 2:20 PM, Milosz Tanski <milosz@adfin.com> wrote:

> This patchset introduces the ability to perform a non-blocking read
> from regular files in buffered IO mode. This works only for those
> filesystems that have data in the page cache.
> 
> It does this by introducing new syscalls readv2/writev2
> and preadv2/pwritev2. These new syscalls behave like the network sendmsg,
> recvmsg syscalls that accept an extra flag argument (O_NONBLOCK).

It's too bad that we are introducing yet another new read/write
syscall pair that only allows IO into discontiguous memory regions,
but does not allow a single call to access discontiguous file regions
(i.e. specify a separate file offset for each iov).

Adding syscalls similar to preadv/pwritev() that could take an iovec
that specified the file offset+length in addition to the memory address
would allow efficient scatter-gather IO in a single syscall.  While
that is less critical for local filesystems with small syscall latency,
it is more important for network filesystems, or in the case of
NVRAM-backed filesystems.

Cheers, Andreas
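
For comparison, here is a minimal sketch of what userspace has to do today to
read discontiguous file regions: one positioned read per extent. read_extents
is a hypothetical helper name, and the loop is only a stand-in for the single
scatter-gather syscall suggested above, which is not part of this patchset.

```python
import os


def read_extents(fd, extents):
    """Read each (file_offset, length) extent with its own pread() call.

    A syscall that accepted a file offset per iovec entry could service the
    whole list in one kernel entry; here it costs len(extents) syscalls.
    """
    chunks = []
    for file_offset, length in extents:
        chunks.append(os.pread(fd, length, file_offset))
    return chunks
```

The per-call cost is small on a local filesystem, but each extra round trip
is what hurts on the network and NVRAM-backed filesystems mentioned above.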

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 1/7] Prepare for adding a new readv/writev with user flags.
  2014-09-15 21:15       ` Christoph Hellwig
@ 2014-09-15 21:44         ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 21:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, LKML, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

I'll redo with the flag inside of kiocb for the next submission.

On Mon, Sep 15, 2014 at 5:15 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Sep 15, 2014 at 09:28:28PM +0100, Al Viro wrote:
>> On Mon, Sep 15, 2014 at 04:20:17PM -0400, Milosz Tanski wrote:
>> > Plumbing the flags argument through the vfs code so it can be passed down to
>> > the __generic_file_(read/write)_iter functions that do the actual work.
>>
>> NAK.  Put these flags into iocb, it'll be less noisy that way.
>
> Fine with me.  My initial prototype had it in the iov_iter type field
> which is another possibility.
>
> But if we get rid of the explicit flags field and make it less visible,
> I'd really like to add a features field to struct file_operations where
> instances can advertise that they support it (and other things like
> actual AIO support in the future), so that we won't have a situation
> like with AIO, where we can submit I/O but it might still block
> anyway because lots of operations don't support it.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 6/7] move flags enforcement to vfs_preadv/vfs_pwritev
  2014-09-15 21:15     ` Christoph Hellwig
@ 2014-09-15 21:45       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 21:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer

Will fix this for the next submission.

On Mon, Sep 15, 2014 at 5:15 PM, Christoph Hellwig <hch@infradead.org> wrote:
> This should simply be folded into patch 2.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-15 21:58   ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-15 21:58 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, michael.kerrisk

Hi, Milosz,

I CC'd Michael Kerrisk, in case he has any opinions on the matter.

Milosz Tanski <milosz@adfin.com> writes:

> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
>
> It does this by introducing new syscalls readv2/writev2 and
> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
> syscalls that accept an extra flag argument (O_NONBLOCK).

I thought you were going to introduce a new flag instead of using
O_NONBLOCK for this.  I dug up an old email that suggested that enabling
O_NONBLOCK for regular files (well, a device node in this case) broke a
cd ripping or burning application.  I also found this old bugzilla,
which states that squid would fail to start, and that gqview was also
broken:
  https://bugzilla.redhat.com/show_bug.cgi?id=136057

More generally, do you expect the open(2) of a regular file with
O_NONBLOCK to perform the same way as a pipe, fifo, or device (namely,
that the open itself won't block)?  Should O_NONBLOCK affect writes to
regular files?  What do you think the return value from poll and friends
should be when a file is opened in this manner (probably not important,
as poll always returns data ready on regular files)?  Also consider
whether you want the O_NONBLOCK behaviour for mandatory file locks in
your use case (or any other, for that matter).  If you issue a read and
it returns -EAGAIN, should it be up to the application to kick off I/O
to ensure it makes progress?

I don't think O_NONBLOCK is the right flag.  What you're really
specifying is a flag that prevents I/O in the read path, and nowhere
else.  As such, I'd feel much better about this if we defined a new flag
(O_NONBLOCK_READ maybe?  No, that's too verbose.).

In summary, I like the idea, but I worry about overloading O_NONBLOCK.

Cheers,
Jeff
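
The read-path-only flag and the -EAGAIN question above can be sketched from
userspace as follows. This is a sketch under assumptions, not the interface
in this RFC: it uses os.preadv with os.RWF_NOWAIT, a separate per-call flag
available on later Linux kernels, where the RFC passes O_NONBLOCK to preadv2.
try_cached_read, read_at and io_pool are hypothetical names. On a miss the
application itself queues the blocking read, so it is the application that
makes sure progress happens.

```python
import errno
import os
from concurrent.futures import ThreadPoolExecutor

# Assumption: os.RWF_NOWAIT stands in for the flag discussed in this thread.
# It may be missing on this platform, in which case every read is deferred.
NOWAIT = getattr(os, "RWF_NOWAIT", None)
UNSUPPORTED = {errno.EOPNOTSUPP, errno.EINVAL, errno.ENOSYS}

io_pool = ThreadPoolExecutor(max_workers=4)  # the usual blocking-IO workers


def try_cached_read(fd, length, offset):
    """Return bytes served without blocking, or None if nothing was available."""
    if NOWAIT is None or not hasattr(os, "preadv"):
        return None
    buf = bytearray(length)
    try:
        nread = os.preadv(fd, [buf], offset, NOWAIT)
    except BlockingIOError:
        return None  # -EAGAIN: the kernel would have had to do I/O
    except OSError as err:
        if err.errno in UNSUPPORTED:
            return None  # filesystem or kernel cannot honour the flag
        raise
    return bytes(buf[:nread])


def read_at(fd, length, offset):
    """Try the fast path first; the application queues any remainder itself."""
    data = try_cached_read(fd, length, offset) or b""
    if len(data) < length:
        # Short or missed read: hand the rest to a worker that may block.
        # A real event loop would wait on the future instead of blocking here.
        rest = io_pool.submit(os.pread, fd, length - len(data), offset + len(data))
        data += rest.result()
    return data
```

Because the flag is passed per call, nothing about open(2), writes, poll or
file locks changes for the descriptor, which sidesteps the O_NONBLOCK
overloading concern raised above.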

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 21:33 ` Andreas Dilger
@ 2014-09-15 22:13     ` Milosz Tanski
  2014-09-15 22:36     ` Elliott, Robert (Server Storage)
  1 sibling, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 22:13 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

Like you, Andreas, I would like to see a syscall that lets you pass
vectored positions (along with buffers and lengths). However, that's
not the problem I'm trying to solve with this patchset, which is
non-blocking reads for filesystem fds. The vectored-position read
call(s) deserve a separate submission for a number of the usual reasons.

Best,
- Milosz

On Mon, Sep 15, 2014 at 5:33 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Sep 15, 2014, at 2:20 PM, Milosz Tanski <milosz@adfin.com> wrote:
>
>> This patchset introduces the ability to perform a non-blocking read
>> from regular files in buffered IO mode. This works only for those
>> filesystems that have data in the page cache.
>>
>> It does this by introducing new syscalls readv2/writev2
>> and preadv2/pwritev2. These new syscalls behave like the network sendmsg,
>> recvmsg syscalls that accept an extra flag argument (O_NONBLOCK).
>
> It's too bad that we are introducing yet another new read/write
> syscall pair that only allow IO into discontiguous memory regions,
> but do not allow a single call to access discontiguous file regions
> (i.e. specify a separate file offset for each iov).
>
> Adding syscalls similar to preadv/pwritev() that could take a iovec
> that specified the file offset+length in addition to the memory address
> would allow efficient scatter-gather IO in a single syscall.  While
> that is less critical for local filesystems with small syscall latency,
> it is more important for network filesystems, or in the case of
> NVRAM-backed filesystems.
>
> Cheers, Andreas
>
>> It's a very common pattern today (samba, libuv, etc.) to use a large
>> threadpool to perform buffered IO operations. They submit the work
>> from another thread that performs network IO and epoll or other threads
>> that perform CPU work. This leads to increased latency for processing,
>> esp. in the case of data that's already cached in the page cache.
>>
>> With the new interface the applications will now be able to fetch the
>> data in their network / cpu bound thread(s) and only defer to a
>> threadpool if it's not there. In our own application (VLDB) we've
>> observed a decrease in latency for "fast" requests by avoiding unnecessary
>> queuing and having to swap out current tasks in IO bound work threads.
>>
>> I have co-developed these changes with Christoph Hellwig; a whole lot
>> of his fixes went into the first patch in the series (they were squashed
>> with his approval).
>>
>> I am going to post the perf report in a reply to this RFC.
>>
>> Christoph Hellwig (3):
>>  documentation updates
>>  move flags enforcement to vfs_preadv/vfs_pwritev
>>  check for O_NONBLOCK in all read_iter instances
>>
>> Milosz Tanski (4):
>>  Prepare for adding a new readv/writev with user flags.
>>  Define new syscalls readv2,preadv2,writev2,pwritev2
>>  Export new vector IO (with flags) to userland
>>  O_NONBLOCK flag for readv2/preadv2
>>
>> Documentation/filesystems/Locking |    4 +-
>> Documentation/filesystems/vfs.txt |    4 +-
>> arch/x86/syscalls/syscall_32.tbl  |    4 +
>> arch/x86/syscalls/syscall_64.tbl  |    4 +
>> drivers/target/target_core_file.c |    6 +-
>> fs/afs/internal.h                 |    2 +-
>> fs/afs/write.c                    |    4 +-
>> fs/aio.c                          |    4 +-
>> fs/block_dev.c                    |    9 ++-
>> fs/btrfs/file.c                   |    2 +-
>> fs/ceph/file.c                    |   10 ++-
>> fs/cifs/cifsfs.c                  |    9 ++-
>> fs/cifs/cifsfs.h                  |   12 ++-
>> fs/cifs/file.c                    |   30 +++++---
>> fs/ecryptfs/file.c                |    4 +-
>> fs/ext4/file.c                    |    4 +-
>> fs/fuse/file.c                    |   10 ++-
>> fs/gfs2/file.c                    |    5 +-
>> fs/nfs/file.c                     |   13 ++--
>> fs/nfs/internal.h                 |    4 +-
>> fs/nfsd/vfs.c                     |    4 +-
>> fs/ocfs2/file.c                   |   13 +++-
>> fs/pipe.c                         |    7 +-
>> fs/read_write.c                   |  146 +++++++++++++++++++++++++++++++------
>> fs/splice.c                       |    4 +-
>> fs/ubifs/file.c                   |    5 +-
>> fs/udf/file.c                     |    5 +-
>> fs/xfs/xfs_file.c                 |   12 ++-
>> include/linux/fs.h                |   16 ++--
>> include/linux/syscalls.h          |   12 +++
>> include/uapi/asm-generic/unistd.h |   10 ++-
>> mm/filemap.c                      |   34 +++++++--
>> mm/shmem.c                        |    6 +-
>> 33 files changed, 306 insertions(+), 112 deletions(-)
>>
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> Cheers, Andreas
>
>
>
>
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 21:58   ` Jeff Moyer
@ 2014-09-15 22:27     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-15 22:27 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, michael.kerrisk

Jeff,

This patchset creates new read syscalls (readv2/preadv2) that take
an extra flag argument (kind of like recvmsg). What it doesn't do is
change the current behavior of O_NONBLOCK when the file is
open()ed with the O_NONBLOCK flag. It shouldn't break any existing
applications, since you have to opt in by using the new
syscalls.
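
A minimal sketch of the opt-in pattern from the application's side. It
assumes the interface as it exists in later mainline kernels, preadv2()
with the RWF_NOWAIT flag, rather than the O_NONBLOCK value used in this
RFC. try_cached_read() and read_with_fallback() are hypothetical helper
names, and the blocking pread() stands in for handing the request to a
thread pool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for preadv2() */
#endif
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008	/* value from the Linux uapi headers */
#endif

/*
 * Fast path: only succeeds if the data can be returned without
 * blocking. Returns bytes read, or -1 with errno set (EAGAIN when the
 * data is not cached, other values when the flag is unsupported here).
 */
ssize_t try_cached_read(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return preadv2(fd, &iov, 1, off, RWF_NOWAIT);
}

/*
 * Try the fast path first; on any failure defer to the ordinary
 * blocking path, which also reports genuine errors.
 */
ssize_t read_with_fallback(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n = try_cached_read(fd, buf, len, off);

	if (n >= 0)
		return n;
	/* A real server would queue the request on its thread pool here. */
	return pread(fd, buf, len, off);
}
```

Existing callers of read()/pread() never reach the new code path, which
is why nothing changes unless an application opts in.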

I don't have a preference either way on whether we should create a new
flag or re-use the O_NONBLOCK flag. Instead, I'm hoping to get some
consensus here from senior kernel developers like yourself. Maybe
RWF_NONBLOCK (I'm stealing from eventfd's EFD_NONBLOCK).

As a side note, I noticed that EFD_NONBLOCK, SFD_NONBLOCK, etc. all
alias the value of O_NONBLOCK, and there are a bunch of build-time
checks in the code like this:
BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK);

Thanks,
- Milosz

On Mon, Sep 15, 2014 at 5:58 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Hi, Milosz,
>
> I CC'd Michael Kerrisk, in case he has any opinions on the matter.
>
> Milosz Tanski <milosz@adfin.com> writes:
>
>> This patchset introduces the ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works only for those filesystems
>> that have data in the page cache.
>>
>> It does this by introducing new syscalls readv2/writev2 and
>> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
>> syscalls that accept an extra flag argument (O_NONBLOCK).
>
> I thought you were going to introduce a new flag instead of using
> O_NONBLOCK for this.  I dug up an old email that suggested that enabling
> O_NONBLOCK for regular files (well, a device node in this case) broke a
> cd ripping or burning application.  I also found this old bugzilla,
> which states that squid would fail to start, and that gqview was also
> broken:
>   https://bugzilla.redhat.com/show_bug.cgi?id=136057
>
> More generally, do you expect the open(2) of a regular file with
> O_NONBLOCK to perform the same way as a pipe, fifo, or device (namely,
> that the open itself won't block)?  Should O_NONBLOCK affect writes to
> regular files?  What do you think the return value from poll and friends
> should be when a file is opened in this manner (probably not important,
> as poll always returns data ready on regular files)?  Also consider
> whether you want the O_NONBLOCK behaviour for mandatory file locks in
> your use case (or any other, for that matter).  If you issue a read and
> it returns -EAGAIN, should it be up to the application to kick off I/O
> to ensure it makes progress?
>
> I don't think O_NONBLOCK is the right flag.  What you're really
> specifying is a flag that prevents I/O in the read path, and nowhere
> else.  As such, I'd feel much better about this if we defined a new flag
> (O_NONBLOCK_READ maybe?  No, that's too verbose.).
>
> In summary, I like the idea, but I worry about overloading O_NONBLOCK.
>
> Cheers,
> Jeff



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* RE: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 21:33 ` Andreas Dilger
@ 2014-09-15 22:36     ` Elliott, Robert (Server Storage)
  2014-09-15 22:36     ` Elliott, Robert (Server Storage)
  1 sibling, 0 replies; 167+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-09-15 22:36 UTC (permalink / raw)
  To: Andreas Dilger, Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer



> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Andreas Dilger
> Sent: Monday, 15 September, 2014 4:34 PM
> To: Milosz Tanski
> Cc: linux-kernel@vger.kernel.org; Christoph Hellwig; linux-
> fsdevel@vger.kernel.org; linux-aio@kvack.org; Mel Gorman; Volker Lendecke;
> Tejun Heo; Jeff Moyer
> Subject: Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
> 
> On Sep 15, 2014, at 2:20 PM, Milosz Tanski <milosz@adfin.com> wrote:
> 
> > This patchset introduces the ability to perform a non-blocking read
> > from regular files in buffered IO mode. This works only for those
> > filesystems that have data in the page cache.
> >
> > It does this by introducing new syscalls readv2/writev2
> > and preadv2/pwritev2. These new syscalls behave like the network sendmsg,
> > recvmsg syscalls that accept an extra flag argument (O_NONBLOCK).
> 
> It's too bad that we are introducing yet another new read/write
> syscall pair that only allow IO into discontiguous memory regions,
> but do not allow a single call to access discontiguous file regions
> (i.e. specify a separate file offset for each iov).
> 
> Adding syscalls similar to preadv/pwritev() that could take a iovec
> that specified the file offset+length in addition to the memory address
> would allow efficient scatter-gather IO in a single syscall.  While
> that is less critical for local filesystems with small syscall latency,
> it is more important for network filesystems, or in the case of
> NVRAM-backed filesystems.
> 
> Cheers, Andreas

That sounds like the proposed WRITE SCATTERED/READ GATHERED
commands for SCSI (which are related to, but not necessarily
tied to, atomic writes).  We discussed them a bit at
LSF-MM 2013 - see http://lwn.net/Articles/548116/.


---
Rob Elliott    HP Server Storage






^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 22:27     ` Milosz Tanski
@ 2014-09-16 13:44       ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 13:44 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, michael.kerrisk

Milosz Tanski <milosz@adfin.com> writes:

> Jeff,
>
> This patchset creates new read syscalls (readv2/preadv2) that take
> an extra flag argument (kind of like recvmsg). What it doesn't do is
> change the current behavior of O_NONBLOCK when the file is
> open()ed with the O_NONBLOCK flag. It shouldn't break any existing
> applications, since you have to opt in by using the new
> syscalls.

Hi, Milosz,

Ah, I misread one of the patches.  Now that I've applied the series, I
see that you're testing the flag argument, not the file open flags.

> I don't have a preference either way on whether we should create a new
> flag or re-use the O_NONBLOCK flag. Instead, I'm hoping to get some
> consensus here from senior kernel developers like yourself. Maybe
> RWF_NONBLOCK (I'm stealing from eventfd's EFD_NONBLOCK).

I think I'd rather name the flag something other than O_NONBLOCK, if for
no other reason than to avoid confusion.

> As a side note, I noticed that EFD_NONBLOCK, SFD_NONBLOCK, etc... all
> alias to the value of O_NONBLOCK and there's a bunch of bug checks in
> the code like this:
> BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK);

That's because the flag is passed on to anon_inode_getfile.  See also
this define:

#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
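
The same aliasing can be observed from userspace. In the sketch below
(eventfd_inherits_nonblock() is a made-up helper name), a compile-time
check mirrors the kernel's BUILD_BUG_ON, and fcntl(F_GETFL) shows that
EFD_NONBLOCK ends up as O_NONBLOCK in the file status flags of the
resulting descriptor:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Userspace mirror of BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK):
 * compilation fails if the two constants differ on this platform. */
_Static_assert(EFD_NONBLOCK == O_NONBLOCK,
	       "EFD_NONBLOCK is expected to alias O_NONBLOCK");

/*
 * Returns 1 if an eventfd created with EFD_NONBLOCK reports O_NONBLOCK
 * in its file status flags, 0 if it does not, -1 on error.
 */
int eventfd_inherits_nonblock(void)
{
	int fd = eventfd(0, EFD_NONBLOCK);
	int fl;

	if (fd < 0)
		return -1;
	fl = fcntl(fd, F_GETFL);
	close(fd);
	if (fl < 0)
		return -1;
	return (fl & O_NONBLOCK) != 0;
}
```

A per-call read flag is never stored in file->f_flags this way, so it
has no such constraint and could take any name and value.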

A general note on your subjects -- you should make them more specific to
the subsystem you're updating.  Commit titles like "documentation
update" are a bit too broad.  ;-)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 22:36     ` Elliott, Robert (Server Storage)
@ 2014-09-16 18:24       ` Zach Brown
  -1 siblings, 0 replies; 167+ messages in thread
From: Zach Brown @ 2014-09-16 18:24 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Andreas Dilger, Milosz Tanski, linux-kernel, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer

On Mon, Sep 15, 2014 at 10:36:46PM +0000, Elliott, Robert (Server Storage) wrote:
> 
> 
> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> > owner@vger.kernel.org] On Behalf Of Andreas Dilger
> > Sent: Monday, 15 September, 2014 4:34 PM
> > To: Milosz Tanski
> > Cc: linux-kernel@vger.kernel.org; Christoph Hellwig; linux-
> > fsdevel@vger.kernel.org; linux-aio@kvack.org; Mel Gorman; Volker Lendecke;
> > Tejun Heo; Jeff Moyer
> > Subject: Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
> > 
> > On Sep 15, 2014, at 2:20 PM, Milosz Tanski <milosz@adfin.com> wrote:
> > 
> > > This patchset introduces the ability to perform a non-blocking read
> > > from regular files in buffered IO mode. This works only for those
> > > filesystems that have data in the page cache.
> > >
> > > It does this by introducing new syscalls readv2/writev2
> > > and preadv2/pwritev2. These new syscalls behave like the network sendmsg,
> > > recvmsg syscalls that accept an extra flag argument (O_NONBLOCK).
> > 
> > It's too bad that we are introducing yet another new read/write
> > syscall pair that only allow IO into discontiguous memory regions,
> > but do not allow a single call to access discontiguous file regions
> > (i.e. specify a separate file offset for each iov).
> > 
> > Adding syscalls similar to preadv/pwritev() that could take a iovec
> > that specified the file offset+length in addition to the memory address
> > would allow efficient scatter-gather IO in a single syscall.  While
> > that is less critical for local filesystems with small syscall latency,
> > it is more important for network filesystems, or in the case of
> > NVRAM-backed filesystems.
> > 
> > Cheers, Andreas
> 
> That sounds like the proposed WRITE SCATTERED/READ GATHERED
> commands for SCSI (which are related to, but not necessarily
> tied to, atomic writes).  We discussed them a bit at
> LSF-MM 2013 - see http://lwn.net/Articles/548116/.

It's the old {read,write}x proposals:

http://www.mcs.anl.gov/uploads/cels/papers/TM-302-FINAL.pdf

- z

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 4/7] O_NONBLOCK flag for readv2/preadv2
  2014-09-15 20:21   ` Milosz Tanski
@ 2014-09-16 19:19     ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 19:19 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

Milosz Tanski <milosz@adfin.com> writes:

> Filesystems that use generic_file_read_iter will now be allowed to perform
> non-blocking reads. This will only read data if it's in the page cache and if
> there is no page error (causing a re-read).
>
> Signed-off-by: Milosz Tanski <milosz@adfin.com>

> @@ -1662,6 +1676,10 @@ no_cached_page:
>  			goto out;
>  		}
>  		goto readpage;
> +
> +would_block:
> +		error = -EAGAIN;
> +		goto out;
>  	}

Why did you put the would_block label inside the loop?  That should be
pushed down to just above out, and then you can get rid of the goto.

Other than that, it looks like you put the check in all the right places
in that function.

>  out:
> @@ -1697,6 +1715,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
>  		size_t count = iov_iter_count(iter);
>  		loff_t size;
>  
> +		if (flags & O_NONBLOCK)
> +			return -EAGAIN;
> +

If a program is attempting non-blocking reads on a file opened with
O_DIRECT, I think returning -EAGAIN is very misleading.  Better to
return -EINVAL in this case, and maybe check that earlier in the stack?
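
A minimal sketch of such an up-front check, written as a standalone
userspace function. check_read_flags() is a hypothetical helper invented
for this illustration and does not appear in the patch; the assumption
that O_NONBLOCK is the only accepted per-call flag is taken from the
series description.

```c
#define _GNU_SOURCE		/* for O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: validate the per-call flags against the flags the
 * file was opened with, before any read path is entered.
 * Returns 0 if the combination is acceptable, or a negative errno.
 */
int check_read_flags(int open_flags, int rw_flags)
{
	/* Only O_NONBLOCK is understood as a per-call flag in this sketch. */
	if (rw_flags & ~O_NONBLOCK)
		return -EINVAL;

	/*
	 * A page-cache-only read cannot succeed on an O_DIRECT file, so
	 * retrying would never help: report EINVAL rather than EAGAIN.
	 */
	if ((rw_flags & O_NONBLOCK) && (open_flags & O_DIRECT))
		return -EINVAL;

	return 0;
}
```

With the check done once near the syscall entry point, the individual
read_iter implementations would only ever see a valid combination and
could keep -EAGAIN for the genuine "not cached yet" case.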

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-15 20:20   ` Milosz Tanski
@ 2014-09-16 19:20     ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 19:20 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

Milosz Tanski <milosz@adfin.com> writes:

> New syscalls with an extra flag argument. For now all flags except for 0 are
> not supported.

The blatant copy-n-paste of the vectored functions bothers me.  I'll
withhold judgement until I've seen your next version of the patch series,
though.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-15 20:22   ` Christoph Hellwig
@ 2014-09-16 19:27     ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 19:27 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

Christoph Hellwig <milosz@adfin.com> writes:

Hrm, you're not Christoph...

> Acked-by: Milosz Tanski <milosz@adfin.com>
> ---
>  fs/ceph/file.c    |    2 ++
>  fs/cifs/file.c    |    6 ++++++
>  fs/nfs/file.c     |    5 ++++-
>  fs/ocfs2/file.c   |    6 ++++++
>  fs/pipe.c         |    3 ++-
>  fs/read_write.c   |   17 +++++++++++------
>  fs/xfs/xfs_file.c |    4 ++++
>  mm/shmem.c        |    4 ++++
>  8 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 4776257..b62e3a5 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -808,6 +808,8 @@ again:
>  	if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
>  	    (iocb->ki_filp->f_flags & O_DIRECT) ||
>  	    (fi->flags & CEPH_F_SYNC)) {
> +		if (flags & O_NONBLOCK)
> +			return -EAGAIN;

Again, the right return value for the O_DIRECT case is EINVAL.

> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 4072f3a..116bed2 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -170,8 +170,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to, int flags)
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t result;
>  
> -	if (iocb->ki_filp->f_flags & O_DIRECT)
> +	if (iocb->ki_filp->f_flags & O_DIRECT) {
> +		if (flags & O_NONBLOCK)
> +			return -EAGAIN;
>  		return nfs_file_direct_read(iocb, to, iocb->ki_pos, true);
> +	}

And here.  I stopped looking for that case at this point.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-16 19:30   ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 19:30 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

Milosz Tanski <milosz@adfin.com> writes:

> This patchset introduces an ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
>
> It does this by introducing new syscalls readv2/writev2 and
> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
> syscalls that accept an extra flag argument (O_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool to
> perform buffered IO operations. They submit the work from another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" requests by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.
>
> I have co-developed these changes with Christoph Hellwig, a whole lot of his
> fixes went into the first patch in the series (were squashed with his
> approval).
>
> I am going to post the perf report in a reply-to to this RFC.

You can send the performance data along with the patch series, no need to
separate it off in a reply.

One additional patch I'd like to see is a man page update.  That would
help clarify exactly what you're trying to accomplish.
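
As a rough usage sketch of the pattern in the cover letter (try the page
cache first, defer only on a miss), assuming an interface that differs
from this series: the code below uses preadv2() with RWF_NOWAIT as
provided by later mainline kernels and glibc, not the readv2/O_NONBLOCK
form proposed here. read_cached_or_defer() and blocking_read() are
hypothetical names, and the "thread pool" is reduced to a plain blocking
pread() so the example stays self-contained.

```c
#define _GNU_SOURCE		/* for preadv2() */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008	/* value from the Linux UAPI headers */
#endif

enum read_path { READ_FAST, READ_DEFERRED };

/* Stand-in for queuing the request to a thread pool: just block here. */
ssize_t blocking_read(int fd, void *buf, size_t len, off_t off)
{
	return pread(fd, buf, len, off);
}

/*
 * Try to satisfy the read from the page cache without blocking. If the
 * data is not cached (EAGAIN), or the kernel or filesystem cannot do
 * non-blocking reads (EOPNOTSUPP, ENOSYS), fall back to the slow path.
 * *path reports which route was taken. A fast-path result may be a short
 * read if only part of the range was cached.
 */
ssize_t read_cached_or_defer(int fd, void *buf, size_t len, off_t off,
			     enum read_path *path)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

	if (n >= 0) {
		*path = READ_FAST;
		return n;
	}
	if (errno != EAGAIN && errno != EOPNOTSUPP && errno != ENOSYS)
		return -1;	/* real error: report it to the caller */

	*path = READ_DEFERRED;
	return blocking_read(fd, buf, len, off);
}
```

In an event loop the deferred branch would enqueue the request and return
to epoll instead of blocking, so only cache misses pay the queuing cost.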

I look forward to v2!

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 4/7] O_NONBLOCK flag for readv2/preadv2
  2014-09-16 19:19     ` Jeff Moyer
@ 2014-09-16 19:44       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-16 19:44 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

On Tue, Sep 16, 2014 at 3:19 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Milosz Tanski <milosz@adfin.com> writes:
>
>> Filesystems that use generic_file_read_iter will now be allowed to perform
>> non-blocking reads. This will only read data if it's in the page cache and if
>> there is no page error (causing a re-read).
>>
>> Signed-off-by: Milosz Tanski <milosz@adfin.com>
>
>> @@ -1662,6 +1676,10 @@ no_cached_page:
>>                       goto out;
>>               }
>>               goto readpage;
>> +
>> +would_block:
>> +             error = -EAGAIN;
>> +             goto out;
>>       }
>
> Why did you put the would_block label inside the loop?  That should be
> pushed down to just above out, and then you can get rid of the goto.

When I put the code outside the loop it actually looked worse (imo):

}

goto out;

would_block:
error = -EAGAIN;

out:
...

>
> Other than that, it looks like you put the check in all the right places
> in that function.
>
>>  out:
>> @@ -1697,6 +1715,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter, int flags)
>>               size_t count = iov_iter_count(iter);
>>               loff_t size;
>>
>> +             if (flags & O_NONBLOCK)
>> +                     return -EAGAIN;
>> +
>
> If a program is attempting non-blocking reads on a file opened with
> O_DIRECT, I think returning -EAGAIN is very misleading.  Better to
> return -EINVAL in this case, and maybe check that earlier in the stack?

Point taken and I can fix this for the next version further up the
stack. A longer term question is how the flags the file is opened with
interact with the read/write flags ... since I imagine folks will want
to add other flags (like forcing a SYNC write).

>
> Cheers,
> Jeff

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-16 19:27     ` Jeff Moyer
@ 2014-09-16 19:45       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-16 19:45 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

I am not Christoph, we collaborated and he sent me this patch.

On Tue, Sep 16, 2014 at 3:27 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Christoph Hellwig <milosz@adfin.com> writes:
>
> Hrm, you're not Christoph...
>
>> Acked-by: Milosz Tanski <milosz@adfin.com>
>> ---
>>  fs/ceph/file.c    |    2 ++
>>  fs/cifs/file.c    |    6 ++++++
>>  fs/nfs/file.c     |    5 ++++-
>>  fs/ocfs2/file.c   |    6 ++++++
>>  fs/pipe.c         |    3 ++-
>>  fs/read_write.c   |   17 +++++++++++------
>>  fs/xfs/xfs_file.c |    4 ++++
>>  mm/shmem.c        |    4 ++++
>>  8 files changed, 39 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 4776257..b62e3a5 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -808,6 +808,8 @@ again:
>>       if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
>>           (iocb->ki_filp->f_flags & O_DIRECT) ||
>>           (fi->flags & CEPH_F_SYNC)) {
>> +             if (flags & O_NONBLOCK)
>> +                     return -EAGAIN;
>
> Again, the right return value for the O_DIRECT case is EINVAL.
>
>> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
>> index 4072f3a..116bed2 100644
>> --- a/fs/nfs/file.c
>> +++ b/fs/nfs/file.c
>> @@ -170,8 +170,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to, int flags)
>>       struct inode *inode = file_inode(iocb->ki_filp);
>>       ssize_t result;
>>
>> -     if (iocb->ki_filp->f_flags & O_DIRECT)
>> +     if (iocb->ki_filp->f_flags & O_DIRECT) {
>> +             if (flags & O_NONBLOCK)
>> +                     return -EAGAIN;
>>               return nfs_file_direct_read(iocb, to, iocb->ki_pos, true);
>> +     }
>
> And here.  I stopped looking for that case at this point.
>
> Cheers,
> Jeff



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 4/7] O_NONBLOCK flag for readv2/preadv2
  2014-09-16 19:44       ` Milosz Tanski
@ 2014-09-16 19:53         ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 19:53 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

Milosz Tanski <milosz@adfin.com> writes:

>> Why did you put the would_block label inside the loop?  That should be
>> pushed down to just above out, and then you can get rid of the goto.
>
> When I put the code outside the loop it actually looked worse (imo):
>
> }
>
> goto out;
>
> would_block:
> error = -EAGAIN;
>
> out:
> ...
>

We don't exit the loop without a return or a goto, so you wouldn't need
that 'goto out' just below the end of the loop.  It would look like:

	}

would_block:
	error = -EAGAIN;
out:
...
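
For illustration only, the same label placement in a standalone function
that compiles on its own. This is not the kernel's read loop:
read_cached_pages() and its cached[] array are made-up stand-ins for the
page cache lookup, and the final return expression is an assumption
chosen to mimic the "bytes copied, else error" convention.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * cached[i] says whether page i is already in the cache. The loop is
 * only ever left through a goto, so would_block can sit directly above
 * out and fall through into it without an extra 'goto out'.
 * Returns the number of pages copied, or -EAGAIN if none could be copied
 * without blocking.
 */
long read_cached_pages(const int *cached, size_t nr_pages, int nonblock)
{
	long copied = 0;
	int error = 0;
	size_t i = 0;

	for (;;) {
		if (i == nr_pages)
			goto out;		/* everything was read */
		if (!cached[i]) {
			if (nonblock)
				goto would_block;
			/* A blocking read would start IO and wait here. */
		}
		copied++;
		i++;
	}

would_block:
	error = -EAGAIN;
out:
	return copied ? copied : error;
}
```

Because a partial result takes priority over the error, a non-blocking
caller that hits an uncached page after two cached ones gets 2 back and
sees -EAGAIN only when nothing at all was available.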

> Point taken and I can fix this for the next version further up the
> stack. A longer term question is how the flags the file is opened with
> interact with the read/write flags ... since I imagine folks will want
> to add other flags (like forcing a SYNC write).

I think we'll have to address those one at a time.  I do like the idea
of the SYNC flag for a write, though you'll probably have several
variants of that (equivalents of SYNC and DSYNC at least).  Another fun
write flag to consider is O_ATOMIC.  :)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-16 19:20     ` Jeff Moyer
@ 2014-09-16 19:54       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-16 19:54 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

Good point. I will put the shared functionality in a static function.
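
Roughly, the refactoring could take the following shape. This is a
userspace sketch under stated assumptions, not the code planned for v2:
sketch_readv(), sketch_readv2() and do_readv_common() are hypothetical
names, and rejecting non-zero flags with EINVAL is an assumption (the
patch description only says that flags other than 0 are not supported).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Shared body: both entry points funnel into this one static function. */
static ssize_t do_readv_common(int fd, const struct iovec *iov, int cnt,
			       int flags)
{
	if (flags != 0) {	/* no flags are supported yet in this sketch */
		errno = EINVAL;
		return -1;
	}
	return readv(fd, iov, cnt);
}

/* Old-style entry point: no flags argument, behaves as flags == 0. */
ssize_t sketch_readv(int fd, const struct iovec *iov, int cnt)
{
	return do_readv_common(fd, iov, cnt, 0);
}

/* New-style entry point: same semantics plus a per-call flags argument. */
ssize_t sketch_readv2(int fd, const struct iovec *iov, int cnt, int flags)
{
	return do_readv_common(fd, iov, cnt, flags);
}
```

The two wrappers then differ only in where the flags value comes from,
so a later flag only needs to be handled in one place.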

On Tue, Sep 16, 2014 at 3:20 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Milosz Tanski <milosz@adfin.com> writes:
>
>> New syscalls with an extra flag argument. For now all flags except for 0 are
>> not supported.
>
> The blatant copy-n-paste of the vectored functions bothers me.  I'll
> withhold judgement until I've seen your next version of the patch series,
> though.
>
> Cheers,
> Jeff



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-16 19:30   ` Jeff Moyer
@ 2014-09-16 20:34     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-16 20:34 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

Jeff,

Which git repository do the man pages live in?

- Milosz

On Tue, Sep 16, 2014 at 3:30 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Milosz Tanski <milosz@adfin.com> writes:
>
>> This patchset introduces an ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works only for those filesystems
>> that have data in the page cache.
>>
>> It does this by introducing new syscalls readv2/writev2 and
>> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
>> syscalls that accept an extra flag argument (O_NONBLOCK).
>>
>> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool to
>> perform buffered IO operations. They submit the work from another thread
>> that performs network IO and epoll or other threads that perform CPU work. This
>> leads to increased latency for processing, esp. in the case of data that's
>> already cached in the page cache.
>>
>> With the new interface the applications will now be able to fetch the data in
>> their network / cpu bound thread(s) and only defer to a threadpool if it's not
>> there. In our own application (VLDB) we've observed a decrease in latency for
>> "fast" requests by avoiding unnecessary queuing and having to swap out current
>> tasks in IO bound work threads.
>>
>> I have co-developed these changes with Christoph Hellwig, a whole lot of his
>> fixes went into the first patch in the series (were squashed with his
>> approval).
>>
>> I am going to post the perf report in a reply-to to this RFC.
>
> You can send the performance data along with the patch series, no need to
> separate it off in a reply.
>
> One additional patch I'd like to see is a man page update.  That would
> help clarify exactly what you're trying to accomplish.
>
> I look forward to v2!
>
> Cheers,
> Jeff



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-16 20:34     ` Milosz Tanski
@ 2014-09-16 20:49       ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 20:49 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

Milosz Tanski <milosz@adfin.com> writes:

> Jeff,
>
> Which git repository do the man pages live in?

Hi, Milosz,

The download page is here:
https://www.kernel.org/doc/man-pages/download.html

And the git clone command is:
$ git clone http://git.kernel.org/pub/scm/docs/man-pages/man-pages

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-16 19:20     ` Jeff Moyer
@ 2014-09-16 21:03       ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-16 21:03 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo

On Tue, Sep 16, 2014 at 03:20:32PM -0400, Jeff Moyer wrote:
> Milosz Tanski <milosz@adfin.com> writes:
> 
> > New syscalls with an extra flag argument. For now all flags except for 0 are
> > not supported.
> 
> The blatant copy-n-paste of the vectored functions bothers me.  I'll
> withhold judgement until I've seen your next version of the patch series,
> though.

We could actually implement the non-flags ones based on the flagged ones
at the syscall level, similar to how recv is implemented on top of
recvfrom.
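
As a rough userspace sketch of that layering (not the kernel code): the
flagged call is the only real implementation, and the plain call forwards
to it with flags == 0.  The names sketch_readv(), sketch_readv2() and
SKETCH_NONBLOCK are made up for illustration, and returning EINVAL for
unsupported flags is an assumption, not something this series specifies.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical flag bit, for illustration only. */
#define SKETCH_NONBLOCK 0x1

/*
 * Flagged entry point: the single real implementation.
 * Assumption: any non-zero flag is rejected with EINVAL, mirroring
 * "for now all flags except for 0 are not supported".
 */
static ssize_t sketch_readv2(int fd, const struct iovec *iov, int iovcnt,
			     int flags)
{
	if (flags != 0) {
		errno = EINVAL;
		return -1;
	}
	return readv(fd, iov, iovcnt);
}

/* Non-flagged entry point: a thin wrapper, like recv() over recvfrom(). */
static ssize_t sketch_readv(int fd, const struct iovec *iov, int iovcnt)
{
	return sketch_readv2(fd, iov, iovcnt, 0);
}
```

With this shape there is only one code path to audit, and the old entry
point cannot drift away from the new one.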


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-16 19:27     ` Jeff Moyer
@ 2014-09-16 21:04       ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-16 21:04 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo

On Tue, Sep 16, 2014 at 03:27:41PM -0400, Jeff Moyer wrote:
> Christoph Hellwig <milosz@adfin.com> writes:
> 
> Hrm, you're not Christoph...
> 
> > Acked-by: Milosz Tanski <milosz@adfin.com>
> > ---
> >  fs/ceph/file.c    |    2 ++
> >  fs/cifs/file.c    |    6 ++++++
> >  fs/nfs/file.c     |    5 ++++-
> >  fs/ocfs2/file.c   |    6 ++++++
> >  fs/pipe.c         |    3 ++-
> >  fs/read_write.c   |   17 +++++++++++------
> >  fs/xfs/xfs_file.c |    4 ++++
> >  mm/shmem.c        |    4 ++++
> >  8 files changed, 39 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 4776257..b62e3a5 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -808,6 +808,8 @@ again:
> >  	if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
> >  	    (iocb->ki_filp->f_flags & O_DIRECT) ||
> >  	    (fi->flags & CEPH_F_SYNC)) {
> > +		if (flags & O_NONBLOCK)
> > +			return -EAGAIN;
> 
> Again, the right return value for the O_DIRECT case is EINVAL.

Is it?  We define -EAGAIN as "it would block", which is definitively true
for O_DIRECT reads.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-16 21:04       ` Christoph Hellwig
@ 2014-09-16 21:24         ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-16 21:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

Christoph Hellwig <hch@infradead.org> writes:

>> Again, the right return value for the O_DIRECT case is EINVAL.
>
> Is it?  We define -EAGAIN as it would block, which is definitively true
> for O_DIRECT reads.

It will *always* block.  So I don't think it's valid to ask for a
non-blocking read on a file opened with O_DIRECT.  What am I missing?
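
For what it's worth, here is how the two choices look from the caller's
side, as a small sketch.  nb_read_classify() and the enum are hypothetical
names, not part of the series.  With -EAGAIN, an O_DIRECT file falls
through to the blocking worker on every call; with -EINVAL, the caller gets
a hard error and learns that the combination is not supported.

```c
#include <errno.h>
#include <sys/types.h>

/* What a caller does next after a non-blocking read attempt. */
enum nb_read_action {
	NB_READ_DONE,	/* data (or EOF) was returned without blocking */
	NB_READ_DEFER,	/* would block: hand the read to a blocking worker */
	NB_READ_FAIL,	/* hard error: report it, retrying will not help */
};

/*
 * Classify the result of a read attempt.  "ret" is the syscall return
 * value and "err" is the errno observed when ret is negative.
 */
static enum nb_read_action nb_read_classify(ssize_t ret, int err)
{
	if (ret >= 0)
		return NB_READ_DONE;
	if (err == EAGAIN || err == EWOULDBLOCK)
		return NB_READ_DEFER;
	return NB_READ_FAIL;
}
```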

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-16 19:45       ` Milosz Tanski
@ 2014-09-16 21:42         ` Dave Chinner
  -1 siblings, 0 replies; 167+ messages in thread
From: Dave Chinner @ 2014-09-16 21:42 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Jeff Moyer, LKML, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

[Please don't top post!]

On Tue, Sep 16, 2014 at 03:45:52PM -0400, Milosz Tanski wrote:
> On Tue, Sep 16, 2014 at 3:27 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > Christoph Hellwig <milosz@adfin.com> writes:
> >
> > Hrm, you're not Christoph...
>
> I am not Christoph, we collaborated and he sent me this patch.

You're missing Jeff's point - have a look at the name and email
address the mail appears to be from. It's completely mangled - forged,
if you will - and Linus had a major rant about doing exactly this to
patch series recently.  There is a perfectly acceptable way of
crediting who the patch is from correctly without resorting to games
like this.

Also, this patch doesn't have a description or a valid SOB on it....

Please read Documentation/SubmittingPatches so you get the format of
the patches correct for V2. ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-16 21:42         ` Dave Chinner
@ 2014-09-17 12:24           ` Benjamin LaHaise
  -1 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-17 12:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Milosz Tanski, Jeff Moyer, LKML, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo

On Wed, Sep 17, 2014 at 07:42:54AM +1000, Dave Chinner wrote:
> [Please don't top post!]
> 
> On Tue, Sep 16, 2014 at 03:45:52PM -0400, Milosz Tanski wrote:
> > On Tue, Sep 16, 2014 at 3:27 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> > > Christoph Hellwig <milosz@adfin.com> writes:
> > >
> > > Hrm, you're not Christoph...
> >
> > I am not Christoph, we collaborated and he sent me this patch.
> 
> You're missing Jeff's point - have a look at the name and email
> address the mail appears to be from. It's completely mangled - forged,
> if you will - and Linus had a major rant about doing exactly this to
> patch series recently.  There is a perfectly acceptable way of
> crediting who the patch is from correctly without resorting to games
> like this.

Linus flamed me for that a few weeks ago.  The problem is that if one 
uses "git format-patch" to prepare a series of emails to post, it 
uses the patch's author for the From: entry.  I think that is a bug 
in git since multiple people have encountered this issue.  It's not 
like git doesn't know what one's email address is....

		-ben

> Also, this patch doesn't have a description or a valid SOB on it....
> 
> Please read Documentation/SubmittingPatches so you get the format of
> the patches correct for V2. ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majordomo@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-17 12:24           ` Benjamin LaHaise
@ 2014-09-17 13:47             ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2014-09-17 13:47 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Dave Chinner, Milosz Tanski, Jeff Moyer, LKML, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo

On Wed, Sep 17, 2014 at 08:24:48AM -0400, Benjamin LaHaise wrote:
> 
> Linus flamed me for that a few weeks ago.  The problem is that if one 
> uses "git format-patch" to prepare a series of emails to post, it 
> uses the patch's author for the From: entry.  I think that is a bug 
> in git since multiple people have encountered this issue.  It's not 
> like git doesn't know what one's email address is....

That's true, but "git send-email" takes care of doing the rewrite so
that your e-mail address is on the from line and the original author's
from: line is moved into the body of the e-mail:

% git log -1
commit 1419b5ae0b3bf093bd694dd592ebfdb58ac92d10
Author: Michael Forney <forney@google.com>
Date:   Mon Sep 15 14:30:00 2014 -0400

    Don't clear BUILD_CFLAGS and BUILD_LDFLAGS when cross-compiling
    
    Signed-off-by: Michael Forney <forney@google.com>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
% git format-patch -o /tmp/p -1
/tmp/p/0001-Don-t-clear-BUILD_CFLAGS-and-BUILD_LDFLAGS-when-cros.patch
% head -5 /tmp/p/0001-Don-t-clear-BUILD_CFLAGS-and-BUILD_LDFLAGS-when-cros.patch 
>From 1419b5ae0b3bf093bd694dd592ebfdb58ac92d10 Mon Sep 17 00:00:00 2001
From: Michael Forney <forney@google.com>
Date: Mon, 15 Sep 2014 14:30:00 -0400
Subject: [PATCH] Don't clear BUILD_CFLAGS and BUILD_LDFLAGS when
 cross-compiling
% git send-email /tmp/p --to tytso@mit.edu
/tmp/p/0001-Don-t-clear-BUILD_CFLAGS-and-BUILD_LDFLAGS-when-cros.patch
(mbox) Adding cc: Michael Forney <forney@google.com> from line 'From: Michael Forney <forney@google.com>'
(body) Adding cc: Michael Forney <forney@google.com> from line 'Signed-off-by: Michael Forney <forney@google.com>'
(body) Adding cc: Theodore Ts'o <tytso@mit.edu> from line 'Signed-off-by: Theodore Ts'o <tytso@mit.edu>'

From: Theodore Ts'o <tytso@mit.edu>
To: tytso@mit.edu
Cc: Michael Forney <forney@google.com>
Subject: [PATCH] Don't clear BUILD_CFLAGS and BUILD_LDFLAGS when cross-compiling
Date: Wed, 17 Sep 2014 09:45:09 -0400
Message-Id: <1410961509-10620-1-git-send-email-tytso@mit.edu>
X-Mailer: git-send-email 2.1.0

Send this email? ([y]es|[n]o|[q]uit|[a]ll): ^C

% git version
git version 2.1.0

Perhaps you and other people are using your own scripts, and not using
git send-email?


							- Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-17 13:47             ` Theodore Ts'o
@ 2014-09-17 13:56               ` Benjamin LaHaise
  -1 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-17 13:56 UTC (permalink / raw)
  To: Theodore Ts'o, Dave Chinner, Milosz Tanski, Jeff Moyer, LKML,
	Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

On Wed, Sep 17, 2014 at 09:47:02AM -0400, Theodore Ts'o wrote:
...
> % git version
> git version 2.1.0
> 
> Perhaps you and other people are using your own scripts, and not using
> git send-email?

That would be because none of my systems have git 2.1.0 on them.  Fedora 
(and EPEL) appear to still be back on git 1.9.3 which does not have git 
send-email.  Until that command is more widely propagated, I expect we'll 
see people making this mistake every once in a while.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [RFC 1/2] aio: async readahead
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-17 14:49   ` Benjamin LaHaise
  -1 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-17 14:49 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Andreas Dilger, Jeff Moyer

Hi Milosz et al,

This code is probably relevant to the non-blocking read thread.  A 
non-blocking read is pretty useless without some way to trigger and 
become aware of data being read into the page cache, and the attached 
patch is one way to do so.

The changes below introduce an async readahead operation that is based 
on readpage (sorry, I haven't done an mpage version of this code yet).  
Please note that this code was written against an older kernel (3.4) 
and hasn't been extensively tested against recent kernels, so there may 
be a few bugs lingering.  That said, the code has been enabled in our 
internal kernel at Solace Systems for a few months now with no reported 
issues.

There is a companion patch to make ext3's readpage operation use async 
metadata reads that will follow.  A test program that uses the new readahead 
operation can be found at http://www.kvack.org/~bcrl/aio-readahead.c .
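
The completion accounting in the patch below relies on a biased counter:
nr_pages_reading starts at 1, each page that is waited on adds a
reference, and the submitter drops the initial reference only after all
pages have been queued, so the iocb cannot complete early.  A standalone
userspace sketch of that idea, using C11 atomics instead of the kernel's
atomic_t (all sketch_* names are illustrative only and do not appear in
the patch):

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_state {
	atomic_int pending;	/* in-flight pages plus one submitter bias */
	bool completed;		/* set once by whoever drops the last ref */
};

/* Start with a bias of 1, held by the submitter. */
static void sketch_init(struct sketch_state *s)
{
	atomic_init(&s->pending, 1);
	s->completed = false;
}

/* Called by the submitter before it starts waiting on a page. */
static void sketch_page_started(struct sketch_state *s)
{
	atomic_fetch_add(&s->pending, 1);
}

/*
 * Drop one reference.  Called once per finished page and once by the
 * submitter when it is done queueing.  Returns true only for the call
 * that drops the counter to zero, which is the one that may complete
 * the request.
 */
static bool sketch_put(struct sketch_state *s)
{
	if (atomic_fetch_sub(&s->pending, 1) != 1)
		return false;
	s->completed = true;
	return true;
}
```

Because of the bias, it does not matter whether the pages finish before or
after the submitter returns; exactly one caller sees the count hit zero.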

		-ben
-- 
"Thought is the essence of where you are now."

 fs/aio.c                     |  220 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/pagemap.h      |    2 
 include/uapi/linux/aio_abi.h |    2 
 mm/filemap.c                 |    2 
 4 files changed, 225 insertions(+), 1 deletion(-)
diff --git a/fs/aio.c b/fs/aio.c
index 7337500..f1c0f74 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -46,6 +46,8 @@
 
 #include "internal.h"
 
+static long aio_readahead(struct kiocb *iocb);
+
 #define AIO_RING_MAGIC			0xa10a10a1
 #define AIO_RING_COMPAT_FEATURES	1
 #define AIO_RING_INCOMPAT_FEATURES	0
@@ -1379,6 +1381,12 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 		iter_op	= file->f_op->read_iter;
 		goto rw_common;
 
+	case IOCB_CMD_READAHEAD:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_READ)))
+			break;
+		return aio_readahead(req);
+
 	case IOCB_CMD_PWRITE:
 	case IOCB_CMD_PWRITEV:
 		mode	= FMODE_WRITE;
@@ -1710,3 +1718,215 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
 	}
 	return ret;
 }
+
+/* for readahead */
+struct readahead_state;
+struct readahead_pginfo {
+	struct wait_bit_queue		wait_bit;
+	struct page			*page;
+};
+
+struct readahead_state {
+	struct kiocb			*iocb;
+	unsigned			nr_pages;
+	atomic_t			nr_pages_reading;
+
+	struct readahead_pginfo	pginfo[];
+};
+
+static void aio_readahead_complete(struct readahead_state *state)
+{
+	unsigned i, nr_uptodate = 0;
+	struct kiocb *iocb;
+	long res;
+	if (!atomic_dec_and_test(&state->nr_pages_reading))
+		return;
+	for (i = 0; i < state->nr_pages; i++) {
+		struct page *page = state->pginfo[i].page;
+
+		if (PageUptodate(page))
+			nr_uptodate++;
+		page_cache_release(page);
+	}
+	iocb = state->iocb;
+	if (nr_uptodate == state->nr_pages) {
+		res = iocb->ki_nbytes;
+	} else
+		res = -EIO;
+	kfree(state);
+	aio_complete(iocb, res, 0);
+}
+
+static int pginfo_wait_func(wait_queue_t *wait, unsigned mode, int flags,
+			    void *arg)
+{
+	struct readahead_state *state = wait->private;
+	struct readahead_pginfo *pginfo;
+	struct wait_bit_key *key = arg;
+	unsigned idx;
+
+	pginfo = container_of(wait, struct readahead_pginfo, wait_bit.wait);
+	idx = pginfo - state->pginfo;
+	BUG_ON(idx >= state->nr_pages);
+
+	if (pginfo->wait_bit.key.flags != key->flags ||
+	    pginfo->wait_bit.key.bit_nr != key->bit_nr ||
+	    test_bit(key->bit_nr, key->flags))
+		return 0;
+	list_del_init(&wait->task_list);
+	aio_readahead_complete(state);
+	return 1;
+}
+
+static void pginfo_wait_on_page(struct readahead_state *state,
+				struct readahead_pginfo *pginfo)
+{
+	struct page *page = pginfo->page;
+	wait_queue_head_t *wq;
+	unsigned long flags;
+	int ret;
+
+	pginfo->wait_bit.key.flags = &page->flags;
+	pginfo->wait_bit.key.bit_nr = PG_locked;
+	pginfo->wait_bit.wait.private = state;
+	pginfo->wait_bit.wait.func = pginfo_wait_func;
+	
+	page = pginfo->page;
+	wq = page_waitqueue(page);
+	atomic_inc(&state->nr_pages_reading);
+
+	spin_lock_irqsave(&wq->lock, flags);
+	__add_wait_queue(wq, &pginfo->wait_bit.wait);
+	if (!PageLocked(page))
+		ret = pginfo_wait_func(&pginfo->wait_bit.wait, 0, 0,
+				       &pginfo->wait_bit.key);
+	spin_unlock_irqrestore(&wq->lock, flags);
+}
+
+
+/*
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates all
+ * the pages first, then submits them all for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
+ *
+ * Returns the number of pages requested, or the maximum amount of I/O allowed.
+ */
+static int
+__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
+			pgoff_t offset, unsigned long nr_to_read,
+			unsigned long lookahead_size,
+			struct readahead_state *state)
+{
+	struct inode *inode = mapping->host;
+	struct page *page;
+	unsigned long end_index;	/* The last page we want to read */
+	LIST_HEAD(page_pool);
+	int page_idx;
+	int ret = 0;
+	loff_t isize = i_size_read(inode);
+
+	if (isize == 0)
+		goto out;
+
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+
+	/*
+	 * Preallocate as many pages as we will need.
+	 */
+	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
+		pgoff_t page_offset = offset + page_idx;
+		struct readahead_pginfo *pginfo = &state->pginfo[page_idx];
+		int locked = 0;
+
+		if (page_offset > end_index)
+			break;
+
+		init_waitqueue_func_entry(&pginfo->wait_bit.wait,
+					  pginfo_wait_func);
+find_page:
+		page = find_get_page(mapping, page_offset);
+		if (!page) {
+			int err;
+			page = page_cache_alloc_cold(mapping);
+			err = add_to_page_cache_lru(page, mapping,
+						    page_offset,
+						    GFP_KERNEL);
+			if (err)
+				page_cache_release(page);
+			if (err == -EEXIST)
+				goto find_page;
+			if (err)
+				break;
+			locked = 1;
+		}
+		if (!page)
+			break;
+
+		ret++;
+		state->nr_pages++;
+		pginfo->page = page;
+		if (!locked && PageUptodate(page))
+			continue;
+		if (locked || trylock_page(page)) {
+			if (PageUptodate(page)) {
+				unlock_page(page);
+				continue;
+			}
+			pginfo_wait_on_page(state, pginfo);
+
+			/* Ignoring the return code from readpage here is
+			 * safe, as the readpage() operation will unlock
+			 * the page and thus kick our state machine.
+			 */
+			mapping->a_ops->readpage(filp, page);
+			continue;
+		}
+		pginfo_wait_on_page(state, pginfo);
+	}
+
+out:
+	return ret;
+}
+
+static long aio_readahead(struct kiocb *iocb)
+{
+	struct file *filp = iocb->ki_filp;
+	struct readahead_state *state;
+	pgoff_t start, end;
+	unsigned nr_pages;
+	int ret;
+
+	if (!filp->f_mapping || !filp->f_mapping->a_ops ||
+	    !filp->f_mapping->a_ops->readpage)
+		return -EINVAL;
+
+	if (iocb->ki_nbytes == 0) {
+		aio_complete(iocb, 0, 0);
+		return 0;
+	}
+
+	start = iocb->ki_pos >> PAGE_CACHE_SHIFT;
+	end = (iocb->ki_pos + iocb->ki_nbytes - 1) >> PAGE_CACHE_SHIFT;
+	nr_pages = 1 + end - start;
+
+	state = kzalloc(sizeof(*state) +
+			nr_pages * sizeof(struct readahead_pginfo),
+			GFP_KERNEL);
+	if (!state)
+		return -ENOMEM;
+
+	state->iocb = iocb;
+	atomic_set(&state->nr_pages_reading, 1);
+
+	ret = __do_page_cache_readahead(filp->f_mapping, filp, start, nr_pages,
+					0, state);
+	if (ret <= 0) {
+		kfree(state);
+		aio_complete(iocb, 0, 0);
+		return 0;
+	}
+
+	aio_readahead_complete(state);	// Drops ref of 1 from nr_pages_reading
+	return 0;
+}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7d..afd1f20 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -495,6 +495,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
 	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
 }
 
+wait_queue_head_t *page_waitqueue(struct page *page);
+
 /*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..11723c53 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
+
+	IOCB_CMD_READAHEAD = 12,
 };
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..3368b73 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -670,7 +670,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-static wait_queue_head_t *page_waitqueue(struct page *page)
+wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const struct zone *zone = page_zone(page);
 


* [RFC 1/2] aio: async readahead
@ 2014-09-17 14:49   ` Benjamin LaHaise
  0 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-17 14:49 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Andreas Dilger, Jeff Moyer

Hi Milosz et al,

This code is probably relevant to the non-blocking read thread.  A 
non-blocking read is pretty useless without some way to trigger and 
become aware of data being read into the page cache, and the attached 
patch is one way to do so.

The changes below introduce an async readahead operation that is based 
on readpage (sorry, I haven't done an mpage version of this code yet).  
Please note that this code was written against an older kernel (3.4) 
and hasn't been extensively tested against recent kernels, so there may 
be a few bugs lingering.  That said, the code has been enabled in our 
internal kernel at Solace Systems for a few months now with no reported 
issues.
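
The completion accounting used by the patch below can be modelled in 
userspace.  This is only a minimal sketch, assuming C11 atomics stand in 
for the kernel's atomic_t; ra_model, ra_model_init(), 
ra_model_wait_on_page() and ra_model_put() are hypothetical names that do 
not appear in the patch.  The counter starts at 1 so that the request 
cannot complete while pages are still being queued, each page being 
waited on takes one more reference, and whoever drops the last reference 
completes the request.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the counter in struct readahead_state. */
struct ra_model {
	atomic_int nr_pages_reading;	/* outstanding references */
	bool completed;			/* set once, by the final put */
};

/* Start with a bias of 1, held by the submitter while it queues pages. */
static void ra_model_init(struct ra_model *m)
{
	atomic_init(&m->nr_pages_reading, 1);
	m->completed = false;
}

/* Take one reference per page whose unlock we are waiting for. */
static void ra_model_wait_on_page(struct ra_model *m)
{
	atomic_fetch_add(&m->nr_pages_reading, 1);
}

/* Drop a reference; returns true if this call completed the request. */
static bool ra_model_put(struct ra_model *m)
{
	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&m->nr_pages_reading, 1) != 1)
		return false;
	m->completed = true;
	return true;
}
```

In this model the submitter's final ra_model_put() plays the role of the 
aio_readahead_complete() call at the end of aio_readahead(), and each 
page unlock plays the role of pginfo_wait_func().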

There is a companion patch to make ext3's readpage operation use async 
metadata reads that will follow.  A test program that uses the new readahead 
operation can be found at http://www.kvack.org/~bcrl/aio-readahead.c .
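
For readers without access to the test program, the following shows 
roughly how a request might be described from userspace.  It is a sketch 
under stated assumptions, not the contents of aio-readahead.c: 
sketch_prep_readahead() and SKETCH_IOCB_CMD_READAHEAD are hypothetical 
names, and the opcode value 12 is taken from the aio_abi.h hunk in this 
patch (mainline headers do not define it).  Only struct iocb and its 
fields come from <linux/aio_abi.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/aio_abi.h>

/* Opcode value taken from this patch; not defined by mainline headers. */
#define SKETCH_IOCB_CMD_READAHEAD 12

/*
 * Hypothetical helper: describe "populate the page cache for
 * [offset, offset + len) of fd".  No user buffer is involved, so
 * aio_buf is left at zero.
 */
static void sketch_prep_readahead(struct iocb *cb, int fd,
				  uint64_t offset, uint64_t len, void *tag)
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_lio_opcode = SKETCH_IOCB_CMD_READAHEAD;
	cb->aio_fildes = (uint32_t)fd;
	cb->aio_offset = (int64_t)offset;	/* becomes iocb->ki_pos */
	cb->aio_nbytes = len;			/* becomes iocb->ki_nbytes */
	cb->aio_data = (uint64_t)(uintptr_t)tag; /* echoed back in the event */
}
```

On a kernel carrying this patch the iocb would then be handed to 
io_submit(2) and the completion collected with io_getevents(2); going by 
aio_readahead_complete(), the event result should be the requested byte 
count when every page ended up uptodate and -EIO otherwise.  A kernel 
without the patch should reject the unknown opcode.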

		-ben
-- 
"Thought is the essence of where you are now."

 fs/aio.c                     |  220 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/pagemap.h      |    2 
 include/uapi/linux/aio_abi.h |    2 
 mm/filemap.c                 |    2 
 4 files changed, 225 insertions(+), 1 deletion(-)
diff --git a/fs/aio.c b/fs/aio.c
index 7337500..f1c0f74 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -46,6 +46,8 @@
 
 #include "internal.h"
 
+static long aio_readahead(struct kiocb *iocb);
+
 #define AIO_RING_MAGIC			0xa10a10a1
 #define AIO_RING_COMPAT_FEATURES	1
 #define AIO_RING_INCOMPAT_FEATURES	0
@@ -1379,6 +1381,12 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
 		iter_op	= file->f_op->read_iter;
 		goto rw_common;
 
+	case IOCB_CMD_READAHEAD:
+		ret = -EBADF;
+		if (unlikely(!(file->f_mode & FMODE_READ)))
+			break;
+		return aio_readahead(req);
+
 	case IOCB_CMD_PWRITE:
 	case IOCB_CMD_PWRITEV:
 		mode	= FMODE_WRITE;
@@ -1710,3 +1718,215 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
 	}
 	return ret;
 }
+
+/* for readahead */
+struct readahead_state;
+struct readahead_pginfo {
+	struct wait_bit_queue		wait_bit;
+	struct page			*page;
+};
+
+struct readahead_state {
+	struct kiocb			*iocb;
+	unsigned			nr_pages;
+	atomic_t			nr_pages_reading;
+
+	struct readahead_pginfo	pginfo[];
+};
+
+static void aio_readahead_complete(struct readahead_state *state)
+{
+	unsigned i, nr_uptodate = 0;
+	struct kiocb *iocb;
+	long res;
+	if (!atomic_dec_and_test(&state->nr_pages_reading))
+		return;
+	for (i = 0; i < state->nr_pages; i++) {
+		struct page *page = state->pginfo[i].page;
+
+		if (PageUptodate(page))
+			nr_uptodate++;
+		page_cache_release(page);
+	}
+	iocb = state->iocb;
+	if (nr_uptodate == state->nr_pages) {
+		res = iocb->ki_nbytes;
+	} else
+		res = -EIO;
+	kfree(state);
+	aio_complete(iocb, res, 0);
+}
+
+static int pginfo_wait_func(wait_queue_t *wait, unsigned mode, int flags,
+			    void *arg)
+{
+	struct readahead_state *state = wait->private;
+	struct readahead_pginfo *pginfo;
+	struct wait_bit_key *key = arg;
+	unsigned idx;
+
+	pginfo = container_of(wait, struct readahead_pginfo, wait_bit.wait);
+	idx = pginfo - state->pginfo;
+	BUG_ON(idx >= state->nr_pages);
+
+	if (pginfo->wait_bit.key.flags != key->flags ||
+	    pginfo->wait_bit.key.bit_nr != key->bit_nr ||
+	    test_bit(key->bit_nr, key->flags))
+		return 0;
+	list_del_init(&wait->task_list);
+	aio_readahead_complete(state);
+	return 1;
+}
+
+static void pginfo_wait_on_page(struct readahead_state *state,
+				struct readahead_pginfo *pginfo)
+{
+	struct page *page = pginfo->page;
+	wait_queue_head_t *wq;
+	unsigned long flags;
+	int ret;
+
+	pginfo->wait_bit.key.flags = &page->flags;
+	pginfo->wait_bit.key.bit_nr = PG_locked;
+	pginfo->wait_bit.wait.private = state;
+	pginfo->wait_bit.wait.func = pginfo_wait_func;
+
+	page = pginfo->page;
+	wq = page_waitqueue(page);
+	atomic_inc(&state->nr_pages_reading);
+
+	spin_lock_irqsave(&wq->lock, flags);
+	__add_wait_queue(wq, &pginfo->wait_bit.wait);
+	if (!PageLocked(page))
+		ret = pginfo_wait_func(&pginfo->wait_bit.wait, 0, 0,
+				       &pginfo->wait_bit.key);
+	spin_unlock_irqrestore(&wq->lock, flags);
+}
+
+
+/*
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates all
+ * the pages first, then submits them all for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
+ *
+ * Returns the number of pages requested, or the maximum amount of I/O allowed.
+ */
+static int
+__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
+			pgoff_t offset, unsigned long nr_to_read,
+			unsigned long lookahead_size,
+			struct readahead_state *state)
+{
+	struct inode *inode = mapping->host;
+	struct page *page;
+	unsigned long end_index;	/* The last page we want to read */
+	LIST_HEAD(page_pool);
+	int page_idx;
+	int ret = 0;
+	loff_t isize = i_size_read(inode);
+
+	if (isize == 0)
+		goto out;
+
+	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+
+	/*
+	 * Preallocate as many pages as we will need.
+	 */
+	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
+		pgoff_t page_offset = offset + page_idx;
+		struct readahead_pginfo *pginfo = &state->pginfo[page_idx];
+		int locked = 0;
+
+		if (page_offset > end_index)
+			break;
+
+		init_waitqueue_func_entry(&pginfo->wait_bit.wait,
+					  pginfo_wait_func);
+find_page:
+		page = find_get_page(mapping, page_offset);
+		if (!page) {
+			int err;
+			page = page_cache_alloc_cold(mapping);
+			err = add_to_page_cache_lru(page, mapping,
+						    page_offset,
+						    GFP_KERNEL);
+			if (err)
+				page_cache_release(page);
+			if (err == -EEXIST)
+				goto find_page;
+			if (err)
+				break;
+			locked = 1;
+		}
+		if (!page)
+			break;
+
+		ret++;
+		state->nr_pages++;
+		pginfo->page = page;
+		if (!locked && PageUptodate(page))
+			continue;
+		if (locked || trylock_page(page)) {
+			if (PageUptodate(page)) {
+				unlock_page(page);
+				continue;
+			}
+			pginfo_wait_on_page(state, pginfo);
+
+			/* Ignoring the return code from readpage here is
+			 * safe, as the readpage() operation will unlock
+			 * the page and thus kick our state machine.
+			 */
+			mapping->a_ops->readpage(filp, page);
+			continue;
+		}
+		pginfo_wait_on_page(state, pginfo);
+	}
+
+out:
+	return ret;
+}
+
+static long aio_readahead(struct kiocb *iocb)
+{
+	struct file *filp = iocb->ki_filp;
+	struct readahead_state *state;
+	pgoff_t start, end;
+	unsigned nr_pages;
+	int ret;
+
+	if (!filp->f_mapping || !filp->f_mapping->a_ops ||
+	    !filp->f_mapping->a_ops->readpage)
+		return -EINVAL;
+
+	if (iocb->ki_nbytes == 0) {
+		aio_complete(iocb, 0, 0);
+		return 0;
+	}
+
+	start = iocb->ki_pos >> PAGE_CACHE_SHIFT;
+	end = (iocb->ki_pos + iocb->ki_nbytes - 1) >> PAGE_CACHE_SHIFT;
+	nr_pages = 1 + end - start;
+
+	state = kzalloc(sizeof(*state) +
+			nr_pages * sizeof(struct readahead_pginfo),
+			GFP_KERNEL);
+	if (!state)
+		return -ENOMEM;
+
+	state->iocb = iocb;
+	atomic_set(&state->nr_pages_reading, 1);
+
+	ret = __do_page_cache_readahead(filp->f_mapping, filp, start, nr_pages,
+					0, state);
+	if (ret <= 0) {
+		kfree(state);
+		aio_complete(iocb, 0, 0);
+		return 0;
+	}
+
+	aio_readahead_complete(state);	// Drops ref of 1 from nr_pages_reading
+	return 0;
+}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7d..afd1f20 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -495,6 +495,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
 	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
 }
 
+wait_queue_head_t *page_waitqueue(struct page *page);
+
 /*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f..11723c53 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@ enum {
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
+
+	IOCB_CMD_READAHEAD = 12,
 };
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..3368b73 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -670,7 +670,7 @@ EXPORT_SYMBOL(__page_cache_alloc);
  * at a cost of "thundering herd" phenomena during rare hash
  * collisions.
  */
-static wait_queue_head_t *page_waitqueue(struct page *page)
+wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const struct zone *zone = page_zone(page);
 


* [RFC 2/2] ext4: async readpage for indirect style inodes
  2014-09-17 14:49   ` Benjamin LaHaise
@ 2014-09-17 15:26     ` Benjamin LaHaise
  -1 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-17 15:26 UTC (permalink / raw)
  To: Milosz Tanski, Theodore Ts'o
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Andreas Dilger

Hi all,

And here is the version of readpage for ext3/ext4 that performs async 
metadata reads for old-style indirect block based ext3/ext4 filesystems.  
This version only includes the changes against ext4 -- the changes to 
ext3 are pretty much identical.  This is only an RFC and has at least 
one known issue, that being that it only works on ext3 filesystems with 
block size equal to page size.
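
To make "indirect style" concrete, here is a small userspace model of 
how deep the block map is for a given logical block.  It is a sketch 
that assumes the classic ext2/ext3 layout (12 direct pointers, then 
single, double and triple indirect blocks); sketch_path_depth() and 
SKETCH_NDIR_BLOCKS are hypothetical names, and the function only mirrors 
the depth that ext4_block_to_path() returns, not its offsets[] output.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NDIR_BLOCKS 12	/* direct pointers held in the inode */

/*
 * Hypothetical model: number of pointer lookups needed to reach logical
 * block lblk when each indirect block holds (1 << ptrs_bits) pointers.
 * Returns 1..4, or 0 if lblk lies beyond the triple-indirect range.
 */
static int sketch_path_depth(uint64_t lblk, unsigned ptrs_bits)
{
	const uint64_t ind = 1ULL << ptrs_bits;		/* single indirect */
	const uint64_t dind = 1ULL << (2 * ptrs_bits);	/* double indirect */
	const uint64_t tind = 1ULL << (3 * ptrs_bits);	/* triple indirect */

	if (lblk < SKETCH_NDIR_BLOCKS)
		return 1;
	lblk -= SKETCH_NDIR_BLOCKS;
	if (lblk < ind)
		return 2;
	lblk -= ind;
	if (lblk < dind)
		return 3;
	lblk -= dind;
	if (lblk < tind)
		return 4;
	return 0;
}
```

With 4 KiB blocks and 4-byte pointers, ptrs_bits is 10.  A depth of 1 
means the data block number sits in the inode itself, which is the case 
the patch hands straight to mpage_readpage(); for larger depths the 
state machine has to read depth - 1 metadata blocks before it can submit 
the data bio.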

		-ben
-- 
"Thought is the essence of where you are now."

 fs/ext4/ext4.h          |    3 
 fs/ext4/indirect.c      |    6 
 fs/ext4/inode.c         |  294 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/mpage.c              |    4 
 include/linux/mpage.h   |    3 
 include/linux/pagemap.h |    2 
 6 files changed, 307 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0c225c..8136284 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2835,6 +2835,9 @@ extern struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ];
 extern int ext4_resize_begin(struct super_block *sb);
 extern void ext4_resize_end(struct super_block *sb);
 
+extern int ext4_block_to_path(struct inode *inode,
+			      ext4_lblk_t i_block,
+			      ext4_lblk_t offsets[4], int *boundary);
 #endif	/* __KERNEL__ */
 
 #endif	/* _EXT4_H */
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..689267a 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -69,9 +69,9 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
  * get there at all.
  */
 
-static int ext4_block_to_path(struct inode *inode,
-			      ext4_lblk_t i_block,
-			      ext4_lblk_t offsets[4], int *boundary)
+int ext4_block_to_path(struct inode *inode,
+		       ext4_lblk_t i_block,
+		       ext4_lblk_t offsets[4], int *boundary)
 {
 	int ptrs = EXT4_ADDR_PER_BLOCK(inode->i_sb);
 	int ptrs_bits = EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..4b36000 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2820,6 +2820,297 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping, block, ext4_get_block);
 }
 
+#include <linux/bio.h>
+
+struct ext4_readpage_state {
+	struct page		*page;
+	struct inode		*inode;
+	int			offsets[4];
+	int			depth;
+	int			blocks_to_boundary;
+	int			cur_depth;
+	int			cur_block;
+	int			waiting_on_lock;
+	struct buffer_head	*cur_bh;
+	struct bio		*bio;
+	struct wait_bit_queue	wait_bit;
+	struct work_struct	work;
+};
+
+#define dprintk(x...) do { ; } while (0)
+
+static void ext4_readpage_statemachine(struct ext4_readpage_state *state);
+static void ext4_readpage_work_func(struct work_struct *work)
+{
+	struct ext4_readpage_state *state;
+
+	state = container_of(work, struct ext4_readpage_state, work);
+	dprintk("ext4_readpage_work_func(%p): state=%p\n", work, state);
+	ext4_readpage_statemachine(state);
+}
+
+static int ext4_readpage_wait_func(wait_queue_t *wait, unsigned mode, int flags,
+				   void *arg)
+{
+	struct ext4_readpage_state *state = wait->private;
+	struct wait_bit_key *key = arg;
+
+	dprintk("ext4_readpage_wait_func: state=%p\n", state);
+	dprintk("key->flags=%p key->bit_nr=%d, page->flags=%p\n",
+		key->flags, key->bit_nr, &state->page->flags);
+	if (state->wait_bit.key.flags != key->flags ||
+	    state->wait_bit.key.bit_nr != key->bit_nr ||
+	    test_bit(key->bit_nr, key->flags))
+		return 0;
+	dprintk("ext4_readpage_wait_func: page is unlocked/uptodate\n");
+	list_del_init(&wait->task_list);
+	INIT_WORK(&state->work, ext4_readpage_work_func);
+	schedule_work(&state->work);
+	return 1;
+}
+
+static void ext4_readpage_wait_on_bh(struct ext4_readpage_state *state)
+{
+	wait_queue_head_t *wq;
+	unsigned long flags;
+	int ret = 0;
+
+	state->wait_bit.key.flags = &state->cur_bh->b_state;
+	state->wait_bit.key.bit_nr = BH_Lock;
+	state->wait_bit.wait.private = state;
+	state->wait_bit.wait.func = ext4_readpage_wait_func;
+
+	wq = bit_waitqueue(state->wait_bit.key.flags,
+			   state->wait_bit.key.bit_nr);
+	spin_lock_irqsave(&wq->lock, flags);
+	__add_wait_queue(wq, &state->wait_bit.wait);
+	if (!buffer_locked(state->cur_bh)) {
+		dprintk("ext4_readpage_wait_on_bh(%p): buffer not locked\n", state);
+		list_del_init(&state->wait_bit.wait.task_list);
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&wq->lock, flags);
+
+	dprintk("ext4_readpage_wait_on_bh(%p): ret=%d\n", state, ret);
+	if (ret)
+		ext4_readpage_statemachine(state);
+}
+
+static void ext4_readpage_statemachine(struct ext4_readpage_state *state)
+{
+	struct ext4_inode_info *ei = EXT4_I(state->inode);
+	struct buffer_head *bh;
+	int offset;
+	__le32 *blkp;
+	u32 blocknr;
+
+	dprintk("ext4_readpage_statemachine(%p): cur_depth=%d\n",
+		state, state->cur_depth);
+
+	if (state->waiting_on_lock)
+		goto lock_buffer;
+
+	offset = state->offsets[state->cur_depth];
+	if (state->cur_depth == 0)
+		blkp = ei->i_data + offset;
+	else {
+		if (!buffer_uptodate(state->cur_bh)) {
+			dprintk("ext4_readpage_statemachine: "
+				"!buffer_uptodate(%Lu)\n",
+				(unsigned long long)state->cur_bh->b_blocknr);
+			brelse(state->cur_bh);
+			bio_put(state->bio);
+			SetPageError(state->page);
+			unlock_page(state->page);
+			return;	// FIXME: verify error handling is correct
+		}
+		blkp = (__le32 *)state->cur_bh->b_data + offset;
+	}
+
+	blocknr = le32_to_cpu(*blkp);
+	if (state->cur_bh)
+		brelse(state->cur_bh);
+	state->cur_depth++;
+
+	dprintk("state->cur_depth=%d depth=%d offset=%d blocknr=%u\n",
+		state->cur_depth, state->depth, offset, blocknr);
+	if (state->cur_depth == state->depth) {
+		dprintk("submitting bio %p for block %u\n", state->bio, blocknr);
+		state->bio->bi_iter.bi_sector =
+			(sector_t)blocknr << (state->inode->i_blkbits - 9);
+		mpage_bio_submit(READ, state->bio);
+		return;
+	}
+
+	state->cur_bh = sb_getblk(state->inode->i_sb, blocknr);
+	if (!state->cur_bh) {
+		dprintk("sb_getblk(%p, %u) failed\n",
+			state->inode->i_sb, blocknr);
+		dprintk("FAIL!\n");
+		bio_put(state->bio);
+		SetPageError(state->page);
+		unlock_page(state->page);
+		return;	// FIXME - verify error handling
+	}
+
+	dprintk("ext4_readpage_statemachine: cur_bh=%p\n", state->cur_bh);
+
+lock_buffer:
+	state->waiting_on_lock = 0;
+	if (buffer_uptodate(state->cur_bh)) {
+		dprintk("ext4_readpage_statemachine(%p): buffer uptodate\n",
+			state);
+		return ext4_readpage_statemachine(state);
+	}
+
+	dprintk("ext4_readpage_statemachine(%p): locking buffer\n", state);
+
+	if (!trylock_buffer(state->cur_bh)) {
+		state->waiting_on_lock = 1;
+		ext4_readpage_wait_on_bh(state);
+		return;
+	}
+
+	/* We have the buffer locked */
+	if (buffer_uptodate(state->cur_bh)) {
+		dprintk("ext4_readpage_statemachine: buffer uptodate after lock\n");
+		unlock_buffer(state->cur_bh);
+		return ext4_readpage_statemachine(state);
+	}
+
+	bh = state->cur_bh;
+	get_bh(bh);
+	bh->b_end_io = end_buffer_read_sync;
+	ext4_readpage_wait_on_bh(state);
+	submit_bh(READ | REQ_META /*| REQ_PRIO*/, bh);
+}
+
+static unsigned ext4_count_meta(unsigned blocks, unsigned ind_shift)
+{
+	const unsigned dind_shift = ind_shift * 2;
+	unsigned blocks_per_ind = 1U << ind_shift;
+	unsigned blocks_per_dind = 1U << dind_shift;
+	unsigned nr_meta = 0;
+
+	dprintk("ext4_count_meta(%Ld, %u)\n", (long long)blocks, blocks_per_ind);
+
+	/* direct entry? */
+	if (blocks <= EXT4_NDIR_BLOCKS)
+		return 0;
+
+	/* This has to be an indirect entry */
+	nr_meta ++;			// The indirect block
+	blocks -= EXT4_NDIR_BLOCKS;
+	if (blocks <= blocks_per_ind)
+		return 1;
+	blocks -= blocks_per_ind;
+
+	/* Now we have a double indirect entry */
+	nr_meta ++;			// The double indirect block
+	if (blocks <= blocks_per_dind) {
+		nr_meta += (blocks + blocks_per_ind - 1) >> ind_shift;
+		return nr_meta;
+	}
+
+	nr_meta += blocks_per_ind;	// The indirect blocks in the dind
+	blocks -= blocks_per_dind;
+
+	nr_meta ++;			// The triple indirect block
+
+	// The double indirect in the tind
+	nr_meta += (blocks + blocks_per_dind - 1) >> dind_shift;
+
+	// The indirect blocks pointed to by the dinds by the tind
+	nr_meta += (blocks + blocks_per_ind - 1) >> ind_shift;
+
+	return nr_meta;
+}
+
+static int ext4_async_readpage(struct file *file, struct page *page)
+{
+	struct ext4_readpage_state *state = page_address(page);
+	struct inode *inode = page->mapping->host;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int blocks_to_boundary = 0;
+	int offsets[4] = { 0, };
+	sector_t iblock;
+	__le32 indirect;
+	sector_t blk;
+	int depth;
+
+	/* Attempt metadata readahead if this is a read of the first page of
+	 * the file.  We assume the metadata is contiguously laid out starting
+	 * at the first indirect block of the file.
+	 */
+	indirect = ei->i_data[EXT4_IND_BLOCK];
+	blk = le32_to_cpu(indirect);
+	dprintk("ext4_async_readpage: index=%Lu, blk=%u\n",
+		(unsigned long long)page->index, (unsigned)blk);
+	if ((page->index == 0) && blk) {
+		loff_t i_size = i_size_read(inode);
+		unsigned i, nr_meta;
+		i_size += PAGE_SIZE - 1;
+		nr_meta = ext4_count_meta(i_size >> inode->i_blkbits,
+					  inode->i_blkbits - 2);
+		dprintk("readpage(0): blk[IND]=%u nr_meta=%u blk[0]=%u i_size=%Ld\n",
+			(unsigned)blk, nr_meta, le32_to_cpu(ei->i_data[0]),
+			(long long)i_size);
+		for (i=0; i < nr_meta; i++) {
+			sb_breadahead(inode->i_sb, blk + i);
+		}
+	}
+
+	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+	depth = ext4_block_to_path(inode, iblock, offsets, &blocks_to_boundary);
+	if (page_has_buffers(page)) {
+		int nr_uptodate = 0, nr = 0;
+		struct buffer_head *bh, *head;
+
+		head = bh = page_buffers(page);
+		do {
+			nr++;
+			nr_uptodate += !!buffer_uptodate(bh);
+			bh = bh->b_this_page;
+		} while (bh != head) ;
+		dprintk("inode(%lu) index=%Lu has nr=%d nr_up=%d\n",
+			inode->i_ino, (unsigned long long)page->index,
+			nr, nr_uptodate);
+		// A previous write may have already marked a buffer_head
+		// covering the page as uptodate.  Reuse mpage_readpage() to
+		// handle this case.
+		if (nr_uptodate > 0) {
+			return mpage_readpage(page, ext4_get_block);
+		}
+	}
+
+	if (depth == 1)
+		return mpage_readpage(page, ext4_get_block);
+
+	/* now set things up for reading the page */
+	memset(state, 0, sizeof(*state));
+	state->page = page;
+	state->inode = inode;
+	state->depth = depth;
+	state->blocks_to_boundary = blocks_to_boundary;
+	memcpy(state->offsets, offsets, sizeof(offsets));
+
+	dprintk("inode[%lu] page[%Lu]: depth=%d, offsets=%d,%d,%d,%d\n",
+		inode->i_ino, (unsigned long long)page->index, state->depth,
+		state->offsets[0], state->offsets[1], state->offsets[2],
+		state->offsets[3]);
+
+	state->bio = mpage_alloc(inode->i_sb->s_bdev, 0, 1, GFP_NOFS);
+	if (!state->bio) {
+		dprintk("ext4_async_readpage(%p, %Lu): mpage_alloc failed\n",
+			file, (unsigned long long)iblock);
+		unlock_page(page);
+		return -ENOMEM;
+	}
+	bio_add_page(state->bio, page, PAGE_SIZE, 0);
+	ext4_readpage_statemachine(state);
+	return 0;
+}
+
 static int ext4_readpage(struct file *file, struct page *page)
 {
 	int ret = -EAGAIN;
@@ -2827,6 +3118,9 @@ static int ext4_readpage(struct file *file, struct page *page)
 
 	trace_ext4_readpage(page);
 
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+		return ext4_async_readpage(file, page);
+
 	if (ext4_has_inline_data(inode))
 		ret = ext4_readpage_inline(inode, page);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index 5f9ed62..0fca557 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -54,14 +54,14 @@ static void mpage_end_io(struct bio *bio, int err)
 	bio_put(bio);
 }
 
-static struct bio *mpage_bio_submit(int rw, struct bio *bio)
+struct bio *mpage_bio_submit(int rw, struct bio *bio)
 {
 	bio->bi_end_io = mpage_end_io;
 	submit_bio(rw, bio);
 	return NULL;
 }
 
-static struct bio *
+struct bio *
 mpage_alloc(struct block_device *bdev,
 		sector_t first_sector, int nr_vecs,
 		gfp_t gfp_flags)
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
index 068a0c9..3bf909b 100644
--- a/include/linux/mpage.h
+++ b/include/linux/mpage.h
@@ -13,6 +13,9 @@
 
 struct writeback_control;
 
+struct bio *mpage_bio_submit(int rw, struct bio *bio);
+struct bio * mpage_alloc(struct block_device *bdev, sector_t first_sector,
+			 int nr_vecs, gfp_t gfp_flags);
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7d..afd1f20 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -495,6 +495,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
 	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
 }
 
+wait_queue_head_t *page_waitqueue(struct page *page);
+
 /*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!


* [RFC 2/2] ext4: async readpage for indirect style inodes
@ 2014-09-17 15:26     ` Benjamin LaHaise
  0 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-17 15:26 UTC (permalink / raw)
  To: Milosz Tanski, Theodore Ts'o
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Andreas Dilger

Hi all,

And here is the version of readpage for ext3/ext4 that performs async 
metadata reads for old-style indirect block based ext3/ext4 filesystems.  
This version only includes the changes against ext4 -- the changes to 
ext3 are pretty much identical.  This is only an RFC and has at least 
one known issue, that being that it only works on ext3 filesystems with 
block size equal to page size.

		-ben
-- 
"Thought is the essence of where you are now."

 fs/ext4/ext4.h          |    3 
 fs/ext4/indirect.c      |    6 
 fs/ext4/inode.c         |  294 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/mpage.c              |    4 
 include/linux/mpage.h   |    3 
 include/linux/pagemap.h |    2 
 6 files changed, 307 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0c225c..8136284 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2835,6 +2835,9 @@ extern struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ];
 extern int ext4_resize_begin(struct super_block *sb);
 extern void ext4_resize_end(struct super_block *sb);
 
+extern int ext4_block_to_path(struct inode *inode,
+			      ext4_lblk_t i_block,
+			      ext4_lblk_t offsets[4], int *boundary);
 #endif	/* __KERNEL__ */
 
 #endif	/* _EXT4_H */
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..689267a 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -69,9 +69,9 @@ static inline void add_chain(Indirect *p, struct buffer_head *bh, __le32 *v)
  * get there at all.
  */
 
-static int ext4_block_to_path(struct inode *inode,
-			      ext4_lblk_t i_block,
-			      ext4_lblk_t offsets[4], int *boundary)
+int ext4_block_to_path(struct inode *inode,
+		       ext4_lblk_t i_block,
+		       ext4_lblk_t offsets[4], int *boundary)
 {
 	int ptrs = EXT4_ADDR_PER_BLOCK(inode->i_sb);
 	int ptrs_bits = EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..4b36000 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2820,6 +2820,297 @@ static sector_t ext4_bmap(struct address_space *mapping, sector_t block)
 	return generic_block_bmap(mapping, block, ext4_get_block);
 }
 
+#include <linux/bio.h>
+
+struct ext4_readpage_state {
+	struct page		*page;
+	struct inode		*inode;
+	int			offsets[4];
+	int			depth;
+	int			blocks_to_boundary;
+	int			cur_depth;
+	int			cur_block;
+	int			waiting_on_lock;
+	struct buffer_head	*cur_bh;
+	struct bio		*bio;
+	struct wait_bit_queue	wait_bit;
+	struct work_struct	work;
+};
+
+#define dprintk(x...) do { ; } while (0)
+
+static void ext4_readpage_statemachine(struct ext4_readpage_state *state);
+static void ext4_readpage_work_func(struct work_struct *work)
+{
+	struct ext4_readpage_state *state;
+
+	state = container_of(work, struct ext4_readpage_state, work);
+	dprintk("ext4_readpage_work_func(%p): state=%p\n", work, state);
+	ext4_readpage_statemachine(state);
+}
+
+static int ext4_readpage_wait_func(wait_queue_t *wait, unsigned mode, int flags,
+				   void *arg)
+{
+	struct ext4_readpage_state *state = wait->private;
+	struct wait_bit_key *key = arg;
+
+	dprintk("ext4_readpage_wait_func: state=%p\n", state);
+	dprintk("key->flags=%p key->bit_nr=%d, page->flags=%p\n",
+		key->flags, key->bit_nr, &state->page->flags);
+	if (state->wait_bit.key.flags != key->flags ||
+	    state->wait_bit.key.bit_nr != key->bit_nr ||
+	    test_bit(key->bit_nr, key->flags))
+		return 0;
+	dprintk("ext4_readpage_wait_func: page is unlocked/uptodate\n");
+	list_del_init(&wait->task_list);
+	INIT_WORK(&state->work, ext4_readpage_work_func);
+	schedule_work(&state->work);
+	return 1;
+}
+
+static void ext4_readpage_wait_on_bh(struct ext4_readpage_state *state)
+{
+	wait_queue_head_t *wq;
+	unsigned long flags;
+	int ret = 0;
+
+	state->wait_bit.key.flags = &state->cur_bh->b_state;
+	state->wait_bit.key.bit_nr = BH_Lock;
+	state->wait_bit.wait.private = state;
+	state->wait_bit.wait.func = ext4_readpage_wait_func;
+
+	wq = bit_waitqueue(state->wait_bit.key.flags,
+			   state->wait_bit.key.bit_nr);
+	spin_lock_irqsave(&wq->lock, flags);
+	__add_wait_queue(wq, &state->wait_bit.wait);
+	if (!buffer_locked(state->cur_bh)) {
+		dprintk("ext4_readpage_wait_on_bh(%p): buffer not locked\n", state);
+		list_del_init(&state->wait_bit.wait.task_list);
+		ret = 1;
+	}
+	spin_unlock_irqrestore(&wq->lock, flags);
+
+	dprintk("ext4_readpage_wait_on_bh(%p): ret=%d\n", state, ret);
+	if (ret)
+		ext4_readpage_statemachine(state);
+}
+
+static void ext4_readpage_statemachine(struct ext4_readpage_state *state)
+{
+	struct ext4_inode_info *ei = EXT4_I(state->inode);
+	struct buffer_head *bh;
+	int offset;
+	__le32 *blkp;
+	u32 blocknr;
+
+	dprintk("ext4_readpage_statemachine(%p): cur_depth=%d\n",
+		state, state->cur_depth);
+
+	if (state->waiting_on_lock)
+		goto lock_buffer;
+
+	offset = state->offsets[state->cur_depth];
+	if (state->cur_depth == 0)
+		blkp = ei->i_data + offset;
+	else {
+		if (!buffer_uptodate(state->cur_bh)) {
+			dprintk("ext4_readpage_statemachine: "
+				"!buffer_uptodate(%Lu)\n",
+				(unsigned long long)state->cur_bh->b_blocknr);
+			brelse(state->cur_bh);
+			bio_put(state->bio);
+			SetPageError(state->page);
+			unlock_page(state->page);
+			return;	// FIXME: verify error handling is correct
+		}
+		blkp = (__le32 *)state->cur_bh->b_data + offset;
+	}
+
+	blocknr = le32_to_cpu(*blkp);
+	if (state->cur_bh)
+		brelse(state->cur_bh);
+	state->cur_depth++;
+
+	dprintk("state->cur_depth=%d depth=%d offset=%d blocknr=%u\n",
+		state->cur_depth, state->depth, offset, blocknr);
+	if (state->cur_depth == state->depth) {
+		dprintk("submitting bio %p for block %u\n", state->bio, blocknr);
+		state->bio->bi_iter.bi_sector =
+			(sector_t)blocknr << (state->inode->i_blkbits - 9);
+		mpage_bio_submit(READ, state->bio);
+		return;
+	}
+
+	state->cur_bh = sb_getblk(state->inode->i_sb, blocknr);
+	if (!state->cur_bh) {
+		dprintk("sb_getblk(%p, %u) failed\n",
+			state->inode->i_sb, blocknr);
+		dprintk("FAIL!\n");
+		bio_put(state->bio);
+		SetPageError(state->page);
+		unlock_page(state->page);
+		return;	// FIXME - verify error handling
+	}
+
+	dprintk("ext4_readpage_statemachine: cur_bh=%p\n", state->cur_bh);
+
+lock_buffer:
+	state->waiting_on_lock = 0;
+	if (buffer_uptodate(state->cur_bh)) {
+		dprintk("ext4_readpage_statemachine(%p): buffer uptodate\n",
+			state);
+		return ext4_readpage_statemachine(state);
+	}
+
+	dprintk("ext4_readpage_statemachine(%p): locking buffer\n", state);
+
+	if (!trylock_buffer(state->cur_bh)) {
+		state->waiting_on_lock = 1;
+		ext4_readpage_wait_on_bh(state);
+		return;
+	}
+
+	/* We have the buffer locked */
+	if (buffer_uptodate(state->cur_bh)) {
+		dprintk("ext4_readpage_statemachine: buffer uptodate after lock\n");
+		unlock_buffer(state->cur_bh);
+		return ext4_readpage_statemachine(state);
+	}
+
+	bh = state->cur_bh;
+	get_bh(bh);
+	bh->b_end_io = end_buffer_read_sync;
+	ext4_readpage_wait_on_bh(state);
+	submit_bh(READ | REQ_META /*| REQ_PRIO*/, bh);
+}
+
+static unsigned ext4_count_meta(unsigned blocks, unsigned ind_shift)
+{
+	const unsigned dind_shift = ind_shift * 2;
+	unsigned blocks_per_ind = 1U << ind_shift;
+	unsigned blocks_per_dind = 1U << dind_shift;
+	unsigned nr_meta = 0;
+
+	dprintk("ext4_count_meta(%Ld, %u)\n", (long long)blocks, blocks_per_ind);
+
+	/* direct entry? */
+	if (blocks <= EXT4_NDIR_BLOCKS)
+		return 0;
+
+	/* This has to be an indirect entry */
+	nr_meta ++;			// The indirect block
+	blocks -= EXT4_NDIR_BLOCKS;
+	if (blocks <= blocks_per_ind)
+		return 1;
+	blocks -= blocks_per_ind;
+
+	/* Now we have a double indirect entry */
+	nr_meta ++;			// The double indirect block
+	if (blocks <= blocks_per_dind) {
+		nr_meta += (blocks + blocks_per_ind - 1) >> ind_shift;
+		return nr_meta;
+	}
+
+	nr_meta += blocks_per_ind;	// The indirect blocks in the dind
+	blocks -= blocks_per_dind;
+
+	nr_meta ++;			// The triple indirect block
+
+	// The double indirect in the tind
+	nr_meta += (blocks + blocks_per_dind - 1) >> dind_shift;
+
+	// The indirect blocks pointed to by the dinds by the tind
+	nr_meta += (blocks + blocks_per_ind - 1) >> ind_shift;
+
+	return nr_meta;
+}
+
+static int ext4_async_readpage(struct file *file, struct page *page)
+{
+	struct ext4_readpage_state *state = page_address(page);
+	struct inode *inode = page->mapping->host;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int blocks_to_boundary = 0;
+	int offsets[4] = { 0, };
+	sector_t iblock;
+	__le32 indirect;
+	sector_t blk;
+	int depth;
+
+	/* Attempt metadata readahead if this is a read of the first page of
+	 * the file.  We assume the metadata is contiguously laid out starting
+	 * at the first indirect block of the file.
+	 */
+	indirect = ei->i_data[EXT4_IND_BLOCK];
+	blk = le32_to_cpu(indirect);
+	dprintk("ext4_async_readpage: index=%Lu, blk=%u\n",
+		(unsigned long long)page->index, (unsigned)blk);
+	if ((page->index == 0) && blk) {
+		loff_t i_size = i_size_read(inode);
+		unsigned i, nr_meta;
+		i_size += PAGE_SIZE - 1;
+		nr_meta = ext4_count_meta(i_size >> inode->i_blkbits,
+					  inode->i_blkbits - 2);
+		dprintk("readpage(0): blk[IND]=%u nr_meta=%u blk[0]=%u i_size=%Ld\n",
+			(unsigned)blk, nr_meta, le32_to_cpu(ei->i_data[0]),
+			(long long)i_size);
+		for (i=0; i < nr_meta; i++) {
+			sb_breadahead(inode->i_sb, blk + i);
+		}
+	}
+
+	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+	depth = ext4_block_to_path(inode, iblock, offsets, &blocks_to_boundary);
+	if (page_has_buffers(page)) {
+		int nr_uptodate = 0, nr = 0;
+		struct buffer_head *bh, *head;
+
+		head = bh = page_buffers(page);
+		do {
+			nr++;
+			nr_uptodate += !!buffer_uptodate(bh);
+			bh = bh->b_this_page;
+		} while (bh != head) ;
+		dprintk("inode(%lu) index=%Lu has nr=%d nr_up=%d\n",
+			inode->i_ino, (unsigned long long)page->index,
+			nr, nr_uptodate);
+		// A previous write may have already marked a buffer_head
+		// covering the page as uptodate.  Reuse mpage_readpage() to
+		// handle this case.
+		if (nr_uptodate > 0) {
+			return mpage_readpage(page, ext4_get_block);
+		}
+	}
+
+	if (depth == 1)
+		return mpage_readpage(page, ext4_get_block);
+
+	/* now set things up for reading the page */
+	memset(state, 0, sizeof(*state));
+	state->page = page;
+	state->inode = inode;
+	state->depth = depth;
+	state->blocks_to_boundary = blocks_to_boundary;
+	memcpy(state->offsets, offsets, sizeof(offsets));
+
+	dprintk("inode[%lu] page[%Lu]: depth=%d, offsets=%d,%d,%d,%d\n",
+		inode->i_ino, (unsigned long long)page->index, state->depth,
+		state->offsets[0], state->offsets[1], state->offsets[2],
+		state->offsets[3]);
+
+	state->bio = mpage_alloc(inode->i_sb->s_bdev, 0, 1, GFP_NOFS);
+	if (!state->bio) {
+		dprintk("ext4_async_readpage(%p, %Lu): mpage_alloc failed\n",
+			file, (unsigned long long)iblock);
+		unlock_page(page);
+		return -ENOMEM;
+	}
+	bio_add_page(state->bio, page, PAGE_SIZE, 0);
+	ext4_readpage_statemachine(state);
+	return 0;
+}
+
 static int ext4_readpage(struct file *file, struct page *page)
 {
 	int ret = -EAGAIN;
@@ -2827,6 +3118,9 @@ static int ext4_readpage(struct file *file, struct page *page)
 
 	trace_ext4_readpage(page);
 
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+		return ext4_async_readpage(file, page);
+
 	if (ext4_has_inline_data(inode))
 		ret = ext4_readpage_inline(inode, page);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index 5f9ed62..0fca557 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -54,14 +54,14 @@ static void mpage_end_io(struct bio *bio, int err)
 	bio_put(bio);
 }
 
-static struct bio *mpage_bio_submit(int rw, struct bio *bio)
+struct bio *mpage_bio_submit(int rw, struct bio *bio)
 {
 	bio->bi_end_io = mpage_end_io;
 	submit_bio(rw, bio);
 	return NULL;
 }
 
-static struct bio *
+struct bio *
 mpage_alloc(struct block_device *bdev,
 		sector_t first_sector, int nr_vecs,
 		gfp_t gfp_flags)
diff --git a/include/linux/mpage.h b/include/linux/mpage.h
index 068a0c9..3bf909b 100644
--- a/include/linux/mpage.h
+++ b/include/linux/mpage.h
@@ -13,6 +13,9 @@
 
 struct writeback_control;
 
+struct bio *mpage_bio_submit(int rw, struct bio *bio);
+struct bio *mpage_alloc(struct block_device *bdev, sector_t first_sector,
+			int nr_vecs, gfp_t gfp_flags);
 int mpage_readpages(struct address_space *mapping, struct list_head *pages,
 				unsigned nr_pages, get_block_t get_block);
 int mpage_readpage(struct page *page, get_block_t get_block);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3df8c7d..afd1f20 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -495,6 +495,8 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
 	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
 }
 
+wait_queue_head_t *page_waitqueue(struct page *page);
+
 /*
  * This is exported only for wait_on_page_locked/wait_on_page_writeback.
  * Never use this directly!

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-17 13:56               ` Benjamin LaHaise
@ 2014-09-17 15:33                 ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 15:33 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Theodore Ts'o, Dave Chinner, Jeff Moyer, LKML,
	Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

On Wed, Sep 17, 2014 at 9:56 AM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> On Wed, Sep 17, 2014 at 09:47:02AM -0400, Theodore Ts'o wrote:
> ...
>> % git version
>> git version 2.1.0
>>
>> Perhaps you and other people are using your own scripts, and not using
>> git send-email?
>
> That would be because none of my systems have git 2.1.0 on them.  Fedora
> (and EPEL) appear to still be back on git 1.9.3 which does not have git
> send-email.  Until that command is more widely propagated, I expect we'll
> see people making this mistake every once in a while.
>
>                 -ben
> --
> "Thought is the essence of where you are now."

My workflow has been to use git format-patch (1.7.9.5 was shipped with
my distro), edit the cover letter, then use mutt to send the
generated emails. Before that I used git apply to import the patches
that Christoph sent me. I thought about it too, but hand editing the
email generated by format-patch to essentially have me take credit
for this sounded like a shady thing to do. The updated version of
git's (2.1.0) format-patch doesn't change the From email address field
either.

The submitting patches document doesn't really describe what to do if
you take patches from / collaborate with somebody else, or how to credit
the original author. It only deals with the case of a subsystem
maintainer editing the submitter's code to fix it up.

All that aside, if somebody has a clear workflow that ensures this
doesn't happen I'm more than willing to follow it.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-15 20:20   ` Milosz Tanski
@ 2014-09-17 15:43     ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2014-09-17 15:43 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer

On Mon, Sep 15, 2014 at 04:20:37PM -0400, Milosz Tanski wrote:
> New syscalls with an extra flag argument. For now all flags except for 0 are
> not supported.

This may fall in the category of bike-shedding, and so I apologize in
advance, but I wonder if we really need readv2 and writev2 as new
syscalls.  What if we just added preadv2 and pwritev2, and implemented
readv2 and writev2 as libc wrappers which had the vector
allocated as an automatic stack variable?  Is the extra user memory
access really going to be that noticeable?
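
The libc-wrapper idea can be sketched in userspace. This is only a minimal illustration, assuming the existing readv() as a stand-in for the proposed syscalls; read_via_readv() is a hypothetical name and is not part of the patch series.

```c
#include <assert.h>	/* only needed by callers that check results */
#include <string.h>	/* likewise, for memcmp() in a caller */
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: a plain buffer read expressed through the
 * vectored call.  The one-element iovec is an automatic (stack)
 * variable, so nothing is allocated on the heap; the only extra cost
 * is the kernel copying that single iovec in from user memory.
 */
static ssize_t read_via_readv(int fd, void *buf, size_t len)
{
	struct iovec iov = {
		.iov_base = buf,
		.iov_len  = len,
	};

	return readv(fd, &iov, 1);
}
```

A caller would use read_via_readv(fd, buf, len) exactly like read(fd, buf, len); the return value and errno behaviour are those of readv().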

Cheers,

	       	      	       	       - Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-17 15:33                 ` Milosz Tanski
@ 2014-09-17 15:49                   ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2014-09-17 15:49 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Benjamin LaHaise, Dave Chinner, Jeff Moyer, LKML,
	Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

On Wed, Sep 17, 2014 at 11:33:35AM -0400, Milosz Tanski wrote:
> 
> My workflow has been to use git format-patch (1.7.9.5 was shipped with
> my distro), edit the cover letter, then use mutt to send the
> generated emails. Before that I used git apply to import the patches
> that Christoph sent me. I thought about it too, but hand editing the
> email generated by format-patch to essentially have me take credit
> for this sounded like a shady thing to do. The updated version of
> git's (2.1.0) format-patch doesn't change the From email address field
> either.

That's because it gets handled in git send-email.  The resulting
e-mail gets sent like this:

From: Subsystem Maintainer <tytso@mit.edu>
To: LKML list <linux-ext4@vger.kernel.list>
Subject: [PATCH] ext4: foobie blart

From: Original Author <wenqing.lz@taobao.com>

Long commit description which gets placed as the third line of the
commit description, with the subject line as the first line of the
commit description, and the 2nd line being blank.

And then git am handles this correctly, attributing the patch
authorship to Original Author, with the git committer set to the
Subsystem Maintainer who ran the "git am" command.

> The submitting patches document doesn't really describe what to do if
> you take patches from / collaborate with somebody else, or how to credit
> the original author. It only deals with the case of a subsystem
> maintainer editing the submitter's code to fix it up.

The short version is, use a non-prehistoric version of git, and use
"git format-patch" and "git send-email", and be happy.  :-)

It's not that hard to build your own version of git, if you are forced
to use some enterprise/LTS/Debian stable distro that can't be bothered
to update git for you.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances
  2014-09-17 13:56               ` Benjamin LaHaise
@ 2014-09-17 15:52                 ` Zach Brown
  -1 siblings, 0 replies; 167+ messages in thread
From: Zach Brown @ 2014-09-17 15:52 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Theodore Ts'o, Dave Chinner, Milosz Tanski, Jeff Moyer, LKML,
	Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo

On Wed, Sep 17, 2014 at 09:56:11AM -0400, Benjamin LaHaise wrote:
> On Wed, Sep 17, 2014 at 09:47:02AM -0400, Theodore Ts'o wrote:
> ...
> > % git version
> > git version 2.1.0
> > 
> > Perhaps you and other people are using your own scripts, and not using
> > git send-email?
> 
> That would be because none of my systems have git 2.1.0 on them.  Fedora 
> (and EPEL) appear to still be back on git 1.9.3 which does not have git 
> send-email.

Depends on the distro you're on, I suppose.  Modern Fedora certainly
has packaged git send-email for a while.

  https://apps.fedoraproject.org/packages/git-email

- z

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-17 15:43     ` Theodore Ts'o
@ 2014-09-17 16:05       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 16:05 UTC (permalink / raw)
  To: Theodore Ts'o, Milosz Tanski, LKML, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer

Theodore,

I might be misunderstanding something, but... I already omitted
read2 and write2, which can be implemented in userspace by libc (as you
pointed out). In the case of readv vs. preadv there's an extra
positional argument (file offset) and the preadv version doesn't change
the file position. I didn't want to overload the meaning of preadv2 to
take a special negative offset value that uses the current file
position but also changes the file position.

As a counterpoint to my viewpoint, Christoph also brought up the same
thing as you did.

- Milosz

On Wed, Sep 17, 2014 at 11:43 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Mon, Sep 15, 2014 at 04:20:37PM -0400, Milosz Tanski wrote:
>> New syscalls with an extra flag argument. For now all flags except for 0 are
>> not supported.
>
> This may fall in the category of bike-shedding, and so I apologize in
> advance, but I wonder if we really need readv2 and writev2 as new
> syscalls.  What if we just added preadv2 and pwritev2, and implemented
> readv2 and writev2 as libc wrappers which had the vector
> allocated as an automatic stack variable?  Is the extra user memory
> access really going to be that noticeable?
>
> Cheers,
>
>                                        - Ted



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-17 16:05       ` Milosz Tanski
@ 2014-09-17 16:59         ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2014-09-17 16:59 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer

On Wed, Sep 17, 2014 at 12:05:23PM -0400, Milosz Tanski wrote:
> Theodore,
> 
> I might be misunderstanding something, but... I already omitted
> read2 and write2, which can be implemented in userspace by libc (as you
> pointed out). In the case of readv vs. preadv there's an extra
> positional argument (file offset) and the preadv version doesn't change
> the file position. I didn't want to overload the meaning of preadv2 to
> take a special negative offset value that uses the current file
> position but also changes the file position.

off_t has to be signed, so having a magic negative value doesn't
bother me that much.  Or you could use a flag bitvalue which means to
use the fd's offset and to ignore the positional value.  (More
bike-shedding :-)
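
The magic-value convention can be emulated in userspace to see what a caller would deal with. This is a rough sketch, assuming the existing readv()/preadv() as stand-ins for the proposed syscall; readv_at() and OFFSET_CURRENT are hypothetical names chosen for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>	/* only needed by callers that check results */
#include <stdlib.h>	/* likewise, for mkstemp() in a caller */
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define OFFSET_CURRENT ((off_t)-1)	/* hypothetical magic value */

/*
 * One entry point covering both the positional and the
 * non-positional vectored read.
 */
static ssize_t readv_at(int fd, const struct iovec *iov, int iovcnt,
			off_t offset)
{
	if (offset == OFFSET_CURRENT)
		return readv(fd, iov, iovcnt);	/* uses and advances the fd's position */

	return preadv(fd, iov, iovcnt, offset);	/* leaves the fd's position untouched */
}
```

The flag-bit variant would look the same, except that the branch tests a bit in a flags argument and the offset value is ignored when that bit is set.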

The main reason why I mention it is we have a huge number of
read/write syscalls already, and if we add yet another to support
scatter-gather lists on the memory side, we'll be adding another
factor of two more read/write system calls.  So the suggestion was one
of trying (probably fruitlessly) to stem the exponential
increase in read/write system calls.  :-)

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-17 16:59         ` Theodore Ts'o
@ 2014-09-17 17:24           ` Zach Brown
  -1 siblings, 0 replies; 167+ messages in thread
From: Zach Brown @ 2014-09-17 17:24 UTC (permalink / raw)
  To: Theodore Ts'o, Milosz Tanski, LKML, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer

On Wed, Sep 17, 2014 at 12:59:30PM -0400, Theodore Ts'o wrote:
> On Wed, Sep 17, 2014 at 12:05:23PM -0400, Milosz Tanski wrote:
> > Theodore,
> > 
> > I might be misunderstanding something, but... I already omitted
> > read2 and write2, which can be implemented in userspace by libc (as you
> > pointed out). In the case of readv vs. preadv there's an extra
> > positional argument (file offset) and the preadv version doesn't change
> > the file position. I didn't want to overload the meaning of preadv2 to
> > take a special negative offset value that uses the current file
> > position but also changes the file position.
> 
> off_t has to be signed, so having a magic negative value doesn't
> bother me that much.  Or you could use a flag bitvalue which means to
> use the fd's offset and to ignore the positional value.  (More
> bike-shedding :-)

splice has already set the precedent for an optionally specified offset
that falls back to the file's position.

static long do_splice(struct file *in, loff_t __user *off_in,
                      struct file *out, loff_t __user *off_out,
                      size_t len, unsigned int flags)
{
...
                if (off_out) {
                        if (copy_from_user(&offset, off_out, sizeof(loff_t)))
                                return -EFAULT;
                } else {
                        offset = out->f_pos;
                }

...
                if (!off_out)
                        out->f_pos = offset;
                else if (copy_to_user(off_out, &offset, sizeof(loff_t)))
                        ret = -EFAULT;

It's nice and simple and lets you update the user's offset.
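
The same optional-offset convention can be mimicked in userspace. This is a minimal sketch under the assumption that read()/pread() stand in for the proposed syscall; read_opt_off() is a hypothetical helper, not an existing API.

```c
#define _GNU_SOURCE
#include <assert.h>	/* only needed by callers that check results */
#include <stdlib.h>	/* likewise, for mkstemp() in a caller */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper mirroring the do_splice() convention:
 *   off == NULL -> use the fd's position and let the kernel advance it
 *   off != NULL -> read at *off, leave the fd's position alone, and
 *                  write the advanced offset back to the caller
 */
static ssize_t read_opt_off(int fd, void *buf, size_t len, off_t *off)
{
	ssize_t ret;

	if (!off)
		return read(fd, buf, len);

	ret = pread(fd, buf, len, *off);
	if (ret > 0)
		*off += ret;
	return ret;
}
```

Compared with a magic negative offset, the pointer form needs no reserved value and hands the updated offset back to the caller, at the cost of one more user memory access.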

> So the suggestion was one of trying (probably fruitlessly)
> to stem the exponential increase in read/write system calls.  :-)

I support this windmill tilting :). 

- z

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-17 22:20   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

This patchset introduces the ability to perform a non-blocking read from
regular files in buffered IO mode. This works only for those filesystems
that have data in the page cache.

It does this by introducing new syscalls readv2/writev2 and
preadv2/pwritev2. These new syscalls behave like the network sendmsg and
recvmsg syscalls, which accept an extra flags argument (O_NONBLOCK).

It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
to perform buffered IO operations. The work is submitted from another thread
that performs network IO and epoll, or from other threads that perform CPU work.
This leads to increased latency for processing, esp. in the case of data that's
already cached in the page cache.

With the new interface the applications will now be able to fetch the data in
their network / CPU bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" requests by avoiding unnecessary queuing and having to swap out current
tasks in IO bound work threads.
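
To make the intended usage pattern concrete, here is a rough userspace sketch
of "try the page cache first, fall back to the slow path". It is an
illustration under assumptions, not the interface from this RFC: it uses the
preadv2() wrapper and RWF_NOWAIT flag that current glibc/Linux provide as a
stand-in, and the function names (try_fast_read, slow_read,
read_fast_then_slow, fast_read_demo) are invented for the example. The
blocking pread() stands in for handing the request to a threadpool.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Try to serve the read from the page cache without blocking.
 * Returns the byte count on a hit (possibly short), or -1 with errno set
 * (typically EAGAIN) when the data is not immediately available.
 */
static ssize_t try_fast_read(int fd, void *buf, size_t len, off_t off)
{
#ifdef RWF_NOWAIT
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return preadv2(fd, &iov, 1, off, RWF_NOWAIT);
#else
	(void)fd; (void)buf; (void)len; (void)off;
	errno = ENOSYS;	/* no non-blocking read on this libc: always miss */
	return -1;
#endif
}

/* Stand-in for "queue it to the threadpool": a plain blocking pread(). */
static ssize_t slow_read(int fd, void *buf, size_t len, off_t off)
{
	return pread(fd, buf, len, off);
}

/* Fast path first; on any failure let the slow path produce the result. */
static ssize_t read_fast_then_slow(int fd, void *buf, size_t len, off_t off,
				   int *was_fast)
{
	ssize_t n;

	assert(was_fast != NULL);
	n = try_fast_read(fd, buf, len, off);
	*was_fast = n >= 0;
	return *was_fast ? n : slow_read(fd, buf, len, off);
}

/* Self-check: write a small temp file and read it back. Returns 0 if OK. */
static int fast_read_demo(void)
{
	char path[] = "/tmp/fast-read-XXXXXX";
	char buf[16] = { 0 };
	int was_fast = 0, ok;
	ssize_t n;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	ok = write(fd, "cached bytes", 12) == 12;
	n = ok ? read_fast_then_slow(fd, buf, 12, 0, &was_fast) : -1;
	close(fd);
	return (n == 12 && memcmp(buf, "cached bytes", 12) == 0) ? 0 : -1;
}
```

A real caller would also have to deal with short fast reads (only part of the
range cached) by queuing the remainder; the sketch leaves that out.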

Version 2 highlights:
 - Put the flags argument into kiocb (less noise), per Al Viro
 - O_DIRECT checking early in the process, per Jeff Moyer
 - Resolved duplicate (c&p) code in syscall code, per Jeff
 - Included perf data in thread cover letter, per Jeff
 - Created a new flag (not O_NONBLOCK) for readv2, per Jeff

Some perf data generated using fio, comparing the posix aio engine to a version
of the posix aio engine that attempts to perform "fast" reads before
submitting the operations to the queue. This workload runs on an ext4 partition
on raid0 (test / build rig). It simulates our database access pattern using
16kb read accesses. Our database uses a home-spun posix aio like queue (samba
does the same thing).

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
    bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
                 stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
         mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
    bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
                 stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
	 mint=600020msec, maxt=600178msec

Interpreting the results, you can see total bandwidth stays the same but overall
request latency is decreased in the f1 (random, mostly cached) and f3
(sequential) workloads. There is a slight bump in latency for f2 since it's
random data that's unlikely to be cached but we're always trying "fast read".

In our application we have started keeping track of "fast read" hits/misses,
and for files / requests that have a low hit ratio we don't do "fast reads",
mostly getting rid of the extra latency in the uncached cases.
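
As a sketch of what such hit/miss tracking could look like (this is not our
application's code; the struct, function names and both thresholds are
hypothetical values picked for the example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FR_MIN_SAMPLES	64	/* hypothetical: attempts before judging */
#define FR_MIN_HIT_PCT	25	/* hypothetical: below this, skip fast reads */

/* Per-file (or per-request-class) counters for "fast read" attempts. */
struct fast_read_stats {
	uint64_t attempts;
	uint64_t hits;
};

/* Record the outcome of one fast read attempt. */
static void fr_record(struct fast_read_stats *s, bool hit)
{
	s->attempts++;
	if (hit)
		s->hits++;
	assert(s->hits <= s->attempts);
}

/*
 * Decide whether the next request should try the fast path at all.
 * Until enough samples exist we always try; afterwards we only try while
 * the observed hit ratio is at least FR_MIN_HIT_PCT percent.
 */
static bool fr_should_try(const struct fast_read_stats *s)
{
	if (s->attempts < FR_MIN_SAMPLES)
		return true;
	return s->hits * 100 >= s->attempts * FR_MIN_HIT_PCT;
}
```

Note that with plain counters a file that gets switched off never records
another attempt, so something like periodically resetting or decaying the
counters would be needed to notice that its data has become cached again.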

I've performed other benchmarks and I have not observed any perf regressions in
any of the normal (old) code paths.

I have co-developed these changes with Christoph Hellwig.

Christoph Hellwig (1):
  Check for O_NONBLOCK in all read_iter instances

Milosz Tanski (4):
  Prepare for adding a new readv/writev with user flags.
  Define new syscalls readv2,preadv2,writev2,pwritev2
  Export new vector IO (with flags) to userland
  O_NONBLOCK flag for readv2/preadv2

 arch/x86/syscalls/syscall_32.tbl  |   4 ++
 arch/x86/syscalls/syscall_64.tbl  |   4 ++
 drivers/target/target_core_file.c |   6 +-
 fs/cifs/file.c                    |   6 ++
 fs/nfsd/vfs.c                     |   4 +-
 fs/ocfs2/file.c                   |   6 ++
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 119 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/xfs/xfs_file.c                 |   4 ++
 include/linux/aio.h               |   2 +
 include/linux/fs.h                |   7 ++-
 include/linux/syscalls.h          |  12 ++++
 include/uapi/asm-generic/unistd.h |  10 +++-
 mm/filemap.c                      |  23 +++++++-
 mm/shmem.c                        |   4 ++
 16 files changed, 178 insertions(+), 38 deletions(-)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [RFC v2 1/5] Prepare for adding a new readv/writev with user flags.
  2014-09-17 22:20   ` Milosz Tanski
@ 2014-09-17 22:20     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

Plumb the flags argument through the vfs code so it can be passed down to the
__generic_file_(read/write)_iter functions that do the actual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 drivers/target/target_core_file.c |  6 +++---
 fs/nfsd/vfs.c                     |  4 ++--
 fs/read_write.c                   | 28 ++++++++++++++++------------
 fs/splice.c                       |  2 +-
 include/linux/aio.h               |  2 ++
 include/linux/fs.h                |  4 ++--
 mm/filemap.c                      |  2 +-
 7 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f501a9b..db7a31d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -855,7 +855,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -943,7 +943,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 009d854..9f6d13d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -651,7 +651,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -660,6 +661,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
 	kiocb.ki_nbytes = len;
+	kiocb.ki_rwflags = flags;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
 	ret = fn(&kiocb, &iter);
@@ -798,7 +800,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -832,7 +835,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -855,27 +858,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -888,7 +891,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -908,7 +911,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -940,7 +943,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -964,7 +967,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1012,7 +1015,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1111,6 +1114,7 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 	return __compat_sys_preadv64(fd, vec, vlen, pos);
 }
 
+
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
 			    unsigned long vlen, loff_t *pos)
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..9591b9f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	int			ki_rwflags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..e9bea52 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1556,9 +1556,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..6e3ba07 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1456,7 +1456,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v2 2/5] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-17 22:20   ` Milosz Tanski
@ 2014-09-17 22:20     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

New syscalls with an extra flags argument. For now no flags are supported: any
non-zero flags value is rejected with -EINVAL.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c                   | 80 +++++++++++++++++++++++++++++++++------
 include/linux/syscalls.h          | 12 ++++++
 include/uapi/asm-generic/unistd.h | 10 ++++-
 mm/filemap.c                      |  2 +-
 4 files changed, 90 insertions(+), 14 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9f6d13d..3db2e87 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -864,6 +864,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -877,21 +879,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
 
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+			unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -903,15 +907,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -929,8 +933,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
 }
 
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, unsigned long pos_l,
+			 unsigned long pos_h, int flags)
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
@@ -943,7 +948,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -953,8 +958,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+			  unsigned long vlen, unsigned long pos_l,
+			  unsigned long pos_h, int flags)
 {
 	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
@@ -967,7 +973,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -977,6 +983,56 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE4(readv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, int, flags)
+{
+	return do_readv(fd, vec, vlen, flags);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE4(writev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, int, flags)
+{
+	return do_writev(fd, vec, vlen, flags);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	return do_preadv(fd, vec,  vlen, pos_l, pos_h, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	return do_preadv(fd, vec,  vlen, pos_l, pos_h, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	return do_pwritev(fd, vec, vlen, pos_l, pos_h, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	return do_pwritev(fd, vec, vlen, pos_l, pos_h, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..0c49ae4 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -559,19 +559,31 @@ asmlinkage long sys_readahead(int fd, loff_t offset, size_t count);
 asmlinkage long sys_readv(unsigned long fd,
 			  const struct iovec __user *vec,
 			  unsigned long vlen);
+asmlinkage long sys_readv2(unsigned long fd,
+			  const struct iovec __user *vec,
+			  unsigned long vlen, int flags);
 asmlinkage long sys_write(unsigned int fd, const char __user *buf,
 			  size_t count);
 asmlinkage long sys_writev(unsigned long fd,
 			   const struct iovec __user *vec,
 			   unsigned long vlen);
+asmlinkage long sys_writev2(unsigned long fd,
+			    const struct iovec __user *vec,
+			    unsigned long vlen, int flags);
 asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
 			    size_t count, loff_t pos);
 asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc..75ad687 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,14 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_readv2 280
+__SC_COMP(__NR_readv2, sys_readv2)
+#define __NR_writev2 281
+__SC_COMP(__NR_writev2, sys_writev2)
+#define __NR_preadv2 282
+__SC_COMP(__NR_preadv2, sys_preadv2)
+#define __NR_pwritev2 283
+__SC_COMP(__NR_pwritev2, sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -707,7 +715,7 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 284
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6e3ba07..e0919ba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1726,7 +1726,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
 out:
 	return retval;
 }
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v2 3/5] Export new vector IO (with flags) to userland
  2014-09-17 22:20   ` Milosz Tanski
@ 2014-09-17 22:20     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

This only wires up x86_64 and x86; other architectures will be added later.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 arch/x86/syscalls/syscall_32.tbl | 4 ++++
 arch/x86/syscalls/syscall_64.tbl | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b781..ed85dca 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,7 @@
 354	i386	seccomp			sys_seccomp
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
+357	i386	readv2			sys_readv2
+358	i386	writev2			sys_writev2
+359	i386	preadv2			sys_preadv2
+360	i386	pwritev2		sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922..76d9f60 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,10 @@
 318	common	getrandom		sys_getrandom
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
+321	64	readv2			sys_readv2
+322	64	writev2			sys_writev2
+323	64	preadv2			sys_preadv2
+324	64	pwritev2		sys_pwritev2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v2 4/5] O_NONBLOCK flag for readv2/preadv2
  2014-09-17 22:20   ` Milosz Tanski
@ 2014-09-17 22:20     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

Filesystems that use generic_file_read_iter will now be allowed to perform
non-blocking reads. Data is only read if it is already in the page cache and
there is no page error (which would force a re-read); otherwise the read
returns -EAGAIN.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c    |  4 +++-
 include/linux/fs.h |  3 +++
 mm/filemap.c       | 19 +++++++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)
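
A sketch of how a caller might act on the result of a read issued with
RWF_NONBLOCK, following the pattern in the cover letter: serve the request
inline when the data is cached and hand it to a thread pool otherwise. It
assumes a userspace wrapper with the usual -1/errno convention. The enum and
classify_nonblock_read() are hypothetical names, not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* What the caller should do after a non-blocking read attempt. */
enum read_outcome {
	READ_DONE,	/* data came from the page cache */
	READ_DEFER,	/* would block: hand the request to the thread pool */
	READ_ERROR	/* genuine failure: report it */
};

/*
 * Classify the result of a read issued with RWF_NONBLOCK.
 * ret is the return value seen in userspace (-1 on failure) and
 * err is the errno value captured right after the call.
 */
static enum read_outcome classify_nonblock_read(ssize_t ret, int err)
{
	if (ret >= 0)
		return READ_DONE;
	if (err == EAGAIN)
		return READ_DEFER;
	return READ_ERROR;
}
```

A caller would still need to handle short counts itself. Per the vfs_readv()
hunk in this patch, a file opened with O_DIRECT always lands in the
READ_DEFER case when RWF_NONBLOCK is set.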

diff --git a/fs/read_write.c b/fs/read_write.c
index 3db2e87..29b5823 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -864,8 +864,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
 		return -EINVAL;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+		return -EAGAIN;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e9bea52..c9c6ac5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1477,6 +1477,9 @@ struct block_device_operations;
 #define HAVE_COMPAT_IOCTL 1
 #define HAVE_UNLOCKED_IOCTL 1
 
+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK O_NONBLOCK
+
 struct iov_iter;
 
 struct file_operations {
diff --git a/mm/filemap.c b/mm/filemap.c
index e0919ba..6b7aba8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1483,7 +1483,10 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 		cond_resched();
 find_page:
 		page = find_get_page(mapping, index);
+
 		if (!page) {
+			if (flags & RWF_NONBLOCK)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1575,6 +1578,11 @@ page_ok:
 		continue;
 
 page_not_up_to_date:
+		if (flags & RWF_NONBLOCK) {
+			page_cache_release(page);
+			goto would_block;
+		}
+
 		/* Get exclusive access to the page ... */
 		error = lock_page_killable(page);
 		if (unlikely(error))
@@ -1594,6 +1602,12 @@ page_not_up_to_date_locked:
 			goto page_ok;
 		}
 
+		if (flags & RWF_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto would_block;
+		}
+
 readpage:
 		/*
 		 * A previous I/O error may have been due to temporary
@@ -1664,6 +1678,8 @@ no_cached_page:
 		goto readpage;
 	}
 
+would_block:
+	error = -EAGAIN;
 out:
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1697,6 +1713,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		size_t count = iov_iter_count(iter);
 		loff_t size;
 
+		if (iocb->ki_rwflags & RWF_NONBLOCK)
+			return -EAGAIN;
+
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v2 5/5] Check for O_NONBLOCK in all read_iter instances
  2014-09-17 22:20   ` Milosz Tanski
@ 2014-09-17 22:20     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

From: Christoph Hellwig <hch@lst.de>

Go through the filesystem read paths and return -EAGAIN wherever an operation
(such as metadata access) would block.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/cifs/file.c    |  6 ++++++
 fs/ocfs2/file.c   |  6 ++++++
 fs/pipe.c         |  3 ++-
 fs/read_write.c   | 17 +++++++++++------
 fs/xfs/xfs_file.c |  4 ++++
 mm/shmem.c        |  4 ++++
 6 files changed, 33 insertions(+), 7 deletions(-)
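
A sketch of the dispatch decision that the fs/read_write.c hunk in this patch
gives do_readv_writev(). It models only which branch is taken and calls no
kernel code. The enum labels and pick_rw_path() are illustrative names. The
boolean parameters stand in for whether iter_fn and fnv are set, whether type
is READ, and whether RWF_NONBLOCK is in flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative labels for the branches of do_readv_writev(). */
enum rw_path {
	PATH_ITER,	/* iter_fn is set: flags are passed through */
	PATH_SYNC_VEC,	/* fnv is set: do_sync_readv_writev() */
	PATH_LOOP,	/* neither: do_loop_readv_writev() with fn */
	PATH_EAGAIN	/* non-blocking read without an iter method */
};

/* Mirror the branch order of the patched if/else chain. */
static enum rw_path pick_rw_path(bool has_iter_fn, bool has_fnv,
				 bool is_read, bool nonblock)
{
	if (has_iter_fn)
		return PATH_ITER;
	if (is_read && nonblock)
		return PATH_EAGAIN;
	return has_fnv ? PATH_SYNC_VEC : PATH_LOOP;
}
```

The point of the ordering is that only the iter path receives the flags, so
it is the only path that can honor RWF_NONBLOCK. Every other read path fails
early with -EAGAIN rather than risk blocking.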

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 7c018a1..e7169ba 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to);
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2930e23..d96f60d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2473,6 +2473,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (iocb->ki_rwflags & RWF_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index 29b5823..a3efa95 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -833,14 +833,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de5368c..cf61271 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -246,6 +246,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0e5fb22..ca2cae2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v2 5/5] Check for O_NONBLOCK in all read_iter instances
@ 2014-09-17 22:20     ` Milosz Tanski
  0 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-17 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

From: Christoph Hellwig <hch@lst.de>

Go through the filesystem read paths and return -EAGAIN wherever an operation
(such as metadata access) would block.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/cifs/file.c    |  6 ++++++
 fs/ocfs2/file.c   |  6 ++++++
 fs/pipe.c         |  3 ++-
 fs/read_write.c   | 17 +++++++++++------
 fs/xfs/xfs_file.c |  4 ++++
 mm/shmem.c        |  4 ++++
 6 files changed, 33 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 7c018a1..e7169ba 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to);
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2930e23..d96f60d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2473,6 +2473,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (iocb->ki_rwflags & RWF_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index 29b5823..a3efa95 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -833,14 +833,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de5368c..cf61271 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -246,6 +246,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0e5fb22..ca2cae2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [RFC v2 2/5] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-17 22:20     ` Milosz Tanski
@ 2014-09-18 18:48       ` Darrick J. Wong
  -1 siblings, 0 replies; 167+ messages in thread
From: Darrick J. Wong @ 2014-09-18 18:48 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

On Wed, Sep 17, 2014 at 10:20:47PM +0000, Milosz Tanski wrote:
> New syscalls with an extra flag argument. For now all flags except for 0 are
> not supported.
> 
> Signed-off-by: Milosz Tanski <milosz@adfin.com>
> ---
>  fs/read_write.c                   | 80 +++++++++++++++++++++++++++++++++------
>  include/linux/syscalls.h          | 12 ++++++
>  include/uapi/asm-generic/unistd.h | 10 ++++-
>  mm/filemap.c                      |  2 +-
>  4 files changed, 90 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 9f6d13d..3db2e87 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -864,6 +864,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
>  		return -EBADF;
>  	if (!(file->f_mode & FMODE_CAN_READ))
>  		return -EINVAL;
> +	if (flags & ~0)
> +		return -EINVAL;
>  
>  	return do_readv_writev(READ, file, vec, vlen, pos, flags);
>  }
> @@ -877,21 +879,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
>  		return -EBADF;
>  	if (!(file->f_mode & FMODE_CAN_WRITE))
>  		return -EINVAL;
> +	if (flags & ~0)
> +		return -EINVAL;
>  
>  	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
>  }
>  
>  EXPORT_SYMBOL(vfs_writev);
>  
> -SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> -		unsigned long, vlen)
> +static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
> +			unsigned long vlen, int flags)
>  {
>  	struct fd f = fdget_pos(fd);
>  	ssize_t ret = -EBADF;
>  
>  	if (f.file) {
>  		loff_t pos = file_pos_read(f.file);
> -		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> +		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
>  		if (ret >= 0)
>  			file_pos_write(f.file, pos);
>  		fdput_pos(f);
> @@ -903,15 +907,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
>  	return ret;
>  }
>  
> -SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> -		unsigned long, vlen)
> +static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
> +			 unsigned long vlen, int flags)
>  {
>  	struct fd f = fdget_pos(fd);
>  	ssize_t ret = -EBADF;
>  
>  	if (f.file) {
>  		loff_t pos = file_pos_read(f.file);
> -		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> +		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
>  		if (ret >= 0)
>  			file_pos_write(f.file, pos);
>  		fdput_pos(f);
> @@ -929,8 +933,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
>  	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
>  }
>  
> -SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> -		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
> +			 unsigned long vlen, unsigned long pos_l,
> +			 unsigned long pos_h, int flags)
>  {
>  	loff_t pos = pos_from_hilo(pos_h, pos_l);
>  	struct fd f;
> @@ -943,7 +948,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
>  	if (f.file) {
>  		ret = -ESPIPE;
>  		if (f.file->f_mode & FMODE_PREAD)
> -			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> +			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
>  		fdput(f);
>  	}
>  
> @@ -953,8 +958,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
>  	return ret;
>  }
>  
> -SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> -		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
> +			  unsigned long vlen, unsigned long pos_l,
> +			  unsigned long pos_h, int flags)
>  {
>  	loff_t pos = pos_from_hilo(pos_h, pos_l);
>  	struct fd f;
> @@ -967,7 +973,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
>  	if (f.file) {
>  		ret = -ESPIPE;
>  		if (f.file->f_mode & FMODE_PWRITE)
> -			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> +			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
>  		fdput(f);
>  	}
>  
> @@ -977,6 +983,56 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
>  	return ret;
>  }
>  
> +SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen)
> +{
> +	return do_readv(fd, vec, vlen, 0);
> +}
> +
> +SYSCALL_DEFINE4(readv2, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen, int, flags)
> +{
> +	return do_readv(fd, vec, vlen, flags);
> +}
> +
> +SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen)
> +{
> +	return do_writev(fd, vec, vlen, 0);
> +}
> +
> +SYSCALL_DEFINE4(writev2, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen, int, flags)
> +{
> +	return do_writev(fd, vec, vlen, flags);
> +}
> +
> +SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +{
> +	return do_preadv(fd, vec,  vlen, pos_l, pos_h, 0);
> +}
> +
> +SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
> +		int, flags)
> +{
> +	return do_preadv(fd, vec,  vlen, pos_l, pos_h, flags);
> +}
> +
> +SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
> +{
> +	return do_pwritev(fd, vec, vlen, pos_l, pos_h, 0);
> +}
> +
> +SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
> +		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
> +		int, flags)
> +{
> +	return do_pwritev(fd, vec, vlen, pos_l, pos_h, flags);
> +}
> +

A few months ago I was working on extending these interfaces (well, the
p{read,write}* ones and AIO) to tack on an IO extension buffer at the end of
the syscall arguments.

Hrmm, I guess I never /did/ send out a v2 after LSF.  The last time we
discussed this[1], the discussion ended with the creation of a structure that
looked something like this:

/* IO extension flags */
#define IO_EXT_PI	(1)	/* protection info (checksums, etc) */
#define IO_EXT_REPLICA	(0x2)	/* replica */
#define IO_EXT_ALL	(IO_EXT_PI | IO_EXT_REPLICA)

/* IO extension descriptor */
struct io_extension {
	__u64 ie_has;

	/* PI stuff */
	__u64 ie_pi_buf;
	__u32 ie_pi_buflen;
	__u64 ie_pi_flags;

	/* which replica do you want? */
	__u32 ie_replica;
};

Given the suggestion of avoiding an explosion of syscalls (here by stuffing all
these bits into a structure), I wonder what people think of moving 'int flags'
into this structure?  At least for the pread/pwrite variants since they already
have a lot of parameters, and for AIO whose struct iocb only has enough room
left for one pointer.
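To make that concrete, here is a rough userspace sketch of how the flags
could ride along in the structure.  IO_EXT_RWFLAGS and ie_rwflags are
hypothetical names invented for this illustration (they are not in the
earlier proposal), and the fixed-width types come from <stdint.h> rather
than the kernel's __u64/__u32 so the snippet builds on its own.

```c
#include <errno.h>
#include <stdint.h>

/* Bits quoted above. */
#define IO_EXT_PI	0x1	/* protection info (checksums, etc) */
#define IO_EXT_REPLICA	0x2	/* replica */
/* Hypothetical bit: "ie_rwflags is valid". */
#define IO_EXT_RWFLAGS	0x4
#define IO_EXT_ALL	(IO_EXT_PI | IO_EXT_REPLICA | IO_EXT_RWFLAGS)

struct io_extension {
	uint64_t ie_has;

	/* PI stuff */
	uint64_t ie_pi_buf;
	uint32_t ie_pi_buflen;
	uint64_t ie_pi_flags;

	/* which replica do you want? */
	uint32_t ie_replica;

	/* hypothetical: per-I/O flags that would otherwise be 'int flags' */
	uint32_t ie_rwflags;
};

/*
 * Reject unknown ie_has bits, then report the per-I/O flags, defaulting
 * to 0 when the caller did not mark ie_rwflags as present.
 */
static int io_ext_get_rwflags(const struct io_extension *ext, uint32_t *rwflags)
{
	if (ext->ie_has & ~(uint64_t)IO_EXT_ALL)
		return -EINVAL;

	*rwflags = (ext->ie_has & IO_EXT_RWFLAGS) ? ext->ie_rwflags : 0;
	return 0;
}
```

The ie_has check is what would keep the structure extensible: an old
kernel rejects bits it does not know instead of silently ignoring them.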

(For anyone paying attention to the original IO extension discussion: I've been
working on plumbing in the ie_replica parameter -- if your FS/blockdev/whatever
can store/fetch alternate copies of a data block, you can request a specific
copy.  Or I suppose one could interpret it as a "desperation" parameter; the
higher the number, the more extraordinary measures the storage can take to
recover data.)

--D

[1] http://article.gmane.org/gmane.linux.kernel.aio.general/3904

>  #ifdef CONFIG_COMPAT
>  
>  static ssize_t compat_do_readv_writev(int type, struct file *file,
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 0f86d85..0c49ae4 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -559,19 +559,31 @@ asmlinkage long sys_readahead(int fd, loff_t offset, size_t count);
>  asmlinkage long sys_readv(unsigned long fd,
>  			  const struct iovec __user *vec,
>  			  unsigned long vlen);
> +asmlinkage long sys_readv2(unsigned long fd,
> +			  const struct iovec __user *vec,
> +			  unsigned long vlen, int flags);
>  asmlinkage long sys_write(unsigned int fd, const char __user *buf,
>  			  size_t count);
>  asmlinkage long sys_writev(unsigned long fd,
>  			   const struct iovec __user *vec,
>  			   unsigned long vlen);
> +asmlinkage long sys_writev2(unsigned long fd,
> +			    const struct iovec __user *vec,
> +			    unsigned long vlen, int flags);
>  asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
>  			    size_t count, loff_t pos);
>  asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
>  			     size_t count, loff_t pos);
>  asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
>  			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
> +asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
> +			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
> +			    int flags);
>  asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
>  			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
> +asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
> +			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
> +			    int flags);
>  asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
>  asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
>  asmlinkage long sys_chdir(const char __user *filename);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 11d11bc..75ad687 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -213,6 +213,14 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
>  __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
>  #define __NR_pwritev 70
>  __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
> +#define __NR_readv2 280
> +__SC_COMP(__NR_readv2, sys_readv2)
> +#define __NR_writev2 281
> +__SC_COMP(__NR_writev2, sys_writev2)
> +#define __NR_preadv2 282
> +__SC_COMP(__NR_preadv2, sys_preadv2)
> +#define __NR_pwritev2 283
> +__SC_COMP(__NR_pwritev2, sys_pwritev2)
>  
>  /* fs/sendfile.c */
>  #define __NR3264_sendfile 71
> @@ -707,7 +715,7 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
>  __SYSCALL(__NR_memfd_create, sys_memfd_create)
>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 280
> +#define __NR_syscalls 284
>  
>  /*
>   * All syscalls below here should go away really,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6e3ba07..e0919ba 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1726,7 +1726,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		}
>  	}
>  
> -	retval = do_generic_file_read(file, ppos, iter, retval);
> +	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
>  out:
>  	return retval;
>  }
> -- 
> 2.1.0

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 2/5] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-18 18:48       ` Darrick J. Wong
@ 2014-09-19 10:52         ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-19 10:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

On Thu, Sep 18, 2014 at 11:48:23AM -0700, Darrick J. Wong wrote:
> A few months ago I was working on extending these interfaces (well, the
> p{read,write}* ones and AIO) to tack on an IO extension buffer at the end of
> the syscall arguments.

Honestly, that proposal is so butt-ugly that I treated it as an April
first joke.  I don't really think we want any of that overload mess.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 22:36     ` Elliott, Robert (Server Storage)
@ 2014-09-19 11:21       ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-19 11:21 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Andreas Dilger, Milosz Tanski, linux-kernel, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer

On Mon, Sep 15, 2014 at 10:36:46PM +0000, Elliott, Robert (Server Storage) wrote:
> That sounds like the proposed WRITE SCATTERED/READ GATHERED 
> commands for SCSI (which are related to, but not necessarily
> tied to, atomic writes).  We discussed them a bit at 
> LSF-MM 2013 - see http://lwn.net/Articles/548116/.

In the same way a preadx/pwritex could use, but would not require, an
O_ATOMIC.  What's the status of those in T10?  Last I heard
READ GATHERED was out and they were only looking into WRITE SCATTERED?

Without the atomic WRITE SCATTERED use case, adding the syscalls seems
rather pointless, and I'd really avoid blocking nice software-only
features like the per-I/O nonblock flag (and the similarly trivial
per-I/O sync option I have a prototype for) on it.

Speaking of easy flags:  while a nonblock flag for writes wouldn't
be anywhere near as easy as the one for reads, would there be sufficient
interest in it to bother implementing it?

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-15 21:58   ` Jeff Moyer
@ 2014-09-19 11:23     ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-19 11:23 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	michael.kerrisk

On Mon, Sep 15, 2014 at 05:58:44PM -0400, Jeff Moyer wrote:
> I thought you were going to introduce a new flag instead of using
> O_NONBLOCK for this.  I dug up an old email that suggested that enabling
> O_NONBLOCK for regular files (well, a device node in this case) broke a
> cd ripping or burning application.  I also found this old bugzilla,
> which states that squid would fail to start, and that gqview was also
> broken:
>   https://bugzilla.redhat.com/show_bug.cgi?id=136057

That is why we avoid looking at the per-open O_NONBLOCK flag, and only
apply it per I/O.  As mentioned in my last mail it's not quite as
trivial but still fairly easy to also do that for writes.

> I don't think O_NONBLOCK is the right flag.  What you're really
> specifying is a flag that prevents I/O in the read path, and nowhere
> else.  As such, I'd feel much better about this if we defined a new flag
> (O_NONBLOCK_READ maybe?  No, that's too verbose.).
> 
> In summary, I like the idea, but I worry about overloading O_NONBLOCK.

There's a fair argument we could use a different namespace for the
per-I/O ops, and it seems like Milosz already implemented this for the
next version.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC 1/2] aio: async readahead
  2014-09-17 14:49   ` Benjamin LaHaise
@ 2014-09-19 11:26     ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-19 11:26 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Andreas Dilger

Requiring the block mappings to be entirely async is why we never went
for full buffered aio.  What would seem more useful is to offload all
readahead to workqueues to make sure they never block the caller for
sys_readahead or if we decide to readahead for the nonblocking read.

I tried to implement this, but I couldn't find a good place to hang
the work_struct for it off.  If we decide to dynamically allocate
the ra structure separate from struct file that might be an obvious
place.

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 5/5] Check for O_NONBLOCK in all read_iter instances
  2014-09-17 22:20     ` Milosz Tanski
@ 2014-09-19 11:26       ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-19 11:26 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro

FYI, this should be folded into the previous patch.


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 4/5] O_NONBLOCK flag for readv2/preadv2
  2014-09-17 22:20     ` Milosz Tanski
@ 2014-09-19 11:27       ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-19 11:27 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

The subject needs an update now that the flag has been renamed.


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 4/5] O_NONBLOCK flag for readv2/preadv2
  2014-09-19 11:27       ` Christoph Hellwig
@ 2014-09-19 11:59         ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-19 11:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro

On Fri, Sep 19, 2014 at 7:27 AM, Christoph Hellwig <hch@infradead.org> wrote:
> The subject needs an update now that the flag has been renamed.
>

You're right, I'm not sure how I missed it. Fixed it locally for the
next submission.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-17 22:20   ` Milosz Tanski
@ 2014-09-19 14:42     ` Jonathan Corbet
  -1 siblings, 0 replies; 167+ messages in thread
From: Jonathan Corbet @ 2014-09-19 14:42 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

On Wed, 17 Sep 2014 22:20:45 +0000
Milosz Tanski <milosz@adfin.com> wrote:

> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
> 
> It does this by introducing new syscalls readv2/writev2 and
> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
> syscalls that accept an extra flag argument (O_NONBLOCK).

So I'm trying to understand the reasoning behind this approach so I can
explain it to others.  When you decided to add these syscalls, you
ruled out some other approaches that have been out there for a while.
I assume that, before these syscalls can be merged, people will want to
understand why you did that.  So I'll ask the dumb questions:

 - Non-blocking I/O has long been supported with a well-understood set
   of operations - O_NONBLOCK and fcntl().  Why do we need a different
   mechanism here - one that's only understood in the context of
   buffered file I/O?  I assume you didn't want to implement support
   for poll() and all that, but is that a good enough reason to add a
   new Linux-specific non-blocking I/O technique?

 - Patches adding fincore() have been around since at least 2010; see,
   for example, https://lwn.net/Articles/371538/ or
   https://lwn.net/Articles/604640/.  It seems this could be used in
   favor of four new read() syscalls; is there a reason it's not
   suitable for your use case?

 - Patches adding buffered support for AIO have been around since at
   least 2003 - https://lwn.net/Articles/24422/, for example.  I guess
   I don't really have to ask why you don't want to take that
   approach! :)  

Apologies for my ignorance here; that's what I get for hanging around
with the MM folks at LSFMM, I guess.  Anyway, I suspect I'm not the
only one who would appreciate any background you could give here.

Thanks,

jon

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC 1/2] aio: async readahead
  2014-09-19 11:26     ` Christoph Hellwig
@ 2014-09-19 16:01       ` Benjamin LaHaise
  -1 siblings, 0 replies; 167+ messages in thread
From: Benjamin LaHaise @ 2014-09-19 16:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Andreas Dilger

On Fri, Sep 19, 2014 at 04:26:12AM -0700, Christoph Hellwig wrote:
> Requiring the block mappings to be entirely async is why we never went
> for full buffered aio.  What would seem more useful is to offload all
> readahead to workqueues to make sure they never block the caller for
> sys_readahead or if we decide to readahead for the nonblocking read.

I can appreciate that it may be difficult for some filesystems to implement 
a fully asynchronous readpage, but at least for some, it is possible 
and not too difficult.

> I tried to implement this, but I couldn't find a good place to hang
> the work_struct for it off.  If we decide to dynamically allocate
> the ra structure separate from struct file that might be an obvious
> place.

The approach I used in the async ext2/3/4 indirect style metadata readpage 
was to put the async state into the page's memory.  That won't work very 
well on 32 bit systems, but it works well and avoids having to perform 
another memory allocation on 64 bit systems.

I'm still of the opinion that the readpage operation should be started by 
the submitting process.  Some of the work I did in tuning things for my 
employer with async reads found that punting reads to another thread 
caused significant degradation of our workload (basically, reading in a 
bunch of persistent messages from disk, with small messages being an 
important corner of performance).  What ended up being the best performing 
for me was to have an async readahead operation to fill the page cache 
with data from the file, and then to issue a read that was essentially 
non-blocking.  This approach meant that the copy of data from the kernel 
into userspace was performed by the thread that was actually using the 
data.  By doing the copy only once all i/o completed, the data was primed 
in the CPU's cache, allowing the code that actually operates on the data 
to benefit.  Any gradual copy over time ended up performing significantly 
worse.
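
A rough userspace analogue of this two-step pattern can be sketched in C. It is only an approximation: it assumes posix_fadvise(POSIX_FADV_WILLNEED) as the prefetch hint, which is not the aio readahead operation proposed in this RFC, and the helper names start_prefetch() and read_fully() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Step 1 (submitter): hint that [off, off + len) will be needed soon so the
 * kernel can start filling the page cache. Returns 0 or an errno value,
 * which is how posix_fadvise() itself reports errors.
 */
static int start_prefetch(int fd, off_t off, off_t len)
{
    return posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}

/*
 * Step 2 (consumer): copy the data to userspace in the thread that will
 * actually use it. Returns the number of bytes read (short only at EOF)
 * or -1 with errno set.
 */
static ssize_t read_fully(int fd, void *buf, size_t len, off_t off)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done,
                          off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same range */
            return -1;
        }
        if (n == 0)
            break;              /* end of file */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

A submitter would call start_prefetch() when the request arrives and leave read_fully() to the thread that consumes the data, so the copy to userspace happens once and on the CPU that uses it.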

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-19 14:42     ` Jonathan Corbet
@ 2014-09-19 16:13       ` Volker Lendecke
  -1 siblings, 0 replies; 167+ messages in thread
From: Volker Lendecke @ 2014-09-19 16:13 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

On Fri, Sep 19, 2014 at 10:42:04AM -0400, Jonathan Corbet wrote:
>  - Non-blocking I/O has long been supported with a well-understood set
>    of operations - O_NONBLOCK and fcntl().  Why do we need a different
>    mechanism here - one that's only understood in the context of
>    buffered file I/O?  I assume you didn't want to implement support
>    for poll() and all that, but is that a good enough reason to add a
>    new Linux-specific non-blocking I/O technique?

The Samba use case would be to first try the nonblocking read
and only if that fails hand over to a blocking thread on the
same fd. Both interleave, so it's not possible to fcntl in
between. dup()ing the fd is also difficult because of the
weird close() semantics regarding fcntl locks.
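
The reason fcntl() cannot be used between the two attempts is that O_NONBLOCK is a file status flag stored on the open file description, not on the descriptor, so every thread (and every dup()ed descriptor) sharing it sees the change. A small sketch of that sharing, with a hypothetical helper name:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Returns true if setting O_NONBLOCK through a dup()ed descriptor is
 * visible through the original one, i.e. the flag is shared. The original
 * flags are restored before returning.
 */
static bool nonblock_leaks_across_dup(int fd)
{
    bool leaked = false;
    int copy = dup(fd);

    if (copy < 0)
        return false;

    int flags = fcntl(copy, F_GETFL);
    if (flags >= 0 && fcntl(copy, F_SETFL, flags | O_NONBLOCK) == 0) {
        int seen = fcntl(fd, F_GETFL);      /* look through the ORIGINAL fd */
        leaked = seen >= 0 && (seen & O_NONBLOCK) != 0;
        fcntl(copy, F_SETFL, flags);        /* put the old flags back */
    }
    close(copy);
    return leaked;
}
```

Getting an independent flag would take a second open() of the same file, and closing that extra descriptor would drop the process's fcntl() record locks on the file, which is the close() problem mentioned above.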

>  - Patches adding fincore() have been around since at least 2010; see,
>    for example, https://lwn.net/Articles/371538/ or
>    https://lwn.net/Articles/604640/.  It seems this could be used in
>    favor of four new read() syscalls; is there a reason it's not
>    suitable for your use case?

Isn't that at least racy?
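
The race is the usual check-then-act gap: a residency query only describes the page cache at the instant it runs, and pages can be evicted before the following read() is issued. Since fincore() is only a proposal here, the sketch below uses the existing mincore() call on a mapping as a stand-in; resident_pages() is a hypothetical helper.

```c
#define _DEFAULT_SOURCE 1
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of the first len bytes of fd were resident in
 * memory at the moment of the check. Returns -1 on error. The answer is a
 * snapshot only: nothing pins the pages, so a read() issued afterwards can
 * still block on disk I/O.
 */
static long resident_pages(int fd, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    long count = -1;

    if (page <= 0 || len == 0)
        return -1;

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    if (vec != NULL && mincore(map, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;    /* low bit set means page is resident */
    }
    free(vec);
    munmap(map, len);
    return count;
}
```

A read call that itself refuses to block closes this gap, because the check and the copy happen in one step inside the kernel.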

>  - Patches adding buffered support for AIO have been around since at
>    least 2003 - https://lwn.net/Articles/24422/, for example.  I guess
>    I don't really have to ask why you don't want to take that
>    approach! :)  

Well, I guess this would work for Samba.

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-19 14:42     ` Jonathan Corbet
@ 2014-09-19 17:19       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-19 17:19 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

Jon, this is a very long-winded response so my apologies in advance...
when I sat down to write this my collective unconscious produced a
brain dump of my thought process. Hopefully it gives you an idea of the
problem, my motivation and how it influenced the solution (which I
think is rather simple).

I think the best place to start is the motivation. I've
been lurking/following various attempts and approaches to non-blocking
buffered I/O at least since 2007 (and if I search the archives,
the discussions go back a decade before that). Over the years -- including
various jobs and projects -- I've kept running into similar problems in
network services with 100s, 1000s and now 10ks of various processing
tasks (or requests). The story would be simple if I was building a
webserver where I could just use epoll and sendfile and be done.
However, most of the things I've built require a combination of
network-bound, cpu-bound and disk-bound work and the disk-bound part
was always the weak part of the story.

Some major projects where I ran into this:
- distributed storage/processing of email (for compliance reasons),
- ad-serving (many nodes, with local databases of candidates but
also cpu bound work to build a cost effective model),
- now a VLDB columnar db where there is overlapping CPU work
(a typical query processes in the low-mid billions of rows), IO work
(data on Cephfs via FSCache) and global re-aggregation of data
(network bound), all on the same nodes in the cluster.

I always wanted to leverage buffered IO in the kernel because I agree
with Linus' sentiment that I should be working with the page cache, not
against the page cache in the OS. And the truth is, the man-years of work
that went into the Linux mm system I could not, nor would I want to,
replicate in user-space. It's next to impossible to compete with it,
especially with all the painful scalability work that went into it so it
runs well on the many-core systems that power servers nowadays. Buffered
writes were never as big a problem for me since there are already a lot
of interfaces for that in the kernel that work okay (sync_file_range)
and you just toss it to a thread pool.

Let's get to the root of my problem; it was always buffered reads.
Sometimes it's the network thread (the one multiplexing via epoll)
and other times it's a CPU bound thread. You end up with one of two
problems. One is blocking and wasting CPU resources (instead of
running something else). The second one is provisioning the number of
threads, but there you can't predict how much to over provision by
and you end up with times that you get swamped with too much CPU bound
work (since data is cached due to recent use, or read-ahead)... and
it's hard to create proper back-pressure in that system.

To avoid that problem the almost universal solution is to create a
separate thread pool dedicated to blocking work. Lots of projects end
up going down that route, like samba, libuv (which is used in many
services), countless java frameworks, my projects.

This is a very common architecture (here's a visualization:
http://i.imgur.com/f8Pla7j.png it's not a picture of a cat but it's a
shitty hand drawing). And this works kind-of-okay, however generally
this approach introduces latency into the requests. This latency is
caused by:

1. Having to stop our CPU bound task to fetch more data and switch to
other work (working set cache effects). In many cases, for commonly
accessed or sequentially read data, this will be in the page cache and
we could avoid this.
2. Having the fast (small/cached) requests get blocked behind slower
(large/uncached) requests.
3. Other general context switching, synchronization, notification latency.

This has been bugging me for years and I've tried countless
workarounds and followed countless lkml threads on buffered AIO that
got nowhere (many of them were very complex). Then I had a eureka
moment: I could solve 90% of the problem for this common architecture
if we had a very simple read syscall that would return right away,
instead of blocking, if the data was not in the page cache. And now it
seems so obvious if you look at the chart and the latency sources. We
avoid latency by doing a "fast read" in the submitter and avoiding all
that machinery if the data is cached.

Here's why (and some assumptions):
- A large chunk of data is cached because it's commonly used (zipf
distribution of access) or is read sequentially (read-ahead).
- If we are able to avoid submitting many cached requests to this IO
queue, that removes a lot of contention on the queue. Only the large /
uncached requests will go there (or the next read-ahead boundary).
- We're able to keep processing the current context in the CPU bound
thread thanks to "fast read" and we avoid a lot of needless work
context switching.
- We can control "fast read" / queuing policy in our application.
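
The submitter-side "fast read" can be sketched in C as follows. This is an illustration under stated assumptions, not the code from this patch set: it uses preadv2() with the RWF_NOWAIT flag as found in later mainline kernels and glibc, whereas the flag name in this RFC was still under discussion, and fast_read() and its status enum are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008   /* value from the Linux UAPI headers */
#endif

enum fast_read_status {
    FAST_READ_HIT,      /* data (possibly a short count) came from the page cache */
    FAST_READ_MISS,     /* would block or not supported: queue to the I/O pool */
    FAST_READ_ERROR     /* real failure, errno is set */
};

/*
 * Ask the kernel to satisfy the read only if it can do so without waiting
 * for disk I/O. On a hit, *nread holds the byte count, which may be short
 * if only part of the range is cached (0 means end of file).
 */
static enum fast_read_status
fast_read(int fd, void *buf, size_t len, off_t off, ssize_t *nread)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

    if (n >= 0) {
        *nread = n;
        return FAST_READ_HIT;
    }
    /* EAGAIN: not cached. EOPNOTSUPP/ENOSYS: kernel or fs lacks support. */
    if (errno == EAGAIN || errno == EOPNOTSUPP || errno == ENOSYS)
        return FAST_READ_MISS;
    return FAST_READ_ERROR;
}
```

On FAST_READ_MISS the caller would queue the request to its blocking I/O thread pool as before; on a short FAST_READ_HIT it can use what it got and queue only the remainder.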

The last point is easy to miss but it actually gives the application a
lot of power. The application can prioritize "fast requests" in the
queue if they have a high ratio of "fast read" hits and, vice-versa, it
can avoid increasing latency (double syscall) in the uncached workload by
not attempting fast reads in cases of very low "fast read" hit rates.
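
One way such an application-side policy could look is sketched below. The structure, the thresholds and the periodic probe are assumptions chosen for illustration; the thread does not prescribe any particular heuristic.

```c
#include <stdbool.h>
#include <stdint.h>

struct fast_read_policy {
    uint32_t window;        /* attempts to track before counters are halved */
    uint32_t min_hit_pct;   /* below this hit rate, skip the fast path */
    uint32_t probe_every;   /* while skipping, still try 1 request in N */
    uint32_t attempts;      /* fast reads tried in the current window */
    uint32_t hits;          /* how many of those were cache hits */
    uint32_t skipped;       /* requests skipped since the last probe */
};

/* Decide whether the next request should attempt a fast read at all. */
static bool policy_should_try(struct fast_read_policy *p)
{
    if (p->attempts < p->window / 2)
        return true;                    /* not enough samples yet */
    if (p->hits * 100 >= p->attempts * p->min_hit_pct)
        return true;                    /* hit rate is good enough */
    if (++p->skipped >= p->probe_every) {
        p->skipped = 0;                 /* occasional probe to notice when */
        return true;                    /* the working set becomes cached */
    }
    return false;                       /* avoid the wasted extra syscall */
}

/* Record the outcome of a fast read attempt. */
static void policy_record(struct fast_read_policy *p, bool hit)
{
    if (p->attempts >= p->window) {     /* decay old history */
        p->attempts /= 2;
        p->hits /= 2;
    }
    p->attempts++;
    if (hit)
        p->hits++;
}
```

With window = 100, min_hit_pct = 20 and probe_every = 10, a workload that misses every time ends up paying for the extra syscall on only one request in ten, while a mostly cached workload keeps the fast path on for every request.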

The real proof is in the tests. Both our application and the FIO tests
paint a picture of greatly improved overall request latencies for these
kinds of workloads that want to overlap CPU bound and IO bound work in
one application. Take a look at the cover letter for the patch.

In conclusion we can get to what I consider a 90% solution to
non-blocking buffered file reads with a very small / easy to read
patch (where the other proposals ran into problems). It solves a common
real-world problem in a very common user-space architecture (so high
potential for impact). Finally, the new syscalls pave the way for other
per-call read/write flags that other folks have already suggested in
this and other threads.

I'm sorry if this contains any errors, but it took me longer to write
this than I wanted to and I had to hurry to wrap up this email.

Best,
- Milosz

On Fri, Sep 19, 2014 at 10:42 AM, Jonathan Corbet <corbet@lwn.net> wrote:
> On Wed, 17 Sep 2014 22:20:45 +0000
> Milosz Tanski <milosz@adfin.com> wrote:
>
>> This patchset introduces the ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works only for those filesystems
>> that have data in the page cache.
>>
>> It does this by introducing new syscalls readv2/writev2 and
>> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
>> syscalls that accept an extra flag argument (O_NONBLOCK).
>
> So I'm trying to understand the reasoning behind this approach so I can
> explain it to others.  When you decided to add these syscalls, you
> ruled out some other approaches that have been out there for a while.
> I assume that, before these syscalls can be merged, people will want to
> understand why you did that.  So I'll ask the dumb questions:
>
>  - Non-blocking I/O has long been supported with a well-understood set
>    of operations - O_NONBLOCK and fcntl().  Why do we need a different
>    mechanism here - one that's only understood in the context of
>    buffered file I/O?  I assume you didn't want to implement support
>    for poll() and all that, but is that a good enough reason to add a
>    new Linux-specific non-blocking I/O technique?
>
>  - Patches adding fincore() have been around since at least 2010; see,
>    for example, https://lwn.net/Articles/371538/ or
>    https://lwn.net/Articles/604640/.  It seems this could be used in
>    favor of four new read() syscalls; is there a reason it's not
>    suitable for your use case?
>
>  - Patches adding buffered support for AIO have been around since at
>    least 2003 - https://lwn.net/Articles/24422/, for example.  I guess
>    I don't really have to ask why you don't want to take that
>    approach! :)
>
> Apologies for my ignorance here; that's what I get for hanging around
> with the MM folks at LSFMM, I guess.  Anyway, I suspect I'm not the
> only one who would appreciate any background you could give here.
>
> Thanks,
>
> jon

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-19 14:42     ` Jonathan Corbet
@ 2014-09-19 17:33       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-19 17:33 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

On Fri, Sep 19, 2014 at 10:42 AM, Jonathan Corbet <corbet@lwn.net> wrote:
> On Wed, 17 Sep 2014 22:20:45 +0000
> Milosz Tanski <milosz@adfin.com> wrote:
>
>> This patcheset introduces an ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works by only for those filesystems
>> that have data in the page cache.
>>
>> It does this by introducing new syscalls new syscalls readv2/writev2 and
>> preadv2/pwritev2. These new syscalls behave like the network sendmsg, recvmsg
>> syscalls that accept an extra flag argument (O_NONBLOCK).
>
> So I'm trying to understand the reasoning behind this approach so I can
> explain it to others.  When you decided to add these syscalls, you
> ruled out some other approaches that have been out there for a while.
> I assume that, before these syscalls can be merged, people will want to
> understand why you did that.  So I'll ask the dumb questions:
>
>  - Non-blocking I/O has long been supported with a well-understood set
>    of operations - O_NONBLOCK and fcntl().  Why do we need a different
>    mechanism here - one that's only understood in the context of
>    buffered file I/O?  I assume you didn't want to implement support
>    for poll() and all that, but is that a good enough reason to add a
>    new Linux-specific non-blocking I/O technique?

I realized that I didn't answer this question well in my other long
email. O_NONBLOCK doesn't work on files under any commonly used OS,
and people have gotten used to this behavior so I doubt we could change
that without breaking a lot of folks' applications. If you want to
ignore my other long email, what I realized is that I could solve a lot
of problems if I had a syscall like recvmsg that takes a MSG_DONTWAIT-style
per-call flag and worked on regular files (not sockets), and thus
readv2/preadv2 was born.
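
For comparison, this is roughly what the per-call non-blocking behavior
looks like on sockets today. On Linux the recv()/recvmsg() flag is
spelled MSG_DONTWAIT; it makes a single call non-blocking without
setting O_NONBLOCK on the descriptor. The helper name try_recv is
hypothetical and only serves this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: one non-blocking receive attempt.
 * Returns 1 and stores the byte count in *out if the call completed,
 * 0 if it would have blocked, and -1 on any other error.
 */
static int try_recv(int sock, void *buf, size_t len, ssize_t *out)
{
	ssize_t n = recv(sock, buf, len, MSG_DONTWAIT);

	if (n >= 0) {
		*out = n;
		return 1;
	}
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return 0;
	return -1;
}
```

The descriptor itself stays in blocking mode, so other users of the
same open file description are unaffected; that per-call property is
what the new read syscalls aim to provide for regular files.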

>
>  - Patches adding fincore() have been around since at least 2010; see,
>    for example, https://lwn.net/Articles/371538/ or
>    https://lwn.net/Articles/604640/.  It seems this could be used in
>    favor of four new read() syscalls; is there a reason it's not
>    suitable for your use case?
>
>  - Patches adding buffered support for AIO have been around since at
>    least 2003 - https://lwn.net/Articles/24422/, for example.  I guess
>    I don't really have to ask why you don't want to take that
>    approach! :)
>
> Apologies for my ignorance here; that's what I get for hanging around
> with the MM folks at LSFMM, I guess.  Anyway, I suspect I'm not the
> only one who would appreciate any background you could give here.
>
> Thanks,
>
> jon



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 2/5] Define new syscalls readv2,preadv2,writev2,pwritev2
  2014-09-19 10:52         ` Christoph Hellwig
@ 2014-09-20  0:19           ` Darrick J. Wong
  -1 siblings, 0 replies; 167+ messages in thread
From: Darrick J. Wong @ 2014-09-20  0:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

On Fri, Sep 19, 2014 at 03:52:20AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 18, 2014 at 11:48:23AM -0700, Darrick J. Wong wrote:
> > A few months ago I was working on extending these interfaces (well, the
> > p{read,write}* ones and AIO) to tack on an IO extension buffer at the end of
> > the syscall arguments.
> 
> Honestly, that proposal is so but ugly that I treated it as an April
> first joke.  I don't really think we want any of that overload mess.

I agree that a kitchen sink structure full of IO attributes is messy; at best
it avoids maintenance of horrifying parameter lists.  The first two drafts of
the interface were too complicated and with the help of everyone who responded
to the first two threads with their criticisms, I've focused on paring down the
parts that people can screw up.

In v3, I define only a flat struct io_extension from which extensions can
copy_from_user whatever bits they want.  Ideally I'd have three or four uses of
the extension API lined up for a more thoughtful design, but I'm just now
getting around to a second.

Clearly you have ideas of what constitutes good and bad API design.  I've never
defined a major programming interface.  Can you point me towards examples of
where we've gotten it right?  Or possibly a discussion of design?  The
materials from mkerrisk's 2007 talk about kernel API design seem to have gone
down with kernel.org, and I prefer to avoid badgering linux-api until I'm more
confident that I won't fall into the "this is apparently so bad that people
won't reply" trap.

I'm willing to learn, but snark about April Fool's jokes leading to silence
does not help me to improve the code or to help myself.

--D

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-19 17:33       ` Milosz Tanski
@ 2014-09-22 14:12         ` Jonathan Corbet
  -1 siblings, 0 replies; 167+ messages in thread
From: Jonathan Corbet @ 2014-09-22 14:12 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

On Fri, 19 Sep 2014 13:33:14 -0400
Milosz Tanski <milosz@adfin.com> wrote:

> >  - Non-blocking I/O has long been supported with a well-understood set
> >    of operations - O_NONBLOCK and fcntl().  Why do we need a different
> >    mechanism here - one that's only understood in the context of
> >    buffered file I/O?  I assume you didn't want to implement support
> >    for poll() and all that, but is that a good enough reason to add a
> >    new Linux-specific non-blocking I/O technique?  
> 
> I realized that I didn't answer this question well in my other long
> email. O_NONBLOCK doesn't work on files under any commonly used OS,
> and people have gotten use to this behavior so I doubt we could change
> that without breaking a lot of folks applications.

So I'm not contesting this, but I am genuinely curious: do you think
there are applications out there requesting non-blocking behavior on
regular files that will then break if they actually get non-blocking
behavior?  I don't suppose you have an example?

Thanks,

jon

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-22 14:12         ` Jonathan Corbet
@ 2014-09-22 14:24           ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-22 14:24 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Milosz Tanski, LKML, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Theodore Ts'o,
	Al Viro

Jonathan Corbet <corbet@lwn.net> writes:

> On Fri, 19 Sep 2014 13:33:14 -0400
> Milosz Tanski <milosz@adfin.com> wrote:
>
>> >  - Non-blocking I/O has long been supported with a well-understood set
>> >    of operations - O_NONBLOCK and fcntl().  Why do we need a different
>> >    mechanism here - one that's only understood in the context of
>> >    buffered file I/O?  I assume you didn't want to implement support
>> >    for poll() and all that, but is that a good enough reason to add a
>> >    new Linux-specific non-blocking I/O technique?  
>> 
>> I realized that I didn't answer this question well in my other long
>> email. O_NONBLOCK doesn't work on files under any commonly used OS,
>> and people have gotten use to this behavior so I doubt we could change
>> that without breaking a lot of folks applications.
>
> So I'm not contesting this, but I am genuinely curious: do you think
> there are applications out there requesting non-blocking behavior on
> regular files that will then break if they actually get non-blocking
> behavior?  I don't suppose you have an example?

Hi, Jon,

Back when I tried to introduce O_NONBLOCK for regular files, the squid
proxy actually broke.  Software that dealt with burning optical media
also broke.  See my mail message here for more details:
  https://lkml.org/lkml/2014/9/15/942

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-22 14:12         ` Jonathan Corbet
@ 2014-09-22 14:25           ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-22 14:25 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Milosz Tanski, LKML, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro

On Mon, Sep 22, 2014 at 10:12:21AM -0400, Jonathan Corbet wrote:
> So I'm not contesting this, but I am genuinely curious: do you think
> there are applications out there requesting non-blocking behavior on
> regular files that will then break if they actually get non-blocking
> behavior?  I don't suppose you have an example?

You only have to look as far as Samba, but Jeff quoted an old lkml
post earlier that had other examples.

source3/smbd/open.c:


	if (first_open_attempt && lp_kernel_oplocks(SNUM(conn))) {
		/*
		 * With kernel oplocks the open breaking an oplock
		 * blocks until the oplock holder has given up the
		 * oplock or closed the file. We prevent this by first
		 * trying to open the file with O_NONBLOCK (see "man
		 * fcntl" on Linux). For the second try, triggered by
		 * an oplock break response, we do not need this
		 * anymore.
		 *
		 * This is true under the assumption that only Samba
		 * requests kernel oplocks. Once someone else like
		 * NFSv4 starts to use that API, we will have to
		 * modify this by communicating with the NFSv4 server.
		 */
		flags2 |= O_NONBLOCK;
	}

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 0/5] Non-blockling buffered fs read (page cache only)
  2014-09-22 14:12         ` Jonathan Corbet
@ 2014-09-22 14:30           ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-22 14:30 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

On Mon, Sep 22, 2014 at 10:12 AM, Jonathan Corbet <corbet@lwn.net> wrote:
> On Fri, 19 Sep 2014 13:33:14 -0400
> Milosz Tanski <milosz@adfin.com> wrote:
>
>> >  - Non-blocking I/O has long been supported with a well-understood set
>> >    of operations - O_NONBLOCK and fcntl().  Why do we need a different
>> >    mechanism here - one that's only understood in the context of
>> >    buffered file I/O?  I assume you didn't want to implement support
>> >    for poll() and all that, but is that a good enough reason to add a
>> >    new Linux-specific non-blocking I/O technique?
>>
>> I realized that I didn't answer this question well in my other long
>> email. O_NONBLOCK doesn't work on files under any commonly used OS,
>> and people have gotten used to this behavior so I doubt we could change
>> that without breaking a lot of folks' applications.
>
> So I'm not contesting this, but I am genuinely curious: do you think
> there are applications out there requesting non-blocking behavior on
> regular files that will then break if they actually get non-blocking
> behavior?  I don't suppose you have an example?
>
> Thanks,
>
> jon

Earlier in this thread Jeff pointed (
https://lkml.org/lkml/2014/9/15/942 ) to a bug in RH bugzilla (
https://bugzilla.redhat.com/show_bug.cgi?id=136057 ) where an
application (squid) reading regular disk files started getting EAGAIN
from those reads (provided that they were opened with O_NONBLOCK) and,
since that doesn't trigger readahead, it spins on the read forever. As
far as I know O_NONBLOCK for regular files in Linux is undefined
behavior, as none of the man pages I looked at (esp. fcntl, 2 open,
3 open) specify what happens in the non-network, non-fifo case (some
of them refer to file descriptors that support non-blocking operation,
which is pretty vague).

So even if squid is wrong in its behavior (since it's undefined), a
quick google search reveals lots of mailing list / forum posts of
people essentially describing the behavior to date, e.g. O_NONBLOCK on
regular files blocks, with select/poll/epoll always reporting the file
as ready. Based on that anecdotal evidence, I assume a decent chunk
of user apps would break.
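
The "always ready" behavior described above can be checked with a small
program. This is a minimal sketch and not something posted in the thread;
the function name poll_regular_file is made up for illustration. It opens
a regular file with O_NONBLOCK and asks poll() for readiness with a zero
timeout.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Open a regular file with O_NONBLOCK and ask poll() whether it is
 * readable. Returns the revents mask reported by poll(), or -1 on error.
 * (Hypothetical helper, for illustration only.)
 */
int poll_regular_file(const char *path)
{
	struct pollfd pfd;
	int fd, ret;

	fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd < 0)
		return -1;

	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* Zero timeout: report readiness without waiting. */
	ret = poll(&pfd, 1, 0);
	close(fd);

	return ret < 0 ? -1 : pfd.revents;
}
```

On Linux the returned mask is expected to include POLLIN for a regular
file whether or not the data is cached, which is why readiness
notification alone cannot tell an application that a read will not block.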

- Milosz

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-19 11:21       ` Christoph Hellwig
@ 2014-09-22 15:48         ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-22 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Elliott, Robert (Server Storage),
	Andreas Dilger, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo

Christoph Hellwig <hch@infradead.org> writes:

> Without the atomic WRITE SCATTERED use case adding the syscalls seems
> rather pointless, and I'd really avoid blocking nice software only
> features like the per-I/O nonblock flag (and the similarly trivial
> per-I/O sync option I have a prototype for) on it.

Andreas and Zach pointed out that the scatter/gather system calls also
help network file systems.  I'm not yet sure how much work it would be,
but it certainly seems worth considering readx/writex (or whatever we
want to call them) to avoid needlessly adding a ton of system calls.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* RE: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-19 11:21       ` Christoph Hellwig
  (?)
  (?)
@ 2014-09-22 16:25       ` Elliott, Robert (Server Storage)
  -1 siblings, 0 replies; 167+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-09-22 16:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Dilger, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer

> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@infradead.org]
> Sent: Friday, 19 September, 2014 6:22 AM
> To: Elliott, Robert (Server Storage)
> Cc: Andreas Dilger; Milosz Tanski; linux-kernel@vger.kernel.org; Christoph
> Hellwig; linux-fsdevel@vger.kernel.org; linux-aio@kvack.org; Mel Gorman;
> Volker Lendecke; Tejun Heo; Jeff Moyer
> Subject: Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
> 
> On Mon, Sep 15, 2014 at 10:36:46PM +0000, Elliott, Robert (Server Storage)
> wrote:
> > That sounds like the proposed WRITE SCATTERED/READ GATHERED
> > commands for SCSI (which are related to, but not necessarily
> > tied to, atomic writes).  We discussed them a bit at
> > LSF-MM 2013 - see http://lwn.net/Articles/548116/.
> 
> In the same way a preadx/pwritex could use but would not require an
> O_ATOMIC.  What's the status of those in t10?  Last I heard
> READ GATHERED was out and they were only looking into WRITE SCATTERED?

Both of these essentially require more CDB bytes to convey the
LBA range list.  Under the current SCSI architecture model, the 
choices are:
* include in a longer CDB
* include in the data-out buffer

For longer CDBs:
* CDBs >16 bytes are not widely supported
* 260 byte max CDB size limits the number of LBA ranges
* in most SCSI protocols, commands are unsolicited (push rather
than pull), so the target must have buffer space for (max queue
depth)*(max CDB size). In SCSI Express, although CDBs are pulled
with PCIe memory reads rather than pushed, longer CDBs complicate
circular queue handling.

For the data-out buffer:
* not delivering all the CDB info upfront complicates drive 
hardware designs. They want to get the data transfer started
from the medium, but have to wait for a whole extra DMA 
transfer first. This is not so bad for low-latency PCIe,
but is not a good fit for protocols behind HBAs like
SAS, iSCSI, etc.
* READ GATHERED requires bidirectional command support, which 
is not widely or efficiently supported

Protocols could add direct support for delivering more CDB bytes
(like how the ATA PACKET command delivers a SCSI CDB over
an ATA transport), but that requires a lot of changes.

---
Rob Elliott    HP Server Storage

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-22 15:48         ` Jeff Moyer
@ 2014-09-22 16:32           ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-22 16:32 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Elliott, Robert (Server Storage),
	Andreas Dilger, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

On Mon, Sep 22, 2014 at 11:48 AM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Christoph Hellwig <hch@infradead.org> writes:
>
>> Without the atomic WRITE SCATTERED use case adding the syscalls seems
>> rather pointless, and I'd really avoid blocking nice software only
>> features like the per-I/O nonblock flag (and the similarly trivial
>> per-I/O sync option I have a prototype for) on it.
>
> Andreas and Zach pointed out that the scatter/gather system calls also
> help network file systems.  I'm not yet sure how much work it would be,
> but it certainly seems worth considering readx/writex (or whatever we
> want to call them) to avoid needlessly adding a ton of system calls.
>
> Cheers,
> Jeff

I spent some time thinking about multi-position scatter/gather in
the context of this over the weekend. The non-blocking case seems easy;
the implementation I proposed needs an extra loop. Where this gets
hairy is making the non-trivial blocking case work well (as in have
concurrent requests for each of the ranges) in the filesystem code. If
that's the road we're going to go down I have a gut feeling we're
going to get stuck in the same spot(s) as the other non-blocking
buffered r/w attempts from the past.

Best,
- Milosz

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-22 16:32           ` Milosz Tanski
  (?)
@ 2014-09-22 16:42           ` Christoph Hellwig
  2014-09-22 17:02               ` Milosz Tanski
  -1 siblings, 1 reply; 167+ messages in thread
From: Christoph Hellwig @ 2014-09-22 16:42 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Jeff Moyer, Elliott, Robert (Server Storage),
	Andreas Dilger, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

On Mon, Sep 22, 2014 at 12:32:02PM -0400, Milosz Tanski wrote:
> I spent some time thinking about multi-position scatter/gather in
> the context of this over the weekend. The non-blocking case seems easy;
> the implementation I proposed needs an extra loop. Where this gets
> hairy is making the non-trivial blocking case work well (as in have
> concurrent requests for each of the ranges) in the filesystem code. If
> that's the road we're going to go down I have a gut feeling we're
> going to get stuck in the same spot(s) as the other non-blocking
> buffered r/w attempts from the past.

The other thing is that we have a basically ready, easy to use
implementation of flagged I/O (my name for the new syscalls), while
S/G I/O will take forever to discuss and is the natural vehicle for
other extensions like T10 DIX.

I'd like to suggest you consolidate your syscalls down from 4 to 2,
as suggested, by overloading the negative offset argument, giving
us two more syscall slots for S/G once it's ready.  Note that
sync S/G syscalls should of course also support these flags, although
I suspect the primary use cases for S/G I/O would be through the aio
machinery.
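
The consolidation suggested above can be pictured in userspace terms. The
following is only a sketch, assuming that an offset of -1 is the "magic"
value meaning "use and advance the current file position"; the thread does
not fix the exact value at this point. The name flagged_readv is
hypothetical, and the real dispatch would live in the kernel syscall code
rather than in a wrapper like this.

```c
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical userspace stand-in for a single "flagged read" entry point.
 * An offset of -1 (an assumed magic value) means "use and advance the file
 * position", i.e. readv() semantics; any other offset is an explicit
 * position, i.e. preadv() semantics.
 */
ssize_t flagged_readv(int fd, const struct iovec *iov, int iovcnt,
		      off_t offset)
{
	if (offset == -1)
		return readv(fd, iov, iovcnt);

	return preadv(fd, iov, iovcnt, offset);
}
```

With this shape one entry point covers both the positional and the
non-positional variant, which is what frees up the other two syscall
slots.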

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only)
  2014-09-22 16:42           ` Christoph Hellwig
@ 2014-09-22 17:02               ` Milosz Tanski
  0 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-22 17:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Elliott, Robert (Server Storage),
	Andreas Dilger, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo

I'll send out the next RFC with 2 syscalls and magic position values.
I'm waiting for Jeff to chime in on the v2 patchset before I send out
the next one.

On Mon, Sep 22, 2014 at 12:42 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Sep 22, 2014 at 12:32:02PM -0400, Milosz Tanski wrote:
>> I spent some time thinking about multi-position scatter/gather in
>> the context of this over the weekend. The non-blocking case seems easy;
>> the implementation I proposed needs an extra loop. Where this gets
>> hairy is making the non-trivial blocking case work well (as in have
>> concurrent requests for each of the ranges) in the filesystem code. If
>> that's the road we're going to go down I have a gut feeling we're
>> going to get stuck in the same spot(s) as the other non-blocking
>> buffered r/w attempts from the past.
>
> The other thing is that we have a basically ready, easy to use
> implementation of flagged I/O (my name for the new syscalls), while
> S/G I/O will take forever to discuss and is the natural vehicle for
> other extensions like T10 DIX.
>
> I'd like to suggest you consolidate your syscalls down from 4 to 2,
> as suggested, by overloading the negative offset argument, giving
> us two more syscall slots for S/G once it's ready.  Note that
> sync S/G syscalls should of course also support these flags, although
> I suspect the primary use cases for S/G I/O would be through the aio
> machinery.



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v2 4/5] O_NONBLOCK flag for readv2/preadv2
  2014-09-17 22:20     ` Milosz Tanski
@ 2014-09-22 17:12       ` Jeff Moyer
  -1 siblings, 0 replies; 167+ messages in thread
From: Jeff Moyer @ 2014-09-22 17:12 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Theodore Ts'o,
	Al Viro

Milosz Tanski <milosz@adfin.com> writes:

> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3db2e87..29b5823 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -864,8 +864,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
>  		return -EBADF;
>  	if (!(file->f_mode & FMODE_CAN_READ))
>  		return -EINVAL;
> -	if (flags & ~0)
> +	if (flags & ~RWF_NONBLOCK)
>  		return -EINVAL;
> +	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
> +		return -EAGAIN;

Just to close out our discussion on EINVAL for O_DIRECT with
RWF_NONBLOCK:

After discussing this with Zach, I agree that EAGAIN would be better.
There may be libraries that lack context and may benefit from the EAGAIN
return value.

> +/* These flags are used for the readv/writev syscalls with flags. */
> +#define RWF_NONBLOCK O_NONBLOCK

I'm not sure this makes sense.  As I mentioned earlier, the epoll variant
made sense because the flag was shared (passed unmodified to other vfs
calls that understand it).  Here we can just define an entirely new flag
space.  Unless, of course, someone can come up with a reason why this
/would/ make sense?
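
For illustration, the checks from the quoted vfs_readv() hunk combined
with an independent flag space might look like the sketch below. The
SKETCH_* names and the value 0x1 are assumptions made up for this example
and are not taken from the patch; the O_DIRECT state is passed in as a
plain boolean to keep the example self-contained.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag space, deliberately independent of the O_* values. */
#define SKETCH_RWF_NONBLOCK	0x00000001
#define SKETCH_RWF_ALL		(SKETCH_RWF_NONBLOCK)

/*
 * Mirrors the order of the checks in the quoted hunk: unknown flags are
 * rejected with -EINVAL; a non-blocking request on a file opened with
 * O_DIRECT gets -EAGAIN. Returns 0 when the request may proceed.
 */
int sketch_check_rw_flags(bool file_is_direct, int rw_flags)
{
	if (rw_flags & ~SKETCH_RWF_ALL)
		return -EINVAL;
	if (file_is_direct && (rw_flags & SKETCH_RWF_NONBLOCK))
		return -EAGAIN;
	return 0;
}
```

Keeping the per-call flags in their own namespace means a new bit can be
assigned without checking for collisions with open(2) flags.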

> diff --git a/mm/filemap.c b/mm/filemap.c
> index e0919ba..6b7aba8 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1483,7 +1483,10 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
>  		cond_resched();
>  find_page:
>  		page = find_get_page(mapping, index);
> +

Please resist the urge to add whitespace.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
  2014-09-15 20:20 ` Milosz Tanski
@ 2014-09-24 21:46   ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

This patchset introduces an ability to perform a non-blocking read from
regular files in buffered IO mode. This works only for those filesystems
that have data in the page cache.

It does this by introducing new syscalls preadv2/pwritev2. These
new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
extra flag argument (RWF_NONBLOCK).

It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
to perform buffered IO operations. They submit the work from another thread
that performs network IO and epoll or other threads that perform CPU work. This
leads to increased latency for processing, esp. in the case of data that's
already cached in the page cache.

With the new interface the applications will now be able to fetch the data in
their network / cpu bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" requests by avoiding unnecessary queuing and having to swap out current
tasks in IO bound work threads.
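
The "try the page cache first, defer only on a miss" pattern described
above could be sketched as follows. This is not code from the patchset:
it assumes the preadv2() wrapper and the RWF_NOWAIT flag that glibc and
mainline Linux provide today, which play the role of the RWF_NONBLOCK
flag proposed in this RFC. The function names are hypothetical, and the
inline pread() stands in for handing the request to a thread pool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* preadv2() and RWF_* are GNU extensions in glibc */
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Try to satisfy a read from the page cache without blocking.
 * Returns the byte count (possibly short) when data was copied, or -1
 * with errno set. EAGAIN means "not cached, take the slow path".
 * Builds without RWF_NOWAIT always report EAGAIN.
 */
ssize_t try_fast_read(int fd, void *buf, size_t len, off_t off)
{
#ifdef RWF_NOWAIT
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return preadv2(fd, &iov, 1, off, RWF_NOWAIT);
#else
	(void)fd; (void)buf; (void)len; (void)off;
	errno = EAGAIN;
	return -1;
#endif
}

/*
 * Fast path first, ordinary blocking pread() as the fallback. In a real
 * server the fallback would be queued to the I/O thread pool instead of
 * being called inline.
 */
ssize_t read_fast_or_blocking(int fd, void *buf, size_t len, off_t off)
{
	ssize_t n = try_fast_read(fd, buf, len, off);

	if (n >= 0)
		return n;

	/* Any fast-path failure falls back to the blocking read. */
	return pread(fd, buf, len, off);
}
```

Falling back on every error, not only EAGAIN, keeps the sketch working on
kernels or filesystems that reject the flag.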

Version 3 highlights:
 - Down to 2 syscalls from 4; can use fp or argument position.
 - RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.

Version 2 highlights:
 - Put the flags argument into kiocb (less noise), per Al Viro
 - O_DIRECT checking early in the process, per Jeff Moyer
 - Resolved duplicate (c&p) code in syscall code, per Jeff
 - Included perf data in thread cover letter, per Jeff
 - Created a new flag (not O_NONBLOCK) for readv2, per Jeff

Some perf data generated using fio comparing the posix aio engine to a version
of the posix AIO engine that attempts to perform "fast" reads before
submitting the operations to the queue. This workload is on an ext4 partition
on raid0 (test / build-rig), simulating our database access pattern using
16kb read accesses. Our database uses a home-spun posix aio like queue (samba
does the same thing).

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
    bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
                 stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
         mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
    bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
                 stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
         mint=600020msec, maxt=600178msec

Interpreting the results you can see total bandwidth stays the same but overall
request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
workloads. There is a slight bump in latency for f2 since it's random data
that's unlikely to be cached but we're always trying "fast read".

In our application we have started keeping track of "fast read" hits/misses
and for files / requests that have a low hit ratio we don't do "fast reads",
mostly getting rid of extra latency in the uncached cases.
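
The hit/miss bookkeeping mentioned above is not shown in this posting. One
way it might look is sketched below; the structure, the function names and
both thresholds are assumptions chosen purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file counters; names and thresholds are illustrative. */
struct fast_read_stats {
	uint64_t hits;		/* fast reads served from the page cache */
	uint64_t misses;	/* fast reads that found nothing cached */
};

#define FAST_READ_MIN_SAMPLES	64	/* always probe until we have data */
#define FAST_READ_MIN_HIT_PCT	10	/* below this, skip the fast read */

/* Record the outcome of one fast-read attempt. */
void fast_read_record(struct fast_read_stats *st, bool hit)
{
	if (hit)
		st->hits++;
	else
		st->misses++;
}

/* Decide whether the next request should attempt a fast read at all. */
bool fast_read_worthwhile(const struct fast_read_stats *st)
{
	uint64_t total = st->hits + st->misses;

	if (total < FAST_READ_MIN_SAMPLES)
		return true;

	/* hits/total >= pct/100, rearranged to avoid floating point. */
	return st->hits * 100 >= total * FAST_READ_MIN_HIT_PCT;
}
```

A production version would probably also decay or periodically reset the
counters, since in this sketch a file that once falls below the threshold
is never probed again.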

I've performed other benchmarks and I have not observed any perf regressions in
any of the normal (old) code paths.

I have co-developed these changes with Christoph Hellwig.

Milosz Tanski (4):
  vfs: Prepare for adding a new preadv/pwritev with user flags.
  vfs: Define new syscalls preadv2,pwritev2
  vfs: Export new vector IO syscalls (with flags) to userland
  vfs: RWF_NONBLOCK flag for preadv2

 arch/x86/syscalls/syscall_32.tbl  |   2 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 drivers/target/target_core_file.c |   6 +-
 fs/cifs/file.c                    |   6 ++
 fs/nfsd/vfs.c                     |   4 +-
 fs/ocfs2/file.c                   |   6 ++
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/xfs/xfs_file.c                 |   4 ++
 include/linux/aio.h               |   2 +
 include/linux/fs.h                |   7 ++-
 include/linux/syscalls.h          |   6 ++
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |  22 ++++++-
 mm/shmem.c                        |   4 ++
 16 files changed, 163 insertions(+), 40 deletions(-)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
@ 2014-09-24 21:46   ` Milosz Tanski
  0 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

This patchset introduces an ability to perform a non-blocking read from
regular files in buffered IO mode. This works only for those filesystems
that have data in the page cache.

It does this by introducing new syscalls preadv2/pwritev2. These
new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
extra flag argument (RWF_NONBLOCK).

It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
to perform buffered IO operations. They submit the work from another thread
that performs network IO and epoll or other threads that perform CPU work. This
leads to increased latency for processing, esp. in the case of data that's
already cached in the page cache.

With the new interface the applications will now be able to fetch the data in
their network / cpu bound thread(s) and only defer to a threadpool if it's not
there. In our own application (VLDB) we've observed a decrease in latency for
"fast" requests by avoiding unnecessary queuing and having to swap out current
tasks in IO bound work threads.

Version 3 highlights:
 - Down to 2 syscalls from 4; can use fp or argument position.
 - RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.

Version 2 highlights:
 - Put the flags argument into kiocb (less noise), per Al Viro
 - O_DIRECT checking early in the process, per Jeff Moyer
 - Resolved duplicate (c&p) code in syscall code, per Jeff
 - Included perf data in thread cover letter, per Jeff
 - Created a new flag (not O_NONBLOCK) for readv2, per Jeff

Some perf data generated using fio, comparing the posix aio engine to a version
of the posix aio engine that attempts to perform "fast" reads before
submitting the operations to the queue. This workload runs on an ext4 partition
on raid0 (test / build-rig), simulating our database access pattern using
16kb read accesses. Our database uses a home-spun posix aio like queue (samba
does the same thing).

f1: ~73% rand read over mostly cached data (zipf med-size dataset)
f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
f3: ~9% seq-read over large dataset

before:

f1:
    bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
    lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
    lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
f2:
    bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
    lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
    lat (msec) : >=2000=4.33%
f3:
    bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
                 stdev=34526.89
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
    lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
    lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
total:
   READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
         mint=600001msec, maxt=600113msec

after (with fast read using preadv2 before submit):

f1:
    bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
    lat (usec) : 2=70.63%, 4=0.01%
    lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
f2:
    bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
    lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
    lat (msec) : >=2000=9.99%
f3:
    bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
                 stdev=35995.60
    lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
    lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
    lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
    lat (msec) : 100=0.05%, 250=0.02%
total:
   READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
         mint=600020msec, maxt=600178msec

Interpreting the results, you can see total bandwidth stays the same but overall
request latency is decreased in the f1 (random, mostly cached) and f3
(sequential) workloads. There is a slight bump in latency for f2, since it's
random data that's unlikely to be cached but we're always trying "fast read".

In our application we have started keeping track of "fast read" hits/misses,
and for files / requests that have a low hit ratio we don't do "fast reads",
mostly getting rid of the extra latency in the uncached cases.
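
A minimal sketch of that kind of hit/miss bookkeeping, assuming a simple
per-file counter pair. This is not our application's code; the struct, the
function names and both thresholds are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file (or per-request-class) counters. */
struct fast_read_stats {
	uint64_t hits;   /* fast reads served from the page cache */
	uint64_t misses; /* fast reads that had to be deferred */
};

/* Illustrative thresholds, not tuned values. */
#define FAST_READ_MIN_SAMPLES 64
#define FAST_READ_MIN_HIT_PCT 25

/* Record the outcome of one fast read attempt. */
static void fast_read_record(struct fast_read_stats *s, bool hit)
{
	if (hit)
		s->hits++;
	else
		s->misses++;
}

/*
 * Decide whether a fast read attempt is worth making. With too few samples
 * we keep trying; after that we skip the attempt once the hit ratio drops
 * below the threshold.
 */
static bool fast_read_worthwhile(const struct fast_read_stats *s)
{
	uint64_t total = s->hits + s->misses;

	if (total < FAST_READ_MIN_SAMPLES)
		return true;
	return s->hits * 100 >= total * FAST_READ_MIN_HIT_PCT;
}
```

Once the check turns false no new samples are collected, so a real
implementation would probably decay the counters or probe occasionally to
notice when a file becomes cached again.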

I've performed other benchmarks and I have not observed any perf regressions in
any of the normal (old) code paths.

I have co-developed these changes with Christoph Hellwig.

Milosz Tanski (4):
  vfs: Prepare for adding a new preadv/pwritev with user flags.
  vfs: Define new syscalls preadv2,pwritev2
  vfs: Export new vector IO syscalls (with flags) to userland
  vfs: RWF_NONBLOCK flag for preadv2

 arch/x86/syscalls/syscall_32.tbl  |   2 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 drivers/target/target_core_file.c |   6 +-
 fs/cifs/file.c                    |   6 ++
 fs/nfsd/vfs.c                     |   4 +-
 fs/ocfs2/file.c                   |   6 ++
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/xfs/xfs_file.c                 |   4 ++
 include/linux/aio.h               |   2 +
 include/linux/fs.h                |   7 ++-
 include/linux/syscalls.h          |   6 ++
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |  22 ++++++-
 mm/shmem.c                        |   4 ++
 16 files changed, 163 insertions(+), 40 deletions(-)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [RFC v3 1/4] vfs: Prepare for adding a new preadv/pwritev with user flags.
  2014-09-24 21:46   ` Milosz Tanski
@ 2014-09-24 21:46     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

Plumb the flags argument through the vfs code so it can be passed down to the
__generic_file_(read/write)_iter functions that do the actual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 drivers/target/target_core_file.c |  6 +++---
 fs/nfsd/vfs.c                     |  4 ++--
 fs/read_write.c                   | 28 ++++++++++++++++------------
 fs/splice.c                       |  2 +-
 include/linux/aio.h               |  2 ++
 include/linux/fs.h                |  4 ++--
 mm/filemap.c                      |  2 +-
 7 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 7d6cdda..58d9a6d 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -350,9 +350,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -528,7 +528,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f501a9b..db7a31d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -855,7 +855,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -943,7 +943,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 009d854..9f6d13d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -651,7 +651,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -660,6 +661,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
 	kiocb.ki_nbytes = len;
+	kiocb.ki_rwflags = flags;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
 	ret = fn(&kiocb, &iter);
@@ -798,7 +800,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -832,7 +835,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -855,27 +858,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -888,7 +891,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -908,7 +911,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -940,7 +943,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -964,7 +967,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1012,7 +1015,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1111,6 +1114,7 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 	return __compat_sys_preadv64(fd, vec, vlen, pos);
 }
 
+
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
 			    unsigned long vlen, loff_t *pos)
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba..9591b9f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	int			ki_rwflags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..e9bea52 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1556,9 +1556,9 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/mm/filemap.c b/mm/filemap.c
index 90effcd..6e3ba07 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1456,7 +1456,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v3 2/4] vfs: Define new syscalls preadv2,pwritev2
  2014-09-24 21:46   ` Milosz Tanski
@ 2014-09-24 21:46     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

New syscalls that take a flags argument. This change does not add any specific
flags.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c                   | 82 ++++++++++++++++++++++++++++++++-------
 include/linux/syscalls.h          |  6 +++
 include/uapi/asm-generic/unistd.h |  6 ++-
 mm/filemap.c                      |  2 +-
 4 files changed, 80 insertions(+), 16 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9f6d13d..a983fc1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -864,6 +864,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -877,21 +879,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
 
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+			unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -903,15 +907,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -929,10 +933,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
 }
 
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -943,7 +946,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -953,10 +956,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+			  unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -967,7 +969,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -977,6 +979,58 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_readv(fd, vec, vlen, flags);
+
+	return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_writev(fd, vec, vlen, flags);
+
+	return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..830285f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -570,8 +570,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc..589add9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 280
+__SC_COMP(__NR_preadv2, sys_preadv2)
+#define __NR_pwritev2 281
+__SC_COMP(__NR_pwritev2, sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -707,7 +711,7 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 282
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6e3ba07..e0919ba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1726,7 +1726,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
 out:
 	return retval;
 }
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v3 2/4] vfs: Define new syscalls preadv2,pwritev2
@ 2014-09-24 21:46     ` Milosz Tanski
  0 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

New syscalls that take a flags argument. This change does not add any specific
flags.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/read_write.c                   | 82 ++++++++++++++++++++++++++++++++-------
 include/linux/syscalls.h          |  6 +++
 include/uapi/asm-generic/unistd.h |  6 ++-
 mm/filemap.c                      |  2 +-
 4 files changed, 80 insertions(+), 16 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 9f6d13d..a983fc1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -864,6 +864,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -877,21 +879,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
 
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+			unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -903,15 +907,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -929,10 +933,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
 }
 
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -943,7 +946,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -953,10 +956,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+			  unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -967,7 +969,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -977,6 +979,58 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_readv(fd, vec, vlen, flags);
+
+	return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_writev(fd, vec, vlen, flags);
+
+	return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85..830285f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -570,8 +570,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc..589add9 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 280
+__SC_COMP(__NR_preadv2, sys_preadv2)
+#define __NR_pwritev2 281
+__SC_COMP(__NR_pwritev2, sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -707,7 +711,7 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
 __SYSCALL(__NR_memfd_create, sys_memfd_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 282
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6e3ba07..e0919ba 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1726,7 +1726,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
 out:
 	return retval;
 }
-- 
2.1.0

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v3 3/4] vfs: Export new vector IO syscalls (with flags) to userland
  2014-09-24 21:46   ` Milosz Tanski
@ 2014-09-24 21:46     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

This is only for x86_64 and x86; other architectures will be added later.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 arch/x86/syscalls/syscall_32.tbl | 2 ++
 arch/x86/syscalls/syscall_64.tbl | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b781..fbd98ab 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,5 @@
 354	i386	seccomp			sys_seccomp
 355	i386	getrandom		sys_getrandom
 356	i386	memfd_create		sys_memfd_create
+357	i386	preadv2			sys_preadv2
+358	i386	pwritev2		sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922..5c91cf6 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,8 @@
 318	common	getrandom		sys_getrandom
 319	common	memfd_create		sys_memfd_create
 320	common	kexec_file_load		sys_kexec_file_load
+321	64	preadv2			sys_preadv2
+322	64	pwritev2		sys_pwritev2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [RFC v3 4/4] vfs: RWF_NONBLOCK flag for preadv2
  2014-09-24 21:46   ` Milosz Tanski
@ 2014-09-24 21:46     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-24 21:46 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

Filesystems that use generic_file_read_iter will now be able to perform
non-blocking reads. Data is only read if it is already in the page cache and
there is no page error (which would force a re-read).

Christoph Hellwig wrote the filesystem-specific code (cifs, ocfs2, shmem, xfs).

Signed-off-by: Milosz Tanski <milosz@adfin.com>
---
 fs/cifs/file.c     |  6 ++++++
 fs/ocfs2/file.c    |  6 ++++++
 fs/pipe.c          |  3 ++-
 fs/read_write.c    | 21 ++++++++++++++-------
 fs/xfs/xfs_file.c  |  4 ++++
 include/linux/fs.h |  3 +++
 mm/filemap.c       | 18 ++++++++++++++++++
 mm/shmem.c         |  4 ++++
 8 files changed, 57 insertions(+), 8 deletions(-)
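
Reader's note (not part of the patch): the userspace pattern this flag is meant
to enable is "try the read without blocking, and hand it to a worker only when
the kernel answers EAGAIN". A rough sketch of that control flow follows. The
read_fn type, read_fast_or_defer and the three stub readers are hypothetical
stand-ins, used so the example does not depend on the unmerged syscall numbers
from this RFC; in a real program the first callback would wrap preadv2 with
RWF_NONBLOCK and the second would queue the request to a thread pool.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical reader signature: bytes read, or -1 with errno set. */
typedef ssize_t (*read_fn)(void *buf, size_t len, off_t off);

enum read_path { PATH_FAST, PATH_DEFERRED };

/*
 * Try the non-blocking reader first. Only EAGAIN ("not in the page cache")
 * sends the request down the blocking path; success and every other error
 * are reported straight from the fast path.
 */
static ssize_t read_fast_or_defer(read_fn try_nonblock, read_fn blocking,
				  void *buf, size_t len, off_t off,
				  enum read_path *path)
{
	ssize_t n;

	errno = 0;
	n = try_nonblock(buf, len, off);
	if (n >= 0 || errno != EAGAIN) {
		*path = PATH_FAST;
		return n;
	}

	*path = PATH_DEFERRED;
	return blocking(buf, len, off);
}

/* Stub: behaves as if the data were already cached. */
static ssize_t stub_cached(void *buf, size_t len, off_t off)
{
	(void)off;
	memset(buf, 'c', len);
	return (ssize_t)len;
}

/* Stub: behaves as if the page were missing from the cache. */
static ssize_t stub_uncached(void *buf, size_t len, off_t off)
{
	(void)buf;
	(void)len;
	(void)off;
	errno = EAGAIN;
	return -1;
}

/* Stub: stands in for the blocking worker-thread read. */
static ssize_t stub_blocking(void *buf, size_t len, off_t off)
{
	(void)off;
	memset(buf, 'b', len);
	return (ssize_t)len;
}
```

Judging from the diff, a read that finds only part of the range cached may
return a short count rather than EAGAIN, so a caller would presumably need to
issue the remainder separately. That is an inference from the code, not
something the commit message states.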

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 7c018a1..e7169ba 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3005,6 +3005,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3123,6 +3126,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to);
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 2930e23..d96f60d 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2473,6 +2473,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (iocb->ki_rwflags & RWF_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index a983fc1..5592a18 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -833,14 +833,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
@@ -864,8 +869,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
 		return -EINVAL;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+		return -EAGAIN;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index de5368c..cf61271 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -246,6 +246,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e9bea52..b884975 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1477,6 +1477,9 @@ struct block_device_operations;
 #define HAVE_COMPAT_IOCTL 1
 #define HAVE_UNLOCKED_IOCTL 1
 
+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK 0x00000001
+
 struct iov_iter;
 
 struct file_operations {
diff --git a/mm/filemap.c b/mm/filemap.c
index e0919ba..86ed6f7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1484,6 +1484,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (flags & RWF_NONBLOCK)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1575,6 +1577,11 @@ page_ok:
 		continue;
 
 page_not_up_to_date:
+		if (flags & RWF_NONBLOCK) {
+			page_cache_release(page);
+			goto would_block;
+		}
+
 		/* Get exclusive access to the page ... */
 		error = lock_page_killable(page);
 		if (unlikely(error))
@@ -1594,6 +1601,12 @@ page_not_up_to_date_locked:
 			goto page_ok;
 		}
 
+		if (flags & RWF_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto would_block;
+		}
+
 readpage:
 		/*
 		 * A previous I/O error may have been due to temporary
@@ -1664,6 +1677,8 @@ no_cached_page:
 		goto readpage;
 	}
 
+would_block:
+	error = -EAGAIN;
 out:
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1697,6 +1712,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		size_t count = iov_iter_count(iter);
 		loff_t size;
 
+		if (iocb->ki_rwflags & RWF_NONBLOCK)
+			return -EAGAIN;
+
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
diff --git a/mm/shmem.c b/mm/shmem.c
index 0e5fb22..ca2cae2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1531,6 +1531,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
@ 2014-09-25  4:06     ` Michael Kerrisk
  0 siblings, 0 replies; 167+ messages in thread
From: Michael Kerrisk @ 2014-09-25  4:06 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Linux Kernel, Christoph Hellwig, Linux-Fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API

Hello Milosz,

On Wed, Sep 24, 2014 at 11:46 PM, Milosz Tanski <milosz@adfin.com> wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
>
> It does this by introducing the new syscalls preadv2/pwritev2. These
> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> extra flag argument (RWF_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
> to perform buffered IO operations. They submit the work from another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" requests by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.

Since this is a change to the user-space API, could you CC future
versions of this patch set to linux-api@vger.kernel.org please, as
per Documentation/SubmitChecklist. See also
https://www.kernel.org/doc/man-pages/linux-api-ml.html.

Thanks,

Michael


> Version 3 highlights:
>  - Down to 2 syscalls from 4; can use the file position or the position argument.
>  - The RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.
>
> Version 2 highlights:
>  - Put the flags argument into kiocb (less noise), per Al Viro
>  - O_DIRECT checking early in the process, per Jeff Moyer
>  - Resolved duplicate (c&p) code in syscall code, per Jeff
>  - Included perf data in thread cover letter, per Jeff
>  - Created a new flag (not O_NONBLOCK) for readv2, per Jeff
>
>
> Some perf data generated using fio comparing the posix aio engine to a version
> of the posix AIO engine that attempts to perform "fast" reads before
> submitting the operations to the queue. This workload runs on an ext4 partition
> on raid0 (test / build rig), simulating our database access pattern using
> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> does the same thing.)
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
>     bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>     lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>     lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
>     bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>     lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>     lat (msec) : >=2000=4.33%
> f3:
>     bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
>                  stdev=34526.89
>     lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>     lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>     lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
>    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>          mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
>     bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>     lat (usec) : 2=70.63%, 4=0.01%
>     lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
>     bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>     lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>     lat (msec) : >=2000=9.99%
> f3:
>     bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
>                  stdev=35995.60
>     lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>     lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>     lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%
> total:
>    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>          mint=600020msec, maxt=600178msec
>
> Interpreting the results you can see total bandwidth stays the same but overall
> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> workloads. There is a slight bump in latency for f2 since it's random data
> that's unlikely to be cached but we're always trying "fast read".
>
> In our application we have started keeping track of "fast read" hits/misses
> and for files / requests that have a low hit ratio we don't do "fast reads",
> mostly getting rid of extra latency in the uncached cases.
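
The hit/miss bookkeeping described in the quoted paragraph could look roughly
like the sketch below. The cover letter gives no details of the application's
implementation, so everything here is an assumption made for illustration: the
struct, the function names, the 64-sample warm-up and the 25% threshold are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file (or per-request-class) counters. */
struct fast_read_stats {
	uint64_t hits;   /* fast read returned data */
	uint64_t misses; /* fast read returned EAGAIN */
};

#define FAST_READ_MIN_SAMPLES 64 /* assumed warm-up before judging */
#define FAST_READ_MIN_HIT_PCT 25 /* assumed minimum hit percentage */

/* Record the outcome of one fast-read attempt. */
static void fast_read_record(struct fast_read_stats *s, bool hit)
{
	if (hit)
		s->hits++;
	else
		s->misses++;
}

/*
 * Decide whether to attempt a fast read at all. Until enough samples are
 * collected the answer is always yes; afterwards the attempt is skipped
 * when the observed hit percentage falls below the threshold.
 */
static bool fast_read_worthwhile(const struct fast_read_stats *s)
{
	uint64_t total = s->hits + s->misses;

	if (total < FAST_READ_MIN_SAMPLES)
		return true;
	return s->hits * 100 >= total * FAST_READ_MIN_HIT_PCT;
}
```

A production version would probably decay or periodically reset the counters
so that a file which becomes cached later is given another chance; that is
left out here to keep the sketch short.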
>
> I've performed other benchmarks and I have not observed any perf regressions in
> any of the normal (old) code paths.
>
>
> I have co-developed these changes with Christoph Hellwig.
>
> Milosz Tanski (4):
>   vfs: Prepare for adding a new preadv/pwritev with user flags.
>   vfs: Define new syscalls preadv2,pwritev2
>   vfs: Export new vector IO syscalls (with flags) to userland
>   vfs: RWF_NONBLOCK flag for preadv2
>
>  arch/x86/syscalls/syscall_32.tbl  |   2 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  drivers/target/target_core_file.c |   6 +-
>  fs/cifs/file.c                    |   6 ++
>  fs/nfsd/vfs.c                     |   4 +-
>  fs/ocfs2/file.c                   |   6 ++
>  fs/pipe.c                         |   3 +-
>  fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
>  fs/splice.c                       |   2 +-
>  fs/xfs/xfs_file.c                 |   4 ++
>  include/linux/aio.h               |   2 +
>  include/linux/fs.h                |   7 ++-
>  include/linux/syscalls.h          |   6 ++
>  include/uapi/asm-generic/unistd.h |   6 +-
>  mm/filemap.c                      |  22 ++++++-
>  mm/shmem.c                        |   4 ++
>  16 files changed, 163 insertions(+), 40 deletions(-)
>
> --
> 2.1.0
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
@ 2014-09-25  4:06     ` Michael Kerrisk
  0 siblings, 0 replies; 167+ messages in thread
From: Michael Kerrisk @ 2014-09-25  4:06 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Linux Kernel, Christoph Hellwig, Linux-Fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API

Hello Milosz,

On Wed, Sep 24, 2014 at 11:46 PM, Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
>
> It does this by introducing the new syscalls preadv2/pwritev2. These
> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> extra flag argument (RWF_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
> to perform buffered IO operations. They submit the work from another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" requests by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.

Since this is a change to the user-space API, could you CC future
versions of this patch set to linux-api@vger.kernel.org please, as
per Documentation/SubmitChecklist. See also
https://www.kernel.org/doc/man-pages/linux-api-ml.html.

Thanks,

Michael


> Version 3 highlights:
>  - Down to 2 syscalls from 4; can use fp or argument position.
>  - RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.
>
> Version 2 highlights:
>  - Put the flags argument into kiocb (less noise), per Al Viro
>  - O_DIRECT checking early in the process, per Jeff Moyer
>  - Resolved duplicate (c&p) code in syscall code, per Jeff
>  - Included perf data in thread cover letter, per Jeff
>  - Created a new flag (not O_NONBLOCK) for readv2, per Jeff
>
>
> Some perf data generated using fio comparing the posix aio engine to a version
> of the posix aio engine that attempts to perform "fast" reads before
> submitting the operations to the queue. This workload is on an ext4 partition on
> raid0 (test / build-rig), simulating our database access pattern using
> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> does the same thing).
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
>     bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>     lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>     lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
>     bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>     lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>     lat (msec) : >=2000=4.33%
> f3:
>     bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
>                  stdev=34526.89
>     lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>     lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>     lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
>    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>          mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
>     bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>     lat (usec) : 2=70.63%, 4=0.01%
>     lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
>     bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>     lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>     lat (msec) : >=2000=9.99%
> f3:
>     bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
>                  stdev=35995.60
>     lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>     lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>     lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%
> total:
>    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>          mint=600020msec, maxt=600178msec
>
> Interpreting the results you can see total bandwidth stays the same but overall
> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> workloads. There is a slight bump in latency for f2 since it's random data that's
> unlikely to be cached but we're always trying "fast read".
>
> In our application we have started keeping track of "fast read" hits/misses
> and for files / requests that have a low hit ratio we don't do "fast reads",
> mostly getting rid of extra latency in the uncached cases.
>
> I've performed other benchmarks and I have not observed any perf regressions in
> any of the normal (old) code paths.
>
>
> I have co-developed these changes with Christoph Hellwig.
>
> Milosz Tanski (4):
>   vfs: Prepare for adding a new preadv/pwritev with user flags.
>   vfs: Define new syscalls preadv2,pwritev2
>   vfs: Export new vector IO syscalls (with flags) to userland
>   vfs: RWF_NONBLOCK flag for preadv2
>
>  arch/x86/syscalls/syscall_32.tbl  |   2 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  drivers/target/target_core_file.c |   6 +-
>  fs/cifs/file.c                    |   6 ++
>  fs/nfsd/vfs.c                     |   4 +-
>  fs/ocfs2/file.c                   |   6 ++
>  fs/pipe.c                         |   3 +-
>  fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
>  fs/splice.c                       |   2 +-
>  fs/xfs/xfs_file.c                 |   4 ++
>  include/linux/aio.h               |   2 +
>  include/linux/fs.h                |   7 ++-
>  include/linux/syscalls.h          |   6 ++
>  include/uapi/asm-generic/unistd.h |   6 +-
>  mm/filemap.c                      |  22 ++++++-
>  mm/shmem.c                        |   4 ++
>  16 files changed, 163 insertions(+), 40 deletions(-)
>
> --
> 2.1.0
>



-- 
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
  2014-09-25  4:06     ` Michael Kerrisk
@ 2014-09-25 11:16       ` Jan Kara
  -1 siblings, 0 replies; 167+ messages in thread
From: Jan Kara @ 2014-09-25 11:16 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Milosz Tanski, Linux Kernel, Christoph Hellwig, Linux-Fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API

On Thu 25-09-14 06:06:14, Michael Kerrisk wrote:
> Hello Milosz,
> 
> On Wed, Sep 24, 2014 at 11:46 PM, Milosz Tanski <milosz@adfin.com> wrote:
> > This patchset introduces the ability to perform a non-blocking read from
> > regular files in buffered IO mode. This works only for those filesystems
> > that have data in the page cache.
> >
> > It does this by introducing the new syscalls preadv2/pwritev2. These
> > new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> > extra flag argument (RWF_NONBLOCK).
> >
> > It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
> > to perform buffered IO operations. They submit the work from another thread
> > that performs network IO and epoll or other threads that perform CPU work. This
> > leads to increased latency for processing, esp. in the case of data that's
> > already cached in the page cache.
> >
> > With the new interface the applications will now be able to fetch the data in
> > their network / cpu bound thread(s) and only defer to a threadpool if it's not
> > there. In our own application (VLDB) we've observed a decrease in latency for
> > "fast" requests by avoiding unnecessary queuing and having to swap out current
> > tasks in IO bound work threads.
> 
> Since this is a change to the user-space API, could you CC future
> versions of this patch set to linux-api@vgerr.kernel.org please, as
  There's a typo in the address. It should be:
linux-api@vger.kernel.org

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
  2014-09-25  4:06     ` Michael Kerrisk
@ 2014-09-25 15:48       ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-09-25 15:48 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: Linux Kernel, Christoph Hellwig, Linux-Fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API

On Thu, Sep 25, 2014 at 12:06 AM, Michael Kerrisk
<mtk.manpages@gmail.com> wrote:
> Hello Milosz,
>
> On Wed, Sep 24, 2014 at 11:46 PM, Milosz Tanski <milosz@adfin.com> wrote:
>> This patchset introduces the ability to perform a non-blocking read from
>> regular files in buffered IO mode. This works only for those filesystems
>> that have data in the page cache.
>>
>> It does this by introducing the new syscalls preadv2/pwritev2. These
>> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
>> extra flag argument (RWF_NONBLOCK).
>>
>> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
>> to perform buffered IO operations. They submit the work from another thread
>> that performs network IO and epoll or other threads that perform CPU work. This
>> leads to increased latency for processing, esp. in the case of data that's
>> already cached in the page cache.
>>
>> With the new interface the applications will now be able to fetch the data in
>> their network / cpu bound thread(s) and only defer to a threadpool if it's not
>> there. In our own application (VLDB) we've observed a decrease in latency for
>> "fast" requests by avoiding unnecessary queuing and having to swap out current
>> tasks in IO bound work threads.
>
> Since this is a change to the user-space API, could you CC future
> versions of this patch set to linux-api@vgerr.kernel.org please, as
> per Documentation/SubmitChecklist. See also
> https://www.kernel.org/doc/man-pages/linux-api-ml.html.

Will do and sorry about this; also I noted Jan's correction.

>
> Thanks,
>
> Michael
>
>
>> Version 3 highlights:
>>  - Down to 2 syscalls from 4; can use fp or argument position.
>>  - RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.
>>
>> Version 2 highlights:
>>  - Put the flags argument into kiocb (less noise), per Al Viro
>>  - O_DIRECT checking early in the process, per Jeff Moyer
>>  - Resolved duplicate (c&p) code in syscall code, per Jeff
>>  - Included perf data in thread cover letter, per Jeff
>>  - Created a new flag (not O_NONBLOCK) for readv2, per Jeff
>>
>>
>> Some perf data generated using fio comparing the posix aio engine to a version
>> of the posix aio engine that attempts to perform "fast" reads before
>> submitting the operations to the queue. This workload is on an ext4 partition on
>> raid0 (test / build-rig), simulating our database access pattern using
>> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
>> does the same thing).
>>
>> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
>> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
>> f3: ~9% seq-read over large dataset
>>
>> before:
>>
>> f1:
>>     bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>>     lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>>     lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
>> f2:
>>     bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>>     lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>>     lat (msec) : >=2000=4.33%
>> f3:
>>     bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
>>                  stdev=34526.89
>>     lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>>     lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>>     lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>>     lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
>> total:
>>    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>>          mint=600001msec, maxt=600113msec
>>
>> after (with fast read using preadv2 before submit):
>>
>> f1:
>>     bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>>     lat (usec) : 2=70.63%, 4=0.01%
>>     lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
>> f2:
>>     bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>>     lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>>     lat (msec) : >=2000=9.99%
>> f3:
>>     bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
>>                  stdev=35995.60
>>     lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>>     lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>>     lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>>     lat (msec) : 100=0.05%, 250=0.02%
>> total:
>>    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>>          mint=600020msec, maxt=600178msec
>>
>> Interpreting the results you can see total bandwidth stays the same but overall
>> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
>> workloads. There is a slight bump in latency for f2 since it's random data that's
>> unlikely to be cached but we're always trying "fast read".
>>
>> In our application we have started keeping track of "fast read" hits/misses
>> and for files / requests that have a low hit ratio we don't do "fast reads",
>> mostly getting rid of extra latency in the uncached cases.
>>
>> I've performed other benchmarks and I have not observed any perf regressions in
>> any of the normal (old) code paths.
>>
>>
>> I have co-developed these changes with Christoph Hellwig.
>>
>> Milosz Tanski (4):
>>   vfs: Prepare for adding a new preadv/pwritev with user flags.
>>   vfs: Define new syscalls preadv2,pwritev2
>>   vfs: Export new vector IO syscalls (with flags) to userland
>>   vfs: RWF_NONBLOCK flag for preadv2
>>
>>  arch/x86/syscalls/syscall_32.tbl  |   2 +
>>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>>  drivers/target/target_core_file.c |   6 +-
>>  fs/cifs/file.c                    |   6 ++
>>  fs/nfsd/vfs.c                     |   4 +-
>>  fs/ocfs2/file.c                   |   6 ++
>>  fs/pipe.c                         |   3 +-
>>  fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
>>  fs/splice.c                       |   2 +-
>>  fs/xfs/xfs_file.c                 |   4 ++
>>  include/linux/aio.h               |   2 +
>>  include/linux/fs.h                |   7 ++-
>>  include/linux/syscalls.h          |   6 ++
>>  include/uapi/asm-generic/unistd.h |   6 +-
>>  mm/filemap.c                      |  22 ++++++-
>>  mm/shmem.c                        |   4 ++
>>  16 files changed, 163 insertions(+), 40 deletions(-)
>>
>> --
>> 2.1.0
>>
>
>
>
> --
> Michael Kerrisk Linux man-pages maintainer;
> http://www.kernel.org/doc/man-pages/
> Author of "The Linux Programming Interface", http://blog.man7.org/



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
  2014-09-24 21:46   ` Milosz Tanski
@ 2014-10-08  2:53     ` Milosz Tanski
  -1 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-10-08  2:53 UTC (permalink / raw)
  To: LKML
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

I haven't seen too many replies to this last request, so I imagine
there are not too many things to fix, which is good to see.

Additionally, with Jeff's help I wrote up man pages to be ready to go
with these changes.

I'm thinking of changing the preadv2 call to be a preadv6/pwritev6
call for the next submission... so the number represents the argument
count. This should more closely follow the other syscall convention
(like accept4, wait4, dup3).

I'm going to be away for a week. I hope to have the next version (with
the above change) for you guys to review next Wednesday.

Best,
- Milosz

On Wed, Sep 24, 2014 at 5:46 PM, Milosz Tanski <milosz@adfin.com> wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
>
> It does this by introducing the new syscalls preadv2/pwritev2. These
> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> extra flag argument (RWF_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
> to perform buffered IO operations. They submit the work from another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" requests by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.
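
As a rough illustration of the pattern described above, here is a minimal C
sketch: try a page-cache-only read first and hand the request to the slow path
only on a miss. It assumes a libc that exposes preadv2() with an RWF_NOWAIT
flag, which stands in for the RWF_NONBLOCK flag proposed in this series.
try_fast_read(), slow_read(), read_request() and demo_roundtrip() are
hypothetical names, and slow_read() is a plain blocking pread() standing in
for a real thread pool.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: read only what is already in the page cache.
 * Returns the byte count on a hit, or -1 if the caller should defer. */
static ssize_t try_fast_read(int fd, void *buf, size_t len, off_t off)
{
#ifdef RWF_NOWAIT
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);

	/* EAGAIN means the data is not cached; any other error (for
	 * example an unsupported flag) is also treated as a miss. */
	return n >= 0 ? n : -1;
#else
	(void)fd; (void)buf; (void)len; (void)off;
	return -1;	/* no flag available: always take the slow path */
#endif
}

/* Stand-in for queueing to a thread pool: a plain blocking pread(). */
static ssize_t slow_read(int fd, void *buf, size_t len, off_t off)
{
	return pread(fd, buf, len, off);
}

/* Try the fast path first; fall back to the slow path on a miss. */
static ssize_t read_request(int fd, void *buf, size_t len, off_t off,
			    bool *was_fast)
{
	ssize_t n = try_fast_read(fd, buf, len, off);

	*was_fast = n >= 0;
	return *was_fast ? n : slow_read(fd, buf, len, off);
}

/* Small round trip over a temporary file; returns 0 on success. */
static int demo_roundtrip(void)
{
	char path[] = "/tmp/fastread-XXXXXX";
	char buf[8] = { 0 };
	bool was_fast = false;
	ssize_t n;
	int fd = mkstemp(path);

	assert(fd >= 0);		/* environment setup must succeed */
	unlink(path);			/* file lives on until close() */
	assert(write(fd, "hello", 5) == 5);

	n = read_request(fd, buf, sizeof(buf) - 1, 0, &was_fast);
	close(fd);
	return (n == 5 && memcmp(buf, "hello", 5) == 0) ? 0 : -1;
}
```

Whether the fast path is actually taken depends on the kernel and filesystem;
the sketch treats every failure of the fast path as a miss so the caller never
has to care.
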
>
>
> Version 3 highlights:
>  - Down to 2 syscalls from 4; can use fp or argument position.
>  - RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.
>
> Version 2 highlights:
>  - Put the flags argument into kiocb (less noise), per Al Viro
>  - O_DIRECT checking early in the process, per Jeff Moyer
>  - Resolved duplicate (c&p) code in syscall code, per Jeff
>  - Included perf data in thread cover letter, per Jeff
>  - Created a new flag (not O_NONBLOCK) for readv2, per Jeff
>
>
> Some perf data generated using fio comparing the posix aio engine to a version
> of the posix aio engine that attempts to perform "fast" reads before
> submitting the operations to the queue. This workload is on an ext4 partition on
> raid0 (test / build-rig), simulating our database access pattern using
> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> does the same thing).
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
>     bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>     lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>     lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
>     bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>     lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>     lat (msec) : >=2000=4.33%
> f3:
>     bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
>                  stdev=34526.89
>     lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>     lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>     lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
>    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>          mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
>     bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>     lat (usec) : 2=70.63%, 4=0.01%
>     lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
>     bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>     lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>     lat (msec) : >=2000=9.99%
> f3:
>     bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
>                  stdev=35995.60
>     lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>     lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>     lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%
> total:
>    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>          mint=600020msec, maxt=600178msec
>
> Interpreting the results you can see total bandwidth stays the same but overall
> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> workloads. There is a slight bump in latency for f2 since it's random data that's
> unlikely to be cached but we're always trying "fast read".
>
> In our application we have started keeping track of "fast read" hits/misses
> and for files / requests that have a low hit ratio we don't do "fast reads",
> mostly getting rid of extra latency in the uncached cases.
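
The hit/miss bookkeeping mentioned above could look roughly like the sketch
below. The structure, the function names and both thresholds (MIN_SAMPLES,
MIN_HIT_PCT) are assumptions made up for illustration; they are not taken from
the application or from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-file (or per-request-class) counters. */
struct fast_read_stats {
	unsigned long hits;	/* fast reads served from the page cache */
	unsigned long misses;	/* fast reads that had to be deferred */
};

#define MIN_SAMPLES	64	/* assumed warm-up before trusting the ratio */
#define MIN_HIT_PCT	25	/* assumed cut-off, in percent */

/* Record the outcome of one fast-read attempt. */
static void record_fast_read(struct fast_read_stats *s, bool hit)
{
	assert(s);
	if (hit)
		s->hits++;
	else
		s->misses++;
}

/* Decide whether the next request should attempt a fast read at all. */
static bool should_try_fast_read(const struct fast_read_stats *s)
{
	unsigned long total;

	assert(s);
	total = s->hits + s->misses;
	if (total < MIN_SAMPLES)
		return true;	/* not enough data yet: keep trying */

	/* Integer form of hits / total >= MIN_HIT_PCT / 100. */
	return s->hits * 100 >= total * MIN_HIT_PCT;
}
```

A real implementation would probably also age the counters so that a file
whose working set becomes cached again gets another chance at the fast path.
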
>
> I've performed other benchmarks and I have not observed any perf regressions in
> any of the normal (old) code paths.
>
>
> I have co-developed these changes with Christoph Hellwig.
>
> Milosz Tanski (4):
>   vfs: Prepare for adding a new preadv/pwritev with user flags.
>   vfs: Define new syscalls preadv2,pwritev2
>   vfs: Export new vector IO syscalls (with flags) to userland
>   vfs: RWF_NONBLOCK flag for preadv2
>
>  arch/x86/syscalls/syscall_32.tbl  |   2 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  drivers/target/target_core_file.c |   6 +-
>  fs/cifs/file.c                    |   6 ++
>  fs/nfsd/vfs.c                     |   4 +-
>  fs/ocfs2/file.c                   |   6 ++
>  fs/pipe.c                         |   3 +-
>  fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
>  fs/splice.c                       |   2 +-
>  fs/xfs/xfs_file.c                 |   4 ++
>  include/linux/aio.h               |   2 +
>  include/linux/fs.h                |   7 ++-
>  include/linux/syscalls.h          |   6 ++
>  include/uapi/asm-generic/unistd.h |   6 +-
>  mm/filemap.c                      |  22 ++++++-
>  mm/shmem.c                        |   4 ++
>  16 files changed, 163 insertions(+), 40 deletions(-)
>
> --
> 2.1.0
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only)
@ 2014-10-08  2:53     ` Milosz Tanski
  0 siblings, 0 replies; 167+ messages in thread
From: Milosz Tanski @ 2014-10-08  2:53 UTC (permalink / raw)
  To: LKML
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro

I haven't seen too many replies to this last request so I imagine
there are not too many things to fix, which is good to see.

Additionally, with Jeff's help I wrote up man pages to be ready to go
with these changes.

I'm thinking of changing the preadv2 call to be a preadv6/pwritev6
call for the next submission... so the number represents the argument
count. This should more closely follow the other syscall convention
(like accept4, wait4, dup3).

I'm going to be away for a week. I hope to have the next version (with
the above change) for you guys to review next Wednesday.

Best,
- Milosz

On Wed, Sep 24, 2014 at 5:46 PM, Milosz Tanski <milosz@adfin.com> wrote:
> This patchset introduces the ability to perform a non-blocking read from
> regular files in buffered IO mode. This works only for those filesystems
> that have data in the page cache.
>
> It does this by introducing new syscalls preadv2/pwritev2. These
> new syscalls behave like the network sendmsg, recvmsg syscalls that accept an
> extra flag argument (RWF_NONBLOCK).
>
> It's a very common pattern today (samba, libuv, etc.) to use a large threadpool
> to perform buffered IO operations. They submit the work from another thread
> that performs network IO and epoll or other threads that perform CPU work. This
> leads to increased latency for processing, esp. in the case of data that's
> already cached in the page cache.
>
> With the new interface the applications will now be able to fetch the data in
> their network / cpu bound thread(s) and only defer to a threadpool if it's not
> there. In our own application (VLDB) we've observed a decrease in latency for
> "fast" requests by avoiding unnecessary queuing and having to swap out current
> tasks in IO bound work threads.
>
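A minimal sketch of the "try the page cache first, otherwise defer" pattern
described above. It is an illustration under stated assumptions, not code from
this series: the RWF_NONBLOCK flag proposed here is not available in released
kernels, so the sketch swaps in RWF_NOWAIT, the flag that mainline preadv2(2)
provides (Linux 4.14 and later). The helper name read_cached_or_blocking() is
hypothetical, and the blocking pread() stands in for handing the request to a
threadpool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for the preadv2() declaration */
#endif
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: RWF_NOWAIT replaces the RFC's RWF_NONBLOCK in this sketch.
 * Older libc headers may not define it, so provide the kernel UAPI value. */
#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008
#endif

/*
 * Hypothetical helper: attempt a read that is served only from the page
 * cache; if that is not possible, fall back to an ordinary blocking pread().
 * *was_fast reports which path produced the result. As with any read, the
 * fast path may return fewer bytes than requested.
 */
static ssize_t read_cached_or_blocking(int fd, void *buf, size_t len,
                                       off_t off, bool *was_fast)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    ssize_t n;

    assert(buf != NULL && was_fast != NULL);

    n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0) {
        *was_fast = true;       /* served without waiting for disk IO */
        return n;
    }

    *was_fast = false;
    /* EAGAIN: data not cached. The other codes cover kernels, libcs or
     * filesystems that do not support the flag at all. */
    if (errno != EAGAIN && errno != EOPNOTSUPP &&
        errno != ENOSYS && errno != EINVAL)
        return -1;

    /* Stand-in for "queue this request to the IO threadpool". */
    return pread(fd, buf, len, off);
}
```

In an event-loop server the slow path would submit the request to the worker
pool instead of blocking, so the network thread only ever pays for reads that
are already cached.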
>
> Version 3 highlights:
>  - Down to 2 syscalls from 4; can use fp or argument position.
>  - RWF_NONBLOCK flag value is not the same as O_NONBLOCK, per Jeff.
>
> Version 2 highlights:
>  - Put the flags argument into kiocb (less noise), per Al Viro
>  - O_DIRECT checking early in the process, per Jeff Moyer
>  - Resolved duplicate (c&p) code in syscall code, per Jeff
>  - Included perf data in thread cover letter, per Jeff
>  - Created a new flag (not O_NONBLOCK) for readv2, per Jeff
>
>
> Some perf data generated using fio comparing the posix aio engine to a version
> of the posix AIO engine that attempts to perform "fast" reads before
> submitting the operations to the queue. This workflow is on an ext4 partition on
> raid0 (test / build-rig), simulating our database access pattern workload using
> 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
> does the same thing).
>
> f1: ~73% rand read over mostly cached data (zipf med-size dataset)
> f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
> f3: ~9% seq-read over large dataset
>
> before:
>
> f1:
>     bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
>     lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
>     lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
> f2:
>     bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
>     lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
>     lat (msec) : >=2000=4.33%
> f3:
>     bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
>                  stdev=34526.89
>     lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
>     lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
>     lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
> total:
>    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
>          mint=600001msec, maxt=600113msec
>
> after (with fast read using preadv2 before submit):
>
> f1:
>     bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
>     lat (usec) : 2=70.63%, 4=0.01%
>     lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
> f2:
>     bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
>     lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
>     lat (msec) : >=2000=9.99%
> f3:
>     bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
>                  stdev=35995.60
>     lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
>     lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
>     lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
>     lat (msec) : 100=0.05%, 250=0.02%
> total:
>    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
>          mint=600020msec, maxt=600178msec
>
> Interpreting the results, you can see total bandwidth stays the same but overall
> request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
> workloads. There is a slight bump in latency for f2 since it's random data that's
> unlikely to be cached but we're always trying "fast read".
>
> In our application we have started keeping track of "fast read" hits/misses
> and for files / requests that have a low hit ratio we don't do "fast reads",
> mostly getting rid of extra latency in the uncached cases.
>
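The hit/miss tracking described above could look roughly like the sketch
below. Everything in it is hypothetical: the struct, the function names and
both thresholds are assumptions made for illustration, not the application's
actual bookkeeping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-file (or per-request-class) counters. */
struct fast_read_stats {
    uint64_t hits;      /* fast read was served from the page cache */
    uint64_t misses;    /* fast read failed and the request was queued */
};

/* Assumed tuning knobs; real values would come from measurement. */
enum {
    FAST_READ_MIN_SAMPLES = 64,  /* keep trying until this many attempts */
    FAST_READ_MIN_HIT_PCT = 25   /* below this hit ratio, skip fast reads */
};

/* Record the outcome of one fast read attempt. */
static void fast_read_record(struct fast_read_stats *s, bool hit)
{
    if (hit)
        s->hits++;
    else
        s->misses++;
}

/* Decide whether the next request should attempt a fast read at all. */
static bool fast_read_worthwhile(const struct fast_read_stats *s)
{
    uint64_t total = s->hits + s->misses;

    if (total < FAST_READ_MIN_SAMPLES)
        return true;    /* not enough data yet, keep sampling */

    /* Integer form of: hits / total >= FAST_READ_MIN_HIT_PCT / 100 */
    return s->hits * 100 >= total * FAST_READ_MIN_HIT_PCT;
}
```

Once fast reads are switched off for a file no new samples arrive, so a real
implementation would probably decay the counters or probe occasionally to
notice when the file becomes cached again.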
> I've performed other benchmarks and I have not observed any perf regressions in
> any of the normal (old) code paths.
>
>
> I have co-developed these changes with Christoph Hellwig.
>
> Milosz Tanski (4):
>   vfs: Prepare for adding a new preadv/pwritev with user flags.
>   vfs: Define new syscalls preadv2,pwritev2
>   vfs: Export new vector IO syscalls (with flags) to userland
>   vfs: RWF_NONBLOCK flag for preadv2
>
>  arch/x86/syscalls/syscall_32.tbl  |   2 +
>  arch/x86/syscalls/syscall_64.tbl  |   2 +
>  drivers/target/target_core_file.c |   6 +-
>  fs/cifs/file.c                    |   6 ++
>  fs/nfsd/vfs.c                     |   4 +-
>  fs/ocfs2/file.c                   |   6 ++
>  fs/pipe.c                         |   3 +-
>  fs/read_write.c                   | 121 +++++++++++++++++++++++++++++---------
>  fs/splice.c                       |   2 +-
>  fs/xfs/xfs_file.c                 |   4 ++
>  include/linux/aio.h               |   2 +
>  include/linux/fs.h                |   7 ++-
>  include/linux/syscalls.h          |   6 ++
>  include/uapi/asm-generic/unistd.h |   6 +-
>  mm/filemap.c                      |  22 ++++++-
>  mm/shmem.c                        |   4 ++
>  16 files changed, 163 insertions(+), 40 deletions(-)
>
> --
> 2.1.0
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 167+ messages in thread

end of thread, other threads:[~2014-10-08  2:53 UTC | newest]

Thread overview: 167+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-15 20:20 [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only) Milosz Tanski
2014-09-15 20:20 ` Milosz Tanski
2014-09-15 20:20 ` [PATCH 1/7] Prepare for adding a new readv/writev with user flags Milosz Tanski
2014-09-15 20:20   ` Milosz Tanski
2014-09-15 20:28   ` Al Viro
2014-09-15 21:15     ` Christoph Hellwig
2014-09-15 21:15       ` Christoph Hellwig
2014-09-15 21:44       ` Milosz Tanski
2014-09-15 21:44         ` Milosz Tanski
2014-09-15 20:20 ` [PATCH 2/7] Define new syscalls readv2,preadv2,writev2,pwritev2 Milosz Tanski
2014-09-15 20:20   ` Milosz Tanski
2014-09-16 19:20   ` Jeff Moyer
2014-09-16 19:20     ` Jeff Moyer
2014-09-16 19:54     ` Milosz Tanski
2014-09-16 19:54       ` Milosz Tanski
2014-09-16 21:03     ` Christoph Hellwig
2014-09-16 21:03       ` Christoph Hellwig
2014-09-17 15:43   ` Theodore Ts'o
2014-09-17 15:43     ` Theodore Ts'o
2014-09-17 16:05     ` Milosz Tanski
2014-09-17 16:05       ` Milosz Tanski
2014-09-17 16:59       ` Theodore Ts'o
2014-09-17 16:59         ` Theodore Ts'o
2014-09-17 17:24         ` Zach Brown
2014-09-17 17:24           ` Zach Brown
2014-09-15 20:20 ` [PATCH 3/7] Export new vector IO (with flags) to userland Milosz Tanski
2014-09-15 20:20   ` Milosz Tanski
2014-09-15 20:21 ` [PATCH 4/7] O_NONBLOCK flag for readv2/preadv2 Milosz Tanski
2014-09-15 20:21   ` Milosz Tanski
2014-09-16 19:19   ` Jeff Moyer
2014-09-16 19:19     ` Jeff Moyer
2014-09-16 19:44     ` Milosz Tanski
2014-09-16 19:44       ` Milosz Tanski
2014-09-16 19:53       ` Jeff Moyer
2014-09-16 19:53         ` Jeff Moyer
2014-09-15 20:21 ` [PATCH 5/7] documentation updates Christoph Hellwig
2014-09-15 20:21   ` Christoph Hellwig
2014-09-15 20:21 ` [PATCH 6/7] move flags enforcement to vfs_preadv/vfs_pwritev Christoph Hellwig
2014-09-15 21:15   ` Christoph Hellwig
2014-09-15 21:15     ` Christoph Hellwig
2014-09-15 21:45     ` Milosz Tanski
2014-09-15 21:45       ` Milosz Tanski
2014-09-15 20:22 ` [PATCH 7/7] check for O_NONBLOCK in all read_iter instances Christoph Hellwig
2014-09-15 20:22   ` Christoph Hellwig
2014-09-16 19:27   ` Jeff Moyer
2014-09-16 19:27     ` Jeff Moyer
2014-09-16 19:45     ` Milosz Tanski
2014-09-16 19:45       ` Milosz Tanski
2014-09-16 21:42       ` Dave Chinner
2014-09-16 21:42         ` Dave Chinner
2014-09-17 12:24         ` Benjamin LaHaise
2014-09-17 12:24           ` Benjamin LaHaise
2014-09-17 13:47           ` Theodore Ts'o
2014-09-17 13:47             ` Theodore Ts'o
2014-09-17 13:56             ` Benjamin LaHaise
2014-09-17 13:56               ` Benjamin LaHaise
2014-09-17 15:33               ` Milosz Tanski
2014-09-17 15:33                 ` Milosz Tanski
2014-09-17 15:49                 ` Theodore Ts'o
2014-09-17 15:49                   ` Theodore Ts'o
2014-09-17 15:52               ` Zach Brown
2014-09-17 15:52                 ` Zach Brown
2014-09-16 21:04     ` Christoph Hellwig
2014-09-16 21:04       ` Christoph Hellwig
2014-09-16 21:24       ` Jeff Moyer
2014-09-16 21:24         ` Jeff Moyer
2014-09-15 20:27 ` [RFC PATCH 0/7] Non-blockling buffered fs read (page cache only) Milosz Tanski
2014-09-15 20:27   ` Milosz Tanski
2014-09-15 21:33 ` Andreas Dilger
2014-09-15 22:13   ` Milosz Tanski
2014-09-15 22:13     ` Milosz Tanski
2014-09-15 22:36   ` Elliott, Robert (Server Storage)
2014-09-15 22:36     ` Elliott, Robert (Server Storage)
2014-09-16 18:24     ` Zach Brown
2014-09-16 18:24       ` Zach Brown
2014-09-19 11:21     ` Christoph Hellwig
2014-09-19 11:21       ` Christoph Hellwig
2014-09-22 15:48       ` Jeff Moyer
2014-09-22 15:48         ` Jeff Moyer
2014-09-22 16:32         ` Milosz Tanski
2014-09-22 16:32           ` Milosz Tanski
2014-09-22 16:42           ` Christoph Hellwig
2014-09-22 17:02             ` Milosz Tanski
2014-09-22 17:02               ` Milosz Tanski
2014-09-22 16:25       ` Elliott, Robert (Server Storage)
2014-09-15 21:58 ` Jeff Moyer
2014-09-15 21:58   ` Jeff Moyer
2014-09-15 22:27   ` Milosz Tanski
2014-09-15 22:27     ` Milosz Tanski
2014-09-16 13:44     ` Jeff Moyer
2014-09-16 13:44       ` Jeff Moyer
2014-09-19 11:23   ` Christoph Hellwig
2014-09-19 11:23     ` Christoph Hellwig
2014-09-16 19:30 ` Jeff Moyer
2014-09-16 19:30   ` Jeff Moyer
2014-09-16 20:34   ` Milosz Tanski
2014-09-16 20:34     ` Milosz Tanski
2014-09-16 20:49     ` Jeff Moyer
2014-09-16 20:49       ` Jeff Moyer
2014-09-17 14:49 ` [RFC 1/2] aio: async readahead Benjamin LaHaise
2014-09-17 14:49   ` Benjamin LaHaise
2014-09-17 15:26   ` [RFC 2/2] ext4: async readpage for indirect style inodes Benjamin LaHaise
2014-09-17 15:26     ` Benjamin LaHaise
2014-09-19 11:26   ` [RFC 1/2] aio: async readahead Christoph Hellwig
2014-09-19 11:26     ` Christoph Hellwig
2014-09-19 16:01     ` Benjamin LaHaise
2014-09-19 16:01       ` Benjamin LaHaise
2014-09-17 22:20 ` [RFC v2 0/5] Non-blockling buffered fs read (page cache only) Milosz Tanski
2014-09-17 22:20   ` Milosz Tanski
2014-09-17 22:20   ` [RFC v2 1/5] Prepare for adding a new readv/writev with user flags Milosz Tanski
2014-09-17 22:20     ` Milosz Tanski
2014-09-17 22:20   ` [RFC v2 2/5] Define new syscalls readv2,preadv2,writev2,pwritev2 Milosz Tanski
2014-09-17 22:20     ` Milosz Tanski
2014-09-18 18:48     ` Darrick J. Wong
2014-09-18 18:48       ` Darrick J. Wong
2014-09-19 10:52       ` Christoph Hellwig
2014-09-19 10:52         ` Christoph Hellwig
2014-09-20  0:19         ` Darrick J. Wong
2014-09-20  0:19           ` Darrick J. Wong
2014-09-17 22:20   ` [RFC v2 3/5] Export new vector IO (with flags) to userland Milosz Tanski
2014-09-17 22:20     ` Milosz Tanski
2014-09-17 22:20   ` [RFC v2 4/5] O_NONBLOCK flag for readv2/preadv2 Milosz Tanski
2014-09-17 22:20     ` Milosz Tanski
2014-09-19 11:27     ` Christoph Hellwig
2014-09-19 11:27       ` Christoph Hellwig
2014-09-19 11:59       ` Milosz Tanski
2014-09-19 11:59         ` Milosz Tanski
2014-09-22 17:12     ` Jeff Moyer
2014-09-22 17:12       ` Jeff Moyer
2014-09-17 22:20   ` [RFC v2 5/5] Check for O_NONBLOCK in all read_iter instances Milosz Tanski
2014-09-17 22:20     ` Milosz Tanski
2014-09-19 11:26     ` Christoph Hellwig
2014-09-19 11:26       ` Christoph Hellwig
2014-09-19 14:42   ` [RFC v2 0/5] Non-blockling buffered fs read (page cache only) Jonathan Corbet
2014-09-19 14:42     ` Jonathan Corbet
2014-09-19 16:13     ` Volker Lendecke
2014-09-19 16:13       ` Volker Lendecke
2014-09-19 17:19     ` Milosz Tanski
2014-09-19 17:19       ` Milosz Tanski
2014-09-19 17:33     ` Milosz Tanski
2014-09-19 17:33       ` Milosz Tanski
2014-09-22 14:12       ` Jonathan Corbet
2014-09-22 14:12         ` Jonathan Corbet
2014-09-22 14:24         ` Jeff Moyer
2014-09-22 14:24           ` Jeff Moyer
2014-09-22 14:25         ` Christoph Hellwig
2014-09-22 14:25           ` Christoph Hellwig
2014-09-22 14:30         ` Milosz Tanski
2014-09-22 14:30           ` Milosz Tanski
2014-09-24 21:46 ` [RFC v3 0/4] vfs: " Milosz Tanski
2014-09-24 21:46   ` Milosz Tanski
2014-09-24 21:46   ` [RFC v3 1/4] vfs: Prepare for adding a new preadv/pwritev with user flags Milosz Tanski
2014-09-24 21:46     ` Milosz Tanski
2014-09-24 21:46   ` [RFC v3 2/4] vfs: Define new syscalls preadv2,pwritev2 Milosz Tanski
2014-09-24 21:46     ` Milosz Tanski
2014-09-24 21:46   ` [RFC v3 3/4] vfs: Export new vector IO syscalls (with flags) to userland Milosz Tanski
2014-09-24 21:46     ` Milosz Tanski
2014-09-24 21:46   ` [RFC v3 4/4] vfs: RWF_NONBLOCK flag for preadv2 Milosz Tanski
2014-09-24 21:46     ` Milosz Tanski
2014-09-25  4:06   ` [RFC v3 0/4] vfs: Non-blockling buffered fs read (page cache only) Michael Kerrisk
2014-09-25  4:06     ` Michael Kerrisk
2014-09-25 11:16     ` Jan Kara
2014-09-25 11:16       ` Jan Kara
2014-09-25 15:48     ` Milosz Tanski
2014-09-25 15:48       ` Milosz Tanski
2014-10-08  2:53   ` Milosz Tanski
2014-10-08  2:53     ` Milosz Tanski
