All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-16 18:27 Milosz Tanski
  2015-03-16 18:27   ` Milosz Tanski
                   ` (8 more replies)
  0 siblings, 9 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

This patchset introduces two new syscalls preadv2 and pwritev2. They are the
same syscalls as preadv and pwrite but with a flag argument. Additionally,
preadv2 implements an extra RWF_NONBLOCK flag. 

The RWF_NONBLOCK flag in preadv2 introduces an ability to perform a
non-blocking read from regular files in buffered IO mode. This works by only
for those filesystems that have data in the page cache.

We discussed these changes at this year's LSF/MM summit in Boston. More details
on the Samba use case, the numbers, and presentation is available at this link:
https://lists.samba.org/archive/samba-technical/2015-March/106290.html

Please stayed tune for man pages patches and xfstest patches. They will be sent
as In-Reply-To.


Latest changes highlight:
 - Drops RWF_DSYNC from pwritev2, per Christoph and Andrew
 - Updated man pages
 - Added tests for this functionality to xfstests, per Dave Chinner
 - Based on top of 4.1-rc3
 - Tests / numbers using samba and a CIFS client FIO engine

Forward looking:

 Christoph committed to sending a separate patch series for the RWF_DSYNC for
 pwritev2 implementation so it can be evaluated independently. This helps
 with implementing userspace file servers for protocols that have a per operation
 sync flag (CIFS).

 Additionally, Christoph committed to implementing RWF_NONBLOCK for the write
 case as well (in pwritev2) at a later date.


Background:

 Using a threadpool to emulate non-blocking operations on regular buffered
 files is a common pattern today (samba, libuv, etc...) Applications split the
 work between network bound threads (epoll) and IO threadpool. Not every
 application can use sendfile syscall (TLS / post-processing).

 This common pattern leads to increased request latency. Latency can be due to
 additional synchronization between the threads or fast (cached data) request
 stuck behind slow request (large / uncached data).

 The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
 enqueuing operation in the threadpool if it's already available in the
 pagecache.


Performance numbers (newer Samba):

 https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
 https://docs.google.com/spreadsheets/d/1GGTivi-MfZU0doMzomG4XUo9ioWtRvOGQ5FId042L6s/edit?usp=sharing


Performance number (older):

 Some perf data generated using fio comparing the posix aio engine to a version
 of the posix AIO engine that attempts to performs "fast" reads before
 submitting the operations to the queue. This workflow is on ext4 partition on
 raid0 (test / build-rig.) Simulating our database access patern workload using
 16kb read accesses. Our database uses a home-spun posix aio like queue (samba
 does the same thing.)

 f1: ~73% rand read over mostly cached data (zipf med-size dataset)
 f2: ~18% rand read over mostly un-cached data (uniform large-dataset)
 f3: ~9% seq-read over large dataset

 before:

 f1:
     bw (KB  /s): min=   11, max= 9088, per=0.56%, avg=969.54, stdev=827.99
     lat (msec) : 50=0.01%, 100=1.06%, 250=5.88%, 500=4.08%, 750=12.48%
     lat (msec) : 1000=17.27%, 2000=49.86%, >=2000=9.42%
 f2:
     bw (KB  /s): min=    2, max= 1882, per=0.16%, avg=273.28, stdev=220.26
     lat (msec) : 250=5.65%, 500=3.31%, 750=15.64%, 1000=24.59%, 2000=46.56%
     lat (msec) : >=2000=4.33%
 f3:
     bw (KB  /s): min=    0, max=265568, per=99.95%, avg=174575.10,
                  stdev=34526.89
     lat (usec) : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.27%, 50=10.82%
     lat (usec) : 100=50.34%, 250=5.05%, 500=7.12%, 750=6.60%, 1000=4.55%
     lat (msec) : 2=8.73%, 4=3.49%, 10=1.83%, 20=0.89%, 50=0.22%
     lat (msec) : 100=0.05%, 250=0.02%, 500=0.01%
 total:
    READ: io=102365MB, aggrb=174669KB/s, minb=240KB/s, maxb=173599KB/s,
          mint=600001msec, maxt=600113msec

 after (with fast read using preadv2 before submit):

 f1:
     bw (KB  /s): min=    3, max=14897, per=1.28%, avg=2276.69, stdev=2930.39
     lat (usec) : 2=70.63%, 4=0.01%
     lat (msec) : 250=0.20%, 500=2.26%, 750=1.18%, 2000=0.22%, >=2000=25.53%
 f2:
     bw (KB  /s): min=    2, max= 2362, per=0.14%, avg=249.83, stdev=222.00
     lat (msec) : 250=6.35%, 500=1.78%, 750=9.29%, 1000=20.49%, 2000=52.18%
     lat (msec) : >=2000=9.99%
 f3:
     bw (KB  /s): min=    1, max=245448, per=100.00%, avg=177366.50,
                  stdev=35995.60
     lat (usec) : 2=64.04%, 4=0.01%, 10=0.01%, 20=0.06%, 50=0.43%
     lat (usec) : 100=0.20%, 250=1.27%, 500=2.93%, 750=3.93%, 1000=7.35%
     lat (msec) : 2=14.27%, 4=2.88%, 10=1.54%, 20=0.81%, 50=0.22%
     lat (msec) : 100=0.05%, 250=0.02%
 total:
    READ: io=103941MB, aggrb=177339KB/s, minb=213KB/s, maxb=176375KB/s,
          mint=600020msec, maxt=600178msec

 Interpreting the results you can see total bandwidth stays the same but overall
 request latency is decreased in f1 (random, mostly cached) and f3 (sequential)
 workloads. There is a slight bump in latency for since it's random data that's
 unlikely to be cached but we're always trying "fast read".

 In our application we have starting keeping track of "fast read" hits/misses
 and for files / requests that have a lot hit ratio we don't do "fast reads"
 mostly getting rid of extra latency in the uncached cases. In our real world
 work load we were able to reduce average response time by 20 to 30% (depends
 on amount of IO done by request).

 I've performed other benchmarks and I have no observed any perf regressions in
 any of the normal (old) code paths.


Full change log:

 Version 7 highlight:
  - Drops RWF_DSYNC from pwritev2, per Christoph and Andrew
  - Updated man pages
  - Added tests for this functionality to xfstests, per Dave Chinner
  - Based on top of 4.1-rc3
  - Tests / numbers using samba and a CIFS client FIO engine

 Version 6 highlight:
  - Compat syscall flag checks, per. Jeff.
  - Minor stylistic suggestions.

 Version 5 highlight:
  - XFS support for RWF_NONBLOCK. from Christoph.
  - RWF_DSYNC flag and support for pwritev2, from Christoph.
  - Implemented compat syscalls, per. Jeff.
  - Missing nfs, ceph changes from older patchset.

 Version 4 highlight:
  - Updated for 3.18-rc1.
  - Performance data from our application.
  - First stab at man page with Jeff's help. Patch is in-reply to.

 RFC Version 3 highlights:
  - Down to 2 syscalls from 4; can user fp or argument position.
  - RWF_NONBLOCK value flag is not the same O_NONBLOCK, per Jeff.

 RFC Version 2 highlights:
  - Put the flags argument into kiocb (less noise), per. Al Viro
  - O_DIRECT checking early in the process, per. Jeff Moyer
  - Resolved duplicate (c&p) code in syscall code, per. Jeff
  - Included perf data in thread cover letter, per. Jeff
  - Created a new flag (not O_NONBLOCK) for readv2, perf Jeff


I have co-developed these changes with Christoph Hellwig.


Christoph Hellwig (1):
  xfs: add RWF_NONBLOCK support

Milosz Tanski (4):
  vfs: Prepare for adding a new preadv/pwritev with user flags.
  vfs: Define new syscalls preadv2,pwritev2
  x86: wire up preadv2 and pwritev2
  vfs: RWF_NONBLOCK flag for preadv2

 arch/x86/syscalls/syscall_32.tbl  |   2 +
 arch/x86/syscalls/syscall_64.tbl  |   2 +
 drivers/target/target_core_file.c |   6 +-
 fs/ceph/file.c                    |   2 +
 fs/cifs/file.c                    |   6 +
 fs/nfs/file.c                     |   5 +-
 fs/nfsd/vfs.c                     |   4 +-
 fs/ocfs2/file.c                   |   6 +
 fs/pipe.c                         |   3 +-
 fs/read_write.c                   | 229 +++++++++++++++++++++++++++++---------
 fs/splice.c                       |   2 +-
 fs/xfs/xfs_file.c                 |  28 ++++-
 include/linux/aio.h               |   2 +
 include/linux/compat.h            |   6 +
 include/linux/fs.h                |   6 +-
 include/linux/syscalls.h          |   6 +
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |  23 +++-
 mm/shmem.c                        |   4 +
 19 files changed, 279 insertions(+), 69 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 94+ messages in thread

* [PATCH v7 1/5] vfs: Prepare for adding a new preadv/pwritev with user flags.
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
@ 2015-03-16 18:27   ` Milosz Tanski
  2015-03-16 18:27   ` Milosz Tanski
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

Plumbing the flags argument through the vfs code so they can be passed down to
__generic_file_(read/write)_iter function that do the acctual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/target/target_core_file.c |  6 +++---
 fs/nfsd/vfs.c                     |  4 ++--
 fs/read_write.c                   | 27 +++++++++++++++------------
 fs/splice.c                       |  2 +-
 include/linux/aio.h               |  2 ++
 include/linux/fs.h                |  4 ++--
 6 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 44620fb..fdd0a10 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -351,9 +351,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -534,7 +534,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 3685265..1c6faaa 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -893,7 +893,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -980,7 +980,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 8e1b687..b53bb59 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -711,7 +711,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -720,6 +721,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
 	kiocb.ki_nbytes = len;
+	kiocb.ki_rwflags = flags;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
 	ret = fn(&kiocb, &iter);
@@ -858,7 +860,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -892,7 +895,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -915,27 +918,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -948,7 +951,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -968,7 +971,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -1000,7 +1003,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1024,7 +1027,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1072,7 +1075,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
diff --git a/fs/splice.c b/fs/splice.c
index 7968da9..ee3fd4c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	int			ki_rwflags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..c018335 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1619,9 +1619,9 @@ extern ssize_t __vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 1/5] vfs: Prepare for adding a new preadv/pwritev with user flags.
@ 2015-03-16 18:27   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

Plumbing the flags argument through the vfs code so they can be passed down to
__generic_file_(read/write)_iter function that do the acctual work.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
---
 drivers/target/target_core_file.c |  6 +++---
 fs/nfsd/vfs.c                     |  4 ++--
 fs/read_write.c                   | 27 +++++++++++++++------------
 fs/splice.c                       |  2 +-
 include/linux/aio.h               |  2 ++
 include/linux/fs.h                |  4 ++--
 6 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/target/target_core_file.c b/drivers/target/target_core_file.c
index 44620fb..fdd0a10 100644
--- a/drivers/target/target_core_file.c
+++ b/drivers/target/target_core_file.c
@@ -351,9 +351,9 @@ static int fd_do_rw(struct se_cmd *cmd, struct scatterlist *sgl,
 	set_fs(get_ds());
 
 	if (is_write)
-		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_writev(fd, &iov[0], sgl_nents, &pos, 0);
 	else
-		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos);
+		ret = vfs_readv(fd, &iov[0], sgl_nents, &pos, 0);
 
 	set_fs(old_fs);
 
@@ -534,7 +534,7 @@ fd_execute_write_same(struct se_cmd *cmd)
 
 	old_fs = get_fs();
 	set_fs(get_ds());
-	rc = vfs_writev(f, &iov[0], iov_num, &pos);
+	rc = vfs_writev(f, &iov[0], iov_num, &pos, 0);
 	set_fs(old_fs);
 
 	vfree(iov);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 3685265..1c6faaa 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -893,7 +893,7 @@ __be32 nfsd_readv(struct file *file, loff_t offset, struct kvec *vec, int vlen,
 
 	oldfs = get_fs();
 	set_fs(KERNEL_DS);
-	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset);
+	host_err = vfs_readv(file, (struct iovec __user *)vec, vlen, &offset, 0);
 	set_fs(oldfs);
 	return nfsd_finish_read(file, count, host_err);
 }
@@ -980,7 +980,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 
 	/* Write the data. */
 	oldfs = get_fs(); set_fs(KERNEL_DS);
-	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos);
+	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(oldfs);
 	if (host_err < 0)
 		goto out_nfserr;
diff --git a/fs/read_write.c b/fs/read_write.c
index 8e1b687..b53bb59 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -711,7 +711,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
 EXPORT_SYMBOL(iov_shorten);
 
 static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
-		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
+		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
+		int flags)
 {
 	struct kiocb kiocb;
 	struct iov_iter iter;
@@ -720,6 +721,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
 	init_sync_kiocb(&kiocb, filp);
 	kiocb.ki_pos = *ppos;
 	kiocb.ki_nbytes = len;
+	kiocb.ki_rwflags = flags;
 
 	iov_iter_init(&iter, rw, iov, nr_segs, len);
 	ret = fn(&kiocb, &iter);
@@ -858,7 +860,8 @@ out:
 
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec __user * uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos,
+			       int flags)
 {
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -892,7 +895,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -915,27 +918,27 @@ out:
 }
 
 ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
-		  unsigned long vlen, loff_t *pos)
+		  unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_READ))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
 
-	return do_readv_writev(READ, file, vec, vlen, pos);
+	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_readv);
 
 ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
-		   unsigned long vlen, loff_t *pos)
+		   unsigned long vlen, loff_t *pos, int flags)
 {
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
 
-	return do_readv_writev(WRITE, file, vec, vlen, pos);
+	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
@@ -948,7 +951,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos);
+		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -968,7 +971,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos);
+		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -1000,7 +1003,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos);
+			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1024,7 +1027,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos);
+			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
 		fdput(f);
 	}
 
@@ -1072,7 +1075,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn);
+						pos, iter_fn, 0);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
diff --git a/fs/splice.c b/fs/splice.c
index 7968da9..ee3fd4c 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
 	old_fs = get_fs();
 	set_fs(get_ds());
 	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
+	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
 	set_fs(old_fs);
 
 	return res;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..9c1d499 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -52,6 +52,8 @@ struct kiocb {
 	 * this is the underlying eventfd context to deliver events to.
 	 */
 	struct eventfd_ctx	*ki_eventfd;
+
+	int			ki_rwflags;
 };
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b4d71b5..c018335 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1619,9 +1619,9 @@ extern ssize_t __vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *);
+		unsigned long, loff_t *, int);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 2/5] vfs: Define new syscalls preadv2,pwritev2
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
@ 2015-03-16 18:27   ` Milosz Tanski
  2015-03-16 18:27   ` Milosz Tanski
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

New syscalls that take an flag argument. This change does not add any specific
flags.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c                   | 172 ++++++++++++++++++++++++++++++--------
 include/linux/compat.h            |   6 ++
 include/linux/syscalls.h          |   6 ++
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |   5 +-
 5 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b53bb59..e91f46e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -924,6 +924,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -937,21 +939,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
 
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+			unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -963,15 +967,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -989,10 +993,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
 }
 
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -1003,7 +1006,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -1013,10 +1016,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+			  unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -1027,7 +1029,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -1037,11 +1039,63 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_readv(fd, vec, vlen, flags);
+
+	return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_writev(fd, vec, vlen, flags);
+
+	return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
 			       const struct compat_iovec __user *uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos, int flags)
 {
 	compat_ssize_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -1075,7 +1129,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn, 0);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1099,7 +1153,7 @@ out:
 
 static size_t compat_readv(struct file *file,
 			   const struct compat_iovec __user *vec,
-			   unsigned long vlen, loff_t *pos)
+			   unsigned long vlen, loff_t *pos, int flags)
 {
 	ssize_t ret = -EBADF;
 
@@ -1109,8 +1163,10 @@ static size_t compat_readv(struct file *file,
 	ret = -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		goto out;
+	if (flags & ~0)
+		goto out;
 
-	ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+	ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
 
 out:
 	if (ret > 0)
@@ -1119,9 +1175,9 @@ out:
 	return ret;
 }
 
-COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
-		const struct compat_iovec __user *,vec,
-		compat_ulong_t, vlen)
+static size_t __compat_sys_readv(compat_ulong_t fd,
+				 const struct compat_iovec __user *vec,
+				 compat_ulong_t vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret;
@@ -1130,16 +1186,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 	if (!f.file)
 		return -EBADF;
 	pos = f.file->f_pos;
-	ret = compat_readv(f.file, vec, vlen, &pos);
+	ret = compat_readv(f.file, vec, vlen, &pos, flags);
 	if (ret >= 0)
 		f.file->f_pos = pos;
 	fdput_pos(f);
 	return ret;
+
+}
+
+COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
+		const struct compat_iovec __user *,vec,
+		compat_ulong_t, vlen)
+{
+	return __compat_sys_readv(fd, vec, vlen, 0);
 }
 
 static long __compat_sys_preadv64(unsigned long fd,
 				  const struct compat_iovec __user *vec,
-				  unsigned long vlen, loff_t pos)
+				  unsigned long vlen, loff_t pos, int flags)
 {
 	struct fd f;
 	ssize_t ret;
@@ -1151,7 +1215,7 @@ static long __compat_sys_preadv64(unsigned long fd,
 		return -EBADF;
 	ret = -ESPIPE;
 	if (f.file->f_mode & FMODE_PREAD)
-		ret = compat_readv(f.file, vec, vlen, &pos);
+		ret = compat_readv(f.file, vec, vlen, &pos, flags);
 	fdput(f);
 	return ret;
 }
@@ -1161,7 +1225,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos)
 {
-	return __compat_sys_preadv64(fd, vec, vlen, pos);
+	return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
 }
 #endif
 
@@ -1171,12 +1235,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
-	return __compat_sys_preadv64(fd, vec, vlen, pos);
+	return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
+		const struct compat_iovec __user *,vec,
+		compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
+		int, flags)
+{
+	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+	if (pos == -1)
+		return __compat_sys_readv(fd, vec, vlen, flags);
+
+	return __compat_sys_preadv64(fd, vec, vlen, pos, flags);
 }
 
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
-			    unsigned long vlen, loff_t *pos)
+			    unsigned long vlen, loff_t *pos, int flags)
 {
 	ssize_t ret = -EBADF;
 
@@ -1186,8 +1263,10 @@ static size_t compat_writev(struct file *file,
 	ret = -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		goto out;
+	if (flags & ~0)
+		goto out;
 
-	ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+	ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 
 out:
 	if (ret > 0)
@@ -1196,9 +1275,9 @@ out:
 	return ret;
 }
 
-COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
-		const struct compat_iovec __user *, vec,
-		compat_ulong_t, vlen)
+static size_t __compat_sys_writev(compat_ulong_t fd,
+				  const struct compat_iovec __user* vec,
+				  compat_ulong_t vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret;
@@ -1207,28 +1286,36 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 	if (!f.file)
 		return -EBADF;
 	pos = f.file->f_pos;
-	ret = compat_writev(f.file, vec, vlen, &pos);
+	ret = compat_writev(f.file, vec, vlen, &pos, flags);
 	if (ret >= 0)
 		f.file->f_pos = pos;
 	fdput_pos(f);
 	return ret;
 }
 
+COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
+		const struct compat_iovec __user *, vec,
+		compat_ulong_t, vlen)
+{
+	return __compat_sys_writev(fd, vec, vlen, 0);
+}
+
 static long __compat_sys_pwritev64(unsigned long fd,
 				   const struct compat_iovec __user *vec,
-				   unsigned long vlen, loff_t pos)
+				   unsigned long vlen, loff_t pos, int flags)
 {
 	struct fd f;
 	ssize_t ret;
 
 	if (pos < 0)
 		return -EINVAL;
+
 	f = fdget(fd);
 	if (!f.file)
 		return -EBADF;
 	ret = -ESPIPE;
 	if (f.file->f_mode & FMODE_PWRITE)
-		ret = compat_writev(f.file, vec, vlen, &pos);
+		ret = compat_writev(f.file, vec, vlen, &pos, flags);
 	fdput(f);
 	return ret;
 }
@@ -1238,7 +1325,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos)
 {
-	return __compat_sys_pwritev64(fd, vec, vlen, pos);
+	return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
 }
 #endif
 
@@ -1248,8 +1335,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
-	return __compat_sys_pwritev64(fd, vec, vlen, pos);
+	return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
+		const struct compat_iovec __user *,vec,
+		compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
+{
+	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+	if (pos == -1)
+		return __compat_sys_writev(fd, vec, vlen, flags);
+
+	return __compat_sys_pwritev64(fd, vec, vlen, pos, flags);
 }
+
 #endif
 
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..6e4be9e 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
 asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
 		const struct compat_iovec __user *vec,
 		compat_ulong_t vlen, u32 pos_low, u32 pos_high);
+asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
+		const struct compat_iovec __user *vec,
+		compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
+asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
+		const struct compat_iovec __user *vec,
+		compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
 
 #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
 asmlinkage long compat_sys_preadv64(unsigned long fd,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..f25ed7b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -575,8 +575,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..4d2c4c5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 282
+__SC_COMP(__NR_preadv2, sys_preadv2, compat_sys_preadv2)
+#define __NR_pwritev2 283
+__SC_COMP(__NR_pwritev2, sys_pwritev2, compat_sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -711,7 +715,7 @@ __SYSCALL(__NR_bpf, sys_bpf)
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 284
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index ad72420..7865f64 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1453,6 +1453,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @ppos:	current file position
  * @iter:	data destination
  * @written:	already copied
+ * @flags:	optional flags
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1461,7 +1462,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1732,7 +1733,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
 out:
 	return retval;
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 2/5] vfs: Define new syscalls preadv2,pwritev2
@ 2015-03-16 18:27   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

New syscalls that take an flag argument. This change does not add any specific
flags.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/read_write.c                   | 172 ++++++++++++++++++++++++++++++--------
 include/linux/compat.h            |   6 ++
 include/linux/syscalls.h          |   6 ++
 include/uapi/asm-generic/unistd.h |   6 +-
 mm/filemap.c                      |   5 +-
 5 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b53bb59..e91f46e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -924,6 +924,8 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -937,21 +939,23 @@ ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	if (flags & ~0)
+		return -EINVAL;
 
 	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 }
 
 EXPORT_SYMBOL(vfs_writev);
 
-SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_readv(unsigned long fd, const struct iovec __user *vec,
+			unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+		ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -963,15 +967,15 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen)
+static ssize_t do_writev(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret = -EBADF;
 
 	if (f.file) {
 		loff_t pos = file_pos_read(f.file);
-		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+		ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		if (ret >= 0)
 			file_pos_write(f.file, pos);
 		fdput_pos(f);
@@ -989,10 +993,9 @@ static inline loff_t pos_from_hilo(unsigned long high, unsigned long low)
 	return (((loff_t)high << HALF_LONG_BITS) << HALF_LONG_BITS) | low;
 }
 
-SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_preadv(unsigned long fd, const struct iovec __user *vec,
+			 unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -1003,7 +1006,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PREAD)
-			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
+			ret = vfs_readv(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -1013,10 +1016,9 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
-SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
-		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+static ssize_t do_pwritev(unsigned long fd, const struct iovec __user *vec,
+			  unsigned long vlen, loff_t pos, int flags)
 {
-	loff_t pos = pos_from_hilo(pos_h, pos_l);
 	struct fd f;
 	ssize_t ret = -EBADF;
 
@@ -1027,7 +1029,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	if (f.file) {
 		ret = -ESPIPE;
 		if (f.file->f_mode & FMODE_PWRITE)
-			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
+			ret = vfs_writev(f.file, vec, vlen, &pos, flags);
 		fdput(f);
 	}
 
@@ -1037,11 +1039,63 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
 	return ret;
 }
 
+SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_readv(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen)
+{
+	return do_writev(fd, vec, vlen, 0);
+}
+
+SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_preadv(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(preadv2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_readv(fd, vec, vlen, flags);
+
+	return do_preadv(fd, vec, vlen, pos, flags);
+}
+
+SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	return do_pwritev(fd, vec, vlen, pos, 0);
+}
+
+SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned long, pos_l, unsigned long, pos_h,
+		int, flags)
+{
+	loff_t pos = pos_from_hilo(pos_h, pos_l);
+
+	if (pos == -1)
+		return do_writev(fd, vec, vlen, flags);
+
+	return do_pwritev(fd, vec, vlen, pos, flags);
+}
+
 #ifdef CONFIG_COMPAT
 
 static ssize_t compat_do_readv_writev(int type, struct file *file,
 			       const struct compat_iovec __user *uvector,
-			       unsigned long nr_segs, loff_t *pos)
+			       unsigned long nr_segs, loff_t *pos, int flags)
 {
 	compat_ssize_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
@@ -1075,7 +1129,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 
 	if (iter_fn)
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
-						pos, iter_fn, 0);
+						pos, iter_fn, flags);
 	else if (fnv)
 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
 						pos, fnv);
@@ -1099,7 +1153,7 @@ out:
 
 static size_t compat_readv(struct file *file,
 			   const struct compat_iovec __user *vec,
-			   unsigned long vlen, loff_t *pos)
+			   unsigned long vlen, loff_t *pos, int flags)
 {
 	ssize_t ret = -EBADF;
 
@@ -1109,8 +1163,10 @@ static size_t compat_readv(struct file *file,
 	ret = -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		goto out;
+	if (flags & ~0)
+		goto out;
 
-	ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+	ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
 
 out:
 	if (ret > 0)
@@ -1119,9 +1175,9 @@ out:
 	return ret;
 }
 
-COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
-		const struct compat_iovec __user *,vec,
-		compat_ulong_t, vlen)
+static size_t __compat_sys_readv(compat_ulong_t fd,
+				 const struct compat_iovec __user *vec,
+				 compat_ulong_t vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret;
@@ -1130,16 +1186,24 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
 	if (!f.file)
 		return -EBADF;
 	pos = f.file->f_pos;
-	ret = compat_readv(f.file, vec, vlen, &pos);
+	ret = compat_readv(f.file, vec, vlen, &pos, flags);
 	if (ret >= 0)
 		f.file->f_pos = pos;
 	fdput_pos(f);
 	return ret;
+
+}
+
+COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
+		const struct compat_iovec __user *,vec,
+		compat_ulong_t, vlen)
+{
+	return __compat_sys_readv(fd, vec, vlen, 0);
 }
 
 static long __compat_sys_preadv64(unsigned long fd,
 				  const struct compat_iovec __user *vec,
-				  unsigned long vlen, loff_t pos)
+				  unsigned long vlen, loff_t pos, int flags)
 {
 	struct fd f;
 	ssize_t ret;
@@ -1151,7 +1215,7 @@ static long __compat_sys_preadv64(unsigned long fd,
 		return -EBADF;
 	ret = -ESPIPE;
 	if (f.file->f_mode & FMODE_PREAD)
-		ret = compat_readv(f.file, vec, vlen, &pos);
+		ret = compat_readv(f.file, vec, vlen, &pos, flags);
 	fdput(f);
 	return ret;
 }
@@ -1161,7 +1225,7 @@ COMPAT_SYSCALL_DEFINE4(preadv64, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos)
 {
-	return __compat_sys_preadv64(fd, vec, vlen, pos);
+	return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
 }
 #endif
 
@@ -1171,12 +1235,25 @@ COMPAT_SYSCALL_DEFINE5(preadv, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
-	return __compat_sys_preadv64(fd, vec, vlen, pos);
+	return __compat_sys_preadv64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(preadv2, compat_ulong_t, fd,
+		const struct compat_iovec __user *,vec,
+		compat_ulong_t, vlen, u32, pos_low, u32, pos_high,
+		int, flags)
+{
+	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+	if (pos == -1)
+		return __compat_sys_readv(fd, vec, vlen, flags);
+
+	return __compat_sys_preadv64(fd, vec, vlen, pos, flags);
 }
 
 static size_t compat_writev(struct file *file,
 			    const struct compat_iovec __user *vec,
-			    unsigned long vlen, loff_t *pos)
+			    unsigned long vlen, loff_t *pos, int flags)
 {
 	ssize_t ret = -EBADF;
 
@@ -1186,8 +1263,10 @@ static size_t compat_writev(struct file *file,
 	ret = -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		goto out;
+	if (flags & ~0)
+		goto out;
 
-	ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+	ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos, flags);
 
 out:
 	if (ret > 0)
@@ -1196,9 +1275,9 @@ out:
 	return ret;
 }
 
-COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
-		const struct compat_iovec __user *, vec,
-		compat_ulong_t, vlen)
+static size_t __compat_sys_writev(compat_ulong_t fd,
+				  const struct compat_iovec __user* vec,
+				  compat_ulong_t vlen, int flags)
 {
 	struct fd f = fdget_pos(fd);
 	ssize_t ret;
@@ -1207,28 +1286,36 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
 	if (!f.file)
 		return -EBADF;
 	pos = f.file->f_pos;
-	ret = compat_writev(f.file, vec, vlen, &pos);
+	ret = compat_writev(f.file, vec, vlen, &pos, flags);
 	if (ret >= 0)
 		f.file->f_pos = pos;
 	fdput_pos(f);
 	return ret;
 }
 
+COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
+		const struct compat_iovec __user *, vec,
+		compat_ulong_t, vlen)
+{
+	return __compat_sys_writev(fd, vec, vlen, 0);
+}
+
 static long __compat_sys_pwritev64(unsigned long fd,
 				   const struct compat_iovec __user *vec,
-				   unsigned long vlen, loff_t pos)
+				   unsigned long vlen, loff_t pos, int flags)
 {
 	struct fd f;
 	ssize_t ret;
 
 	if (pos < 0)
 		return -EINVAL;
+
 	f = fdget(fd);
 	if (!f.file)
 		return -EBADF;
 	ret = -ESPIPE;
 	if (f.file->f_mode & FMODE_PWRITE)
-		ret = compat_writev(f.file, vec, vlen, &pos);
+		ret = compat_writev(f.file, vec, vlen, &pos, flags);
 	fdput(f);
 	return ret;
 }
@@ -1238,7 +1325,7 @@ COMPAT_SYSCALL_DEFINE4(pwritev64, unsigned long, fd,
 		const struct compat_iovec __user *,vec,
 		unsigned long, vlen, loff_t, pos)
 {
-	return __compat_sys_pwritev64(fd, vec, vlen, pos);
+	return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
 }
 #endif
 
@@ -1248,8 +1335,21 @@ COMPAT_SYSCALL_DEFINE5(pwritev, compat_ulong_t, fd,
 {
 	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
 
-	return __compat_sys_pwritev64(fd, vec, vlen, pos);
+	return __compat_sys_pwritev64(fd, vec, vlen, pos, 0);
+}
+
+COMPAT_SYSCALL_DEFINE6(pwritev2, compat_ulong_t, fd,
+		const struct compat_iovec __user *,vec,
+		compat_ulong_t, vlen, u32, pos_low, u32, pos_high, int, flags)
+{
+	loff_t pos = ((loff_t)pos_high << 32) | pos_low;
+
+	if (pos == -1)
+		return __compat_sys_writev(fd, vec, vlen, flags);
+
+	return __compat_sys_pwritev64(fd, vec, vlen, pos, flags);
 }
+
 #endif
 
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
diff --git a/include/linux/compat.h b/include/linux/compat.h
index ab25814..6e4be9e 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -340,6 +340,12 @@ asmlinkage ssize_t compat_sys_preadv(compat_ulong_t fd,
 asmlinkage ssize_t compat_sys_pwritev(compat_ulong_t fd,
 		const struct compat_iovec __user *vec,
 		compat_ulong_t vlen, u32 pos_low, u32 pos_high);
+asmlinkage ssize_t compat_sys_preadv2(compat_ulong_t fd,
+		const struct compat_iovec __user *vec,
+		compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
+asmlinkage ssize_t compat_sys_pwritev2(compat_ulong_t fd,
+		const struct compat_iovec __user *vec,
+		compat_ulong_t vlen, u32 pos_low, u32 pos_high, int flags);
 
 #ifdef __ARCH_WANT_COMPAT_SYS_PREADV64
 asmlinkage long compat_sys_preadv64(unsigned long fd,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 76d1e38..f25ed7b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -575,8 +575,14 @@ asmlinkage long sys_pwrite64(unsigned int fd, const char __user *buf,
 			     size_t count, loff_t pos);
 asmlinkage long sys_preadv(unsigned long fd, const struct iovec __user *vec,
 			   unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_preadv2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_pwritev(unsigned long fd, const struct iovec __user *vec,
 			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h);
+asmlinkage long sys_pwritev2(unsigned long fd, const struct iovec __user *vec,
+			    unsigned long vlen, unsigned long pos_l, unsigned long pos_h,
+			    int flags);
 asmlinkage long sys_getcwd(char __user *buf, unsigned long size);
 asmlinkage long sys_mkdir(const char __user *pathname, umode_t mode);
 asmlinkage long sys_chdir(const char __user *filename);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..4d2c4c5 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -213,6 +213,10 @@ __SC_COMP(__NR_pwrite64, sys_pwrite64, compat_sys_pwrite64)
 __SC_COMP(__NR_preadv, sys_preadv, compat_sys_preadv)
 #define __NR_pwritev 70
 __SC_COMP(__NR_pwritev, sys_pwritev, compat_sys_pwritev)
+#define __NR_preadv2 282
+__SC_COMP(__NR_preadv2, sys_preadv2, compat_sys_preadv2)
+#define __NR_pwritev2 283
+__SC_COMP(__NR_pwritev2, sys_pwritev2, compat_sys_pwritev2)
 
 /* fs/sendfile.c */
 #define __NR3264_sendfile 71
@@ -711,7 +715,7 @@ __SYSCALL(__NR_bpf, sys_bpf)
 __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
 
 #undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 284
 
 /*
  * All syscalls below here should go away really,
diff --git a/mm/filemap.c b/mm/filemap.c
index ad72420..7865f64 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1453,6 +1453,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * @ppos:	current file position
  * @iter:	data destination
  * @written:	already copied
+ * @flags:	optional flags
  *
  * This is a generic file read routine, and uses the
  * mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1461,7 +1462,7 @@ static void shrink_readahead_size_eio(struct file *filp,
  * of the logic when it comes to error handling etc.
  */
 static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
-		struct iov_iter *iter, ssize_t written)
+		struct iov_iter *iter, ssize_t written, int flags)
 {
 	struct address_space *mapping = filp->f_mapping;
 	struct inode *inode = mapping->host;
@@ -1732,7 +1733,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 
-	retval = do_generic_file_read(file, ppos, iter, retval);
+	retval = do_generic_file_read(file, ppos, iter, retval, iocb->ki_rwflags);
 out:
 	return retval;
 }
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 3/5] x86: wire up preadv2 and pwritev2
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
@ 2015-03-16 18:27   ` Milosz Tanski
  2015-03-16 18:27   ` Milosz Tanski
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/syscalls/syscall_32.tbl | 2 ++
 arch/x86/syscalls/syscall_64.tbl | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..b37aa9c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,5 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	preadv2			sys_preadv2
+360	i386	pwritev2		sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..3802ebf 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,8 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	preadv2			sys_preadv2
+324	64	pwritev2		sys_pwritev2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 3/5] x86: wire up preadv2 and pwritev2
@ 2015-03-16 18:27   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 arch/x86/syscalls/syscall_32.tbl | 2 ++
 arch/x86/syscalls/syscall_64.tbl | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..b37aa9c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,5 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	preadv2			sys_preadv2
+360	i386	pwritev2		sys_pwritev2
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..3802ebf 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,8 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	preadv2			sys_preadv2
+324	64	pwritev2		sys_pwritev2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 4/5] vfs: RWF_NONBLOCK flag for preadv2
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
@ 2015-03-16 18:27   ` Milosz Tanski
  2015-03-16 18:27   ` Milosz Tanski
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

generic_file_read_iter() supports a new flag RWF_NONBLOCK which says that we
only want to read the data if it's already in the page cache.

Additionally, there are a few filesystems that we have to specifically
bail early if RWF_NONBLOCK because the op would block. Christoph Hellwig
contributed this code.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Sage Weil <sage@redhat.com>
---
 fs/ceph/file.c     |  2 ++
 fs/cifs/file.c     |  6 ++++++
 fs/nfs/file.c      |  5 ++++-
 fs/ocfs2/file.c    |  6 ++++++
 fs/pipe.c          |  3 ++-
 fs/read_write.c    | 44 ++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_file.c  |  4 ++++
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 18 ++++++++++++++++++
 mm/shmem.c         |  4 ++++
 10 files changed, 78 insertions(+), 16 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d533075..78bdde3 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -831,6 +831,8 @@ again:
 	if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
 	    (iocb->ki_filp->f_flags & O_DIRECT) ||
 	    (fi->flags & CEPH_F_SYNC)) {
+		if (iocb->ki_rwflags & O_NONBLOCK)
+			return -EAGAIN;
 
 		dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
 		     inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index a94b3e6..1d16b5a 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3003,6 +3003,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3121,6 +3124,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to);
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index e679d24..58c21d7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t result;
 
-	if (iocb->ki_filp->f_flags & O_DIRECT)
+	if (iocb->ki_filp->f_flags & O_DIRECT) {
+		if (iocb->ki_rwflags & O_NONBLOCK)
+			return -EAGAIN;
 		return nfs_file_direct_read(iocb, to, iocb->ki_pos);
+	}
 
 	dprintk("NFS: read(%pD2, %zu@%lu)\n",
 		iocb->ki_filp,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 46e0d4e..c155752 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2536,6 +2536,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (iocb->ki_rwflags & RWF_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index e91f46e..339477b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -893,14 +893,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
@@ -924,8 +929,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
 		return -EINVAL;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+		return -EAGAIN;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -1127,14 +1134,19 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
@@ -1163,7 +1175,11 @@ static size_t compat_readv(struct file *file,
 	ret = -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		goto out;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
+		goto out;
+
+	ret = -EAGAIN;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
 		goto out;
 
 	ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a2e1cb8..a38ddc1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -280,6 +280,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c018335..fb2de58 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1531,6 +1531,8 @@ struct block_device_operations;
 #define NOMMU_VMFLAGS \
 	(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)
 
+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK 0x00000001
 
 struct iov_iter;
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7865f64..ad789e0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1490,6 +1490,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (flags & RWF_NONBLOCK)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1581,6 +1583,11 @@ page_ok:
 		continue;
 
 page_not_up_to_date:
+		if (flags & RWF_NONBLOCK) {
+			page_cache_release(page);
+			goto would_block;
+		}
+
 		/* Get exclusive access to the page ... */
 		error = lock_page_killable(page);
 		if (unlikely(error))
@@ -1600,6 +1607,12 @@ page_not_up_to_date_locked:
 			goto page_ok;
 		}
 
+		if (flags & RWF_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto would_block;
+		}
+
 readpage:
 		/*
 		 * A previous I/O error may have been due to temporary
@@ -1670,6 +1683,8 @@ no_cached_page:
 		goto readpage;
 	}
 
+would_block:
+	error = -EAGAIN;
 out:
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1702,6 +1717,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		size_t count = iov_iter_count(iter);
 		loff_t size;
 
+		if (iocb->ki_rwflags & RWF_NONBLOCK)
+			return -EAGAIN;
+
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
diff --git a/mm/shmem.c b/mm/shmem.c
index cf2d0ca..c5b78f8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1528,6 +1528,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 4/5] vfs: RWF_NONBLOCK flag for preadv2
@ 2015-03-16 18:27   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

generic_file_read_iter() supports a new flag RWF_NONBLOCK which says that we
only want to read the data if it's already in the page cache.

Additionally, there are a few filesystems that we have to specifically
bail early if RWF_NONBLOCK because the op would block. Christoph Hellwig
contributed this code.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Sage Weil <sage@redhat.com>
---
 fs/ceph/file.c     |  2 ++
 fs/cifs/file.c     |  6 ++++++
 fs/nfs/file.c      |  5 ++++-
 fs/ocfs2/file.c    |  6 ++++++
 fs/pipe.c          |  3 ++-
 fs/read_write.c    | 44 ++++++++++++++++++++++++++++++--------------
 fs/xfs/xfs_file.c  |  4 ++++
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 18 ++++++++++++++++++
 mm/shmem.c         |  4 ++++
 10 files changed, 78 insertions(+), 16 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d533075..78bdde3 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -831,6 +831,8 @@ again:
 	if ((got & (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) == 0 ||
 	    (iocb->ki_filp->f_flags & O_DIRECT) ||
 	    (fi->flags & CEPH_F_SYNC)) {
+		if (iocb->ki_rwflags & O_NONBLOCK)
+			return -EAGAIN;
 
 		dout("aio_sync_read %p %llx.%llx %llu~%u got cap refs on %s\n",
 		     inode, ceph_vinop(inode), iocb->ki_pos, (unsigned)len,
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index a94b3e6..1d16b5a 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3003,6 +3003,9 @@ ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to)
 	struct cifs_readdata *rdata, *tmp;
 	struct list_head rdata_list;
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;
@@ -3121,6 +3124,9 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
 	    ((cifs_sb->mnt_cifs_flags & CIFS_MOUNT_NOPOSIXBRL) == 0))
 		return generic_file_read_iter(iocb, to);
 
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * We need to hold the sem to be sure nobody modifies lock list
 	 * with a brlock that prevents reading.
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index e679d24..58c21d7 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -171,8 +171,11 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 	struct inode *inode = file_inode(iocb->ki_filp);
 	ssize_t result;
 
-	if (iocb->ki_filp->f_flags & O_DIRECT)
+	if (iocb->ki_filp->f_flags & O_DIRECT) {
+		if (iocb->ki_rwflags & O_NONBLOCK)
+			return -EAGAIN;
 		return nfs_file_direct_read(iocb, to, iocb->ki_pos);
+	}
 
 	dprintk("NFS: read(%pD2, %zu@%lu)\n",
 		iocb->ki_filp,
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 46e0d4e..c155752 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2536,6 +2536,12 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 			filp->f_path.dentry->d_name.name,
 			to->nr_segs);	/* GRRRRR */
 
+	/*
+	 * No non-blocking reads for ocfs2 for now.  Might be doable with
+	 * non-blocking cluster lock helpers.
+	 */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
 
 	if (!inode) {
 		ret = -EINVAL;
diff --git a/fs/pipe.c b/fs/pipe.c
index 21981e5..212bf68 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -302,7 +302,8 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (ret)
 				break;
-			if (filp->f_flags & O_NONBLOCK) {
+			if ((filp->f_flags & O_NONBLOCK) ||
+			    (iocb->ki_rwflags & RWF_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
diff --git a/fs/read_write.c b/fs/read_write.c
index e91f46e..339477b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -893,14 +893,19 @@ static ssize_t do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
@@ -924,8 +929,10 @@ ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
 		return -EINVAL;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
+		return -EAGAIN;
 
 	return do_readv_writev(READ, file, vec, vlen, pos, flags);
 }
@@ -1127,14 +1134,19 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
 		file_start_write(file);
 	}
 
-	if (iter_fn)
+	if (iter_fn) {
 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
 						pos, iter_fn, flags);
-	else if (fnv)
-		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
-						pos, fnv);
-	else
-		ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	} else {
+		if (type == READ && (flags & RWF_NONBLOCK))
+			return -EAGAIN;
+
+		if (fnv)
+			ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
+							pos, fnv);
+		else
+			ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);
+	}
 
 	if (type != READ)
 		file_end_write(file);
@@ -1163,7 +1175,11 @@ static size_t compat_readv(struct file *file,
 	ret = -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		goto out;
-	if (flags & ~0)
+	if (flags & ~RWF_NONBLOCK)
+		goto out;
+
+	ret = -EAGAIN;
+	if ((file->f_flags & O_DIRECT) && (flags & RWF_NONBLOCK))
 		goto out;
 
 	ret = compat_do_readv_writev(READ, file, vec, vlen, pos, flags);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a2e1cb8..a38ddc1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -280,6 +280,10 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
+	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c018335..fb2de58 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1531,6 +1531,8 @@ struct block_device_operations;
 #define NOMMU_VMFLAGS \
 	(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)
 
+/* These flags are used for the readv/writev syscalls with flags. */
+#define RWF_NONBLOCK 0x00000001
 
 struct iov_iter;
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7865f64..ad789e0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1490,6 +1490,8 @@ static ssize_t do_generic_file_read(struct file *filp, loff_t *ppos,
 find_page:
 		page = find_get_page(mapping, index);
 		if (!page) {
+			if (flags & RWF_NONBLOCK)
+				goto would_block;
 			page_cache_sync_readahead(mapping,
 					ra, filp,
 					index, last_index - index);
@@ -1581,6 +1583,11 @@ page_ok:
 		continue;
 
 page_not_up_to_date:
+		if (flags & RWF_NONBLOCK) {
+			page_cache_release(page);
+			goto would_block;
+		}
+
 		/* Get exclusive access to the page ... */
 		error = lock_page_killable(page);
 		if (unlikely(error))
@@ -1600,6 +1607,12 @@ page_not_up_to_date_locked:
 			goto page_ok;
 		}
 
+		if (flags & RWF_NONBLOCK) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto would_block;
+		}
+
 readpage:
 		/*
 		 * A previous I/O error may have been due to temporary
@@ -1670,6 +1683,8 @@ no_cached_page:
 		goto readpage;
 	}
 
+would_block:
+	error = -EAGAIN;
 out:
 	ra->prev_pos = prev_index;
 	ra->prev_pos <<= PAGE_CACHE_SHIFT;
@@ -1702,6 +1717,9 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		size_t count = iov_iter_count(iter);
 		loff_t size;
 
+		if (iocb->ki_rwflags & RWF_NONBLOCK)
+			return -EAGAIN;
+
 		if (!count)
 			goto out; /* skip atime */
 		size = i_size_read(inode);
diff --git a/mm/shmem.c b/mm/shmem.c
index cf2d0ca..c5b78f8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1528,6 +1528,10 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/* XXX: should be easily supportable */
+	if (iocb->ki_rwflags & RWF_NONBLOCK)
+		return -EAGAIN;
+
 	/*
 	 * Might this read be for a stacking filesystem?  Then when reading
 	 * holes of a sparse file, we actually need to allocate those pages,
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 5/5] xfs: add RWF_NONBLOCK support
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
@ 2015-03-16 18:27   ` Milosz Tanski
  2015-03-16 18:27   ` Milosz Tanski
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton

From: Christoph Hellwig <hch@lst.de>

Add support for non-blocking reads.  The guts are handled by the generic
code, the only addition is a non-blocking variant of xfs_rw_ilock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a38ddc1..69333a7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -59,6 +59,25 @@ xfs_rw_ilock(
 	xfs_ilock(ip, type);
 }
 
+static inline bool
+xfs_rw_ilock_nowait(
+	struct xfs_inode	*ip,
+	int			type)
+{
+	if (type & XFS_IOLOCK_EXCL) {
+		if (!mutex_trylock(&VFS_I(ip)->i_mutex))
+			return false;
+		if (!xfs_ilock_nowait(ip, type)) {
+			mutex_unlock(&VFS_I(ip)->i_mutex);
+			return false;
+		}
+	} else {
+		if (!xfs_ilock_nowait(ip, type))
+			return false;
+	}
+	return true;
+}
+
 static inline void
 xfs_rw_iunlock(
 	struct xfs_inode	*ip,
@@ -280,10 +299,6 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
-	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
-	if (iocb->ki_rwflags & RWF_NONBLOCK)
-		return -EAGAIN;
-
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
@@ -321,7 +336,14 @@ xfs_file_read_iter(
 	 * This allows the normal direct IO case of no page cache pages to
 	 * proceeed concurrently without serialisation.
 	 */
-	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if (iocb->ki_rwflags & RWF_NONBLOCK) {
+		if (ioflags & XFS_IO_ISDIRECT)
+			return -EAGAIN;
+		if (!xfs_rw_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+			return -EAGAIN;
+	} else {
+		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 	if ((ioflags & XFS_IO_ISDIRECT) && inode->i_mapping->nrpages) {
 		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH v7 5/5] xfs: add RWF_NONBLOCK support
@ 2015-03-16 18:27   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton

From: Christoph Hellwig <hch@lst.de>

Add support for non-blocking reads.  The guts are handled by the generic
code, the only addition is a non-blocking variant of xfs_rw_ilock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a38ddc1..69333a7 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -59,6 +59,25 @@ xfs_rw_ilock(
 	xfs_ilock(ip, type);
 }
 
+static inline bool
+xfs_rw_ilock_nowait(
+	struct xfs_inode	*ip,
+	int			type)
+{
+	if (type & XFS_IOLOCK_EXCL) {
+		if (!mutex_trylock(&VFS_I(ip)->i_mutex))
+			return false;
+		if (!xfs_ilock_nowait(ip, type)) {
+			mutex_unlock(&VFS_I(ip)->i_mutex);
+			return false;
+		}
+	} else {
+		if (!xfs_ilock_nowait(ip, type))
+			return false;
+	}
+	return true;
+}
+
 static inline void
 xfs_rw_iunlock(
 	struct xfs_inode	*ip,
@@ -280,10 +299,6 @@ xfs_file_read_iter(
 
 	XFS_STATS_INC(xs_read_calls);
 
-	/* XXX: need a non-blocking iolock helper, shouldn't be too hard */
-	if (iocb->ki_rwflags & RWF_NONBLOCK)
-		return -EAGAIN;
-
 	if (unlikely(file->f_flags & O_DIRECT))
 		ioflags |= XFS_IO_ISDIRECT;
 	if (file->f_mode & FMODE_NOCMTIME)
@@ -321,7 +336,14 @@ xfs_file_read_iter(
 	 * This allows the normal direct IO case of no page cache pages to
 	 * proceeed concurrently without serialisation.
 	 */
-	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if (iocb->ki_rwflags & RWF_NONBLOCK) {
+		if (ioflags & XFS_IO_ISDIRECT)
+			return -EAGAIN;
+		if (!xfs_rw_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+			return -EAGAIN;
+	} else {
+		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 	if ((ioflags & XFS_IO_ISDIRECT) && inode->i_mapping->nrpages) {
 		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH] Add preadv2/pwritev2 documentation.
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
                   ` (4 preceding siblings ...)
  2015-03-16 18:27   ` Milosz Tanski
@ 2015-03-16 18:32 ` Milosz Tanski
  2015-03-27 16:49   ` Andrew Morton
  2015-03-16 18:34   ` Milosz Tanski
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

New syscalls that are a variation on the preadv/pwritev but support an extra
flag argument.

Signed-off-by: Milosz Tanski <milosz@adfin.com>
Suggested-by: Jeff Moyer <jmoyer@redhat.com>
Fixes: Jeff Moyer <jmoyer@redhat.com>
---
 man2/readv.2 | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 61 insertions(+), 10 deletions(-)

diff --git a/man2/readv.2 b/man2/readv.2
index 756a23f..83265c6 100644
--- a/man2/readv.2
+++ b/man2/readv.2
@@ -45,6 +45,12 @@ readv, writev, preadv, pwritev \- read or write data into multiple buffers
 .sp
 .BI "ssize_t pwritev(int " fd ", const struct iovec *" iov ", int " iovcnt ,
 .BI "                off_t " offset );
+.sp
+.BI "ssize_t preadv2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI "                off_t " offset ", int " flags );
+.sp
+.BI "ssize_t pwritev2(int " fd ", const struct iovec *" iov ", int " iovcnt ,
+.BI "                 off_t " offset ", int " flags );
 .fi
 .sp
 .in -4n
@@ -162,9 +168,9 @@ The
 system call combines the functionality of
 .BR writev ()
 and
-.BR pwrite (2).
+.BR pwrite (2) "."
 It performs the same task as
-.BR writev (),
+.BR writev () ","
 but adds a fourth argument,
 .IR offset ,
 which specifies the file offset at which the output operation
@@ -174,15 +180,41 @@ The file offset is not changed by these system calls.
 The file referred to by
 .I fd
 must be capable of seeking.
+.SS preadv2() and pwritev2()
+
+This pair of system calls has similar functionality to the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, but adds a fifth argument, \fIflags\fP, which modifies the behavior on a per call basis.
+
+Like the
+.BR preadv ()
+and
+.BR pwritev ()
+calls, they accept an \fIoffset\fP argument. Unlike those calls, if the \fIoffset\fP argument is set to -1 then the current file offset is used and updated.
+
+The \fIflags\fP arguments to
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+contains a bitwise OR of one or more of the following flags:
+.TP
+.BR RWF_NONBLOCK " (only " preadv2() " since Linux 3.19)"
+Performs a non-blocking operation for regular files (not sockets) opened in buffered mode (not
+.BR O_DIRECT ")."
+
 .SH RETURN VALUE
 On success,
-.BR readv ()
-and
+.BR readv () ","
 .BR preadv ()
-return the number of bytes read;
-.BR writev ()
 and
+.BR preadv2 ()
+return the number of bytes read;
+.BR writev () ","
 .BR pwritev ()
+and
+.BR pwritev2 ()
 return the number of bytes written.
 On error, \-1 is returned, and \fIerrno\fP is set appropriately.
 .SH ERRORS
@@ -191,12 +223,22 @@ The errors are as given for
 and
 .BR write (2).
 Furthermore,
-.BR preadv ()
-and
+.BR preadv () ","
+.BR preadv2 () ","
 .BR pwritev ()
+and
+.BR pwritev2 ()
 can also fail for the same reasons as
 .BR lseek (2).
-Additionally, the following error is defined:
+Additionally, the following errors are defined:
+.TP
+.B EAGAIN
+The operation would block. This is possible if the file descriptor \fIfd\fP refers to a socket and has been marked nonblocking
+.RB ( O_NONBLOCK ),
+or the operation is a
+.BR preadv2
+and the \fIflags\fP argument is set to
+.BR RWF_NONBLOCK.
 .TP
 .B EINVAL
 The sum of the
@@ -207,12 +249,17 @@ value.
 .TP
 .B EINVAL
 The vector count \fIiovcnt\fP is less than zero or greater than the
-permitted maximum.
+permitted maximum. Or, an unknown flag is specified in \fIflags\fP.
 .SH VERSIONS
 .BR preadv ()
 and
 .BR pwritev ()
 first appeared in Linux 2.6.30; library support was added in glibc 2.10.
+.sp
+.BR preadv2 ()
+and
+.BR pwritev2 ()
+first appeared in Linux 4.1
 .SH CONFORMING TO
 .BR readv (),
 .BR writev ():
@@ -225,6 +272,10 @@ first appeared in Linux 2.6.30; library support was added in glibc 2.10.
 .BR preadv (),
 .BR pwritev ():
 nonstandard, but present also on the modern BSDs.
+.sp
+.BR preadv2 (),
+.BR pwritev2 ():
+nonstandard, Linux extension.
 .SH NOTES
 POSIX.1-2001 allows an implementation to place a limit on
 the number of items that can be passed in
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH] fstests: generic test for preadv2 behavior on linux
@ 2015-03-16 18:34   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

preadv2 is a new syscall introduced that is like preadv2 but with flag
argument. The first use case of this is to let us add a flag to perform a
non-blocking file using the page cache.
---
 src/Makefile           |   2 +-
 src/preadv2-pwritev2.h |  52 +++++++++++++++++
 src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/067      |  85 ++++++++++++++++++++++++++++
 tests/generic/067.out  |   9 +++
 tests/generic/group    |   1 +
 6 files changed, 298 insertions(+), 1 deletion(-)
 create mode 100644 src/preadv2-pwritev2.h
 create mode 100644 src/preadv2.c
 create mode 100755 tests/generic/067
 create mode 100644 tests/generic/067.out

diff --git a/src/Makefile b/src/Makefile
index 4781736..f7d3681 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -19,7 +19,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
 	stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
 	seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
-	renameat2 t_getcwd e4compact
+	renameat2 t_getcwd e4compact preadv2
 
 SUBDIRS =
 
diff --git a/src/preadv2-pwritev2.h b/src/preadv2-pwritev2.h
new file mode 100644
index 0000000..786e524
--- /dev/null
+++ b/src/preadv2-pwritev2.h
@@ -0,0 +1,52 @@
+#ifndef PREADV2_PWRITEV2_H
+#define PREADV2_PWRITEV2_H
+
+#include "global.h"
+
+#ifndef HAVE_PREADV2
+#include <sys/syscall.h>
+
+#if !defined(SYS_preadv2) && defined(__x86_64__)
+#define SYS_preadv2 323
+#define SYS_pwritev2 324
+#endif
+
+#if !defined (SYS_preadv2) && defined(__i386__)
+#define SYS_preadv2 359
+#define SYS_pwritev2 360
+#endif
+
+/* LO_HI_LONG taken from glibc */
+#define LO_HI_LONG(val)							\
+  (off_t) val,                                                          \
+  (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
+
+static inline ssize_t
+preadv2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
+{
+#ifdef SYS_preadv2
+        return syscall(SYS_preadv2, fd, iov, iovcnt, LO_HI_LONG(offset),
+		       flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+static inline ssize_t
+pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
+{
+#ifdef SYS_pwritev2
+        return syscall(SYS_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
+		       flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+#define RWF_NONBLOCK	0x00000001
+#define RWF_DSYNC	0x00000002
+
+#endif /* HAVE_PREADV2 */
+#endif /* PREADV2_PWRITEV2_H */
diff --git a/src/preadv2.c b/src/preadv2.c
new file mode 100644
index 0000000..a4f89b5
--- /dev/null
+++ b/src/preadv2.c
@@ -0,0 +1,150 @@
+/*
+ * Copyright 2014 Red Hat, Inc.  All rights reserved.
+ * Copyright 2015 Milosz Tanski
+ *
+ * License: GPLv2
+ *
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <linux/fs.h> /* for RWF_NONBLOCK */
+
+/*
+ * Once preadv2 is part of the upstream kernel and there is glibc support for
+ * it. We'll add support for preadv2 to xfs_io and this will be unnecessary.
+ */
+#include "preadv2-pwritev2.h"
+
+/*
+ * Test to see if the system call is implemented.  If -EINVAL or -ENOSYS
+ * are returned, consider the call unimplemented.  All other errors are
+ * considered success.
+ *
+ * Returns: 0 if the system call is implemented, 1 if the system call
+ * is not implemented.
+ */
+int
+preadv2_check(int fd)
+{
+	int ret;
+	struct iovec iov[] = {};
+
+	/* 0 length read; just check iof the syscall is there.
+         *
+         * - 0 length iovec
+         * - Position is -1 (eg. use current position)
+         */
+	ret = preadv2(fd, iov, 0, -1, 0);
+
+	if (ret < 0) {
+		if (errno == ENOSYS || errno == EINVAL)
+			return 1;
+	}
+
+	return 0;
+}
+
+void
+usage(char *prog)
+{
+	fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
+	fprintf(stderr, "General arguments:\n");
+	fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
+	fprintf(stderr, "\n");
+	fprintf(stderr, "Open arguments:\n");
+	fprintf(stderr, "  -c Open file with O_CREAT flag\n");
+	fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
+	fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
+	fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
+	fprintf(stderr, "\n");
+	fprintf(stderr, "preadv2 arguments:\n");
+	fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
+	fprintf(stderr, "  -p POS offset file to read at\n");
+	fprintf(stderr, "  -l LEN length of file data to read\n");
+	fprintf(stderr, "\n");
+	fflush(stderr);
+}
+
+int
+main(int argc, char **argv)
+{
+	int fd;
+	int ret;
+	int opt;
+	off_t pos = -1;
+	struct iovec iov = { NULL, 0 };
+	int o_flags = 0;
+	int r_flags = 0;
+	char *filename;
+
+	while ((opt = getopt(argc, argv, "vctdwnp:l:")) != -1) {
+		switch (opt) {
+		case 'v':
+			/*
+			 * See if we were called to check for availability of
+			 * sys_preadv2. STDIN is okay, since we do a zero
+			 * length read (see man 2 read).
+			 */
+			ret = preadv2_check(STDIN_FILENO);
+			exit(ret);
+		case 'c':
+			o_flags |= O_CREAT;
+			break;
+		case 't':
+			o_flags |= O_TRUNC;
+			break;
+		case 'd':
+			o_flags |= O_DIRECT;
+			break;
+		case 'w':
+			o_flags |= O_RDWR;
+			break;
+		case 'n':
+			r_flags |= RWF_NONBLOCK;
+			break;
+		case 'p':
+			pos = atoll(optarg);
+			break;
+		case 'l':
+			iov.iov_len = atoll(optarg);
+			break;
+		default:
+			fprintf(stderr, "invalid option: %c\n", opt);
+			usage(argv[0]);
+			exit(1);
+		}
+	}
+
+	if (optind >= argc) {
+		usage(argv[0]);
+		exit(1);
+	}
+
+	if ((o_flags & O_RDWR) != O_RDWR)
+		o_flags |= O_RDONLY;
+
+	if ((iov.iov_base = malloc(iov.iov_len)) == NULL) {
+		perror("malloc");
+		exit(1);
+	}
+
+	filename = argv[optind];
+	fd = open(filename, o_flags);
+
+	if (fd < 0) {
+		perror("open");
+		exit(1);
+	}
+
+	if ((ret = preadv2(fd, &iov, 1, pos, r_flags)) == -1) {
+		perror("preadv2");
+		exit(ret);
+	}
+
+	free(iov.iov_base);
+	exit(0);
+}
diff --git a/tests/generic/067 b/tests/generic/067
new file mode 100755
index 0000000..4cc58f8
--- /dev/null
+++ b/tests/generic/067
@@ -0,0 +1,85 @@
+#! /bin/bash
+# FS QA Test No. 067
+#
+# Test for the preadv2 syscall
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2015 Milosz Tanski <mtanski@gmail.com>.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    cd /
+    rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_test
+
+# test file we'll be using
+file=$SCRATCH_MNT/067.preadv2.$$
+
+# Create a file:
+# two regions of data and a hole in the middle
+# use O_DIRECT so it's not in the page cache
+echo "create file"
+$XFS_IO_PROG -t -f -d \
+	-c "pwrite 0 1024" \
+	-c "pwrite 2048 1024" \
+	$file > /dev/null
+
+# Make sure it returns EAGAIN on uncached data
+echo "uncached"
+$here/src/preadv2 -n -p 0 -l 1024 $file
+
+# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
+echo "cached"
+$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
+$here/src/preadv2 -n -p 0 -l 1024 $file
+
+# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
+echo "O_DIRECT"
+$here/src/preadv2 -d -n -p 0 -l 1024 $file
+
+# Holes do not block
+echo "holes"
+$here/src/preadv2 -n -p 2048 -l 1024 $file
+
+# EOF behavior (no EAGAIN)
+echo "EOF"
+$here/src/preadv2 -n -p 3072 -l 1 $file
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/067.out b/tests/generic/067.out
new file mode 100644
index 0000000..6e3740f
--- /dev/null
+++ b/tests/generic/067.out
@@ -0,0 +1,9 @@
+QA output created by 067
+create file
+uncached
+preadv2: Resource temporarily unavailable
+cached
+O_DIRECT
+preadv2: Resource temporarily unavailable
+holes
+EOF
diff --git a/tests/generic/group b/tests/generic/group
index e5db772..91c5870 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -69,6 +69,7 @@
 064 auto quick prealloc
 065 metadata auto quick
 066 metadata auto quick
+067 auto quick rw
 068 other auto freeze dangerous stress
 069 rw udf auto quick
 070 attr udf auto quick stress
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 94+ messages in thread

* [PATCH] fstests: generic test for preadv2 behavior on linux
@ 2015-03-16 18:34   ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 18:34 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Andrew Morton

preadv2 is a new syscall introduced that is like preadv2 but with flag
argument. The first use case of this is to let us add a flag to perform a
non-blocking file using the page cache.
---
 src/Makefile           |   2 +-
 src/preadv2-pwritev2.h |  52 +++++++++++++++++
 src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/067      |  85 ++++++++++++++++++++++++++++
 tests/generic/067.out  |   9 +++
 tests/generic/group    |   1 +
 6 files changed, 298 insertions(+), 1 deletion(-)
 create mode 100644 src/preadv2-pwritev2.h
 create mode 100644 src/preadv2.c
 create mode 100755 tests/generic/067
 create mode 100644 tests/generic/067.out

diff --git a/src/Makefile b/src/Makefile
index 4781736..f7d3681 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -19,7 +19,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
 	bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
 	stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
 	seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
-	renameat2 t_getcwd e4compact
+	renameat2 t_getcwd e4compact preadv2
 
 SUBDIRS =
 
diff --git a/src/preadv2-pwritev2.h b/src/preadv2-pwritev2.h
new file mode 100644
index 0000000..786e524
--- /dev/null
+++ b/src/preadv2-pwritev2.h
@@ -0,0 +1,52 @@
+#ifndef PREADV2_PWRITEV2_H
+#define PREADV2_PWRITEV2_H
+
+#include "global.h"
+
+#ifndef HAVE_PREADV2
+#include <sys/syscall.h>
+
+#if !defined(SYS_preadv2) && defined(__x86_64__)
+#define SYS_preadv2 323
+#define SYS_pwritev2 324
+#endif
+
+#if !defined (SYS_preadv2) && defined(__i386__)
+#define SYS_preadv2 359
+#define SYS_pwritev2 360
+#endif
+
+/* LO_HI_LONG taken from glibc */
+#define LO_HI_LONG(val)							\
+  (off_t) val,                                                          \
+  (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
+
+static inline ssize_t
+preadv2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
+{
+#ifdef SYS_preadv2
+        return syscall(SYS_preadv2, fd, iov, iovcnt, LO_HI_LONG(offset),
+		       flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+static inline ssize_t
+pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
+{
+#ifdef SYS_pwritev2
+        return syscall(SYS_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
+		       flags);
+#else
+	errno = ENOSYS;
+	return -1;
+#endif
+}
+
+#define RWF_NONBLOCK	0x00000001
+#define RWF_DSYNC	0x00000002
+
+#endif /* HAVE_PREADV2 */
+#endif /* PREADV2_PWRITEV2_H */
diff --git a/src/preadv2.c b/src/preadv2.c
new file mode 100644
index 0000000..a4f89b5
--- /dev/null
+++ b/src/preadv2.c
@@ -0,0 +1,150 @@
+/*
+ * Copyright 2014 Red Hat, Inc.  All rights reserved.
+ * Copyright 2015 Milosz Tanski
+ *
+ * License: GPLv2
+ *
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <getopt.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <linux/fs.h> /* for RWF_NONBLOCK */
+
+/*
+ * Once preadv2 is part of the upstream kernel and there is glibc support for
+ * it. We'll add support for preadv2 to xfs_io and this will be unnecessary.
+ */
+#include "preadv2-pwritev2.h"
+
+/*
+ * Test to see if the system call is implemented.  If -EINVAL or -ENOSYS
+ * are returned, consider the call unimplemented.  All other errors are
+ * considered success.
+ *
+ * Returns: 0 if the system call is implemented, 1 if the system call
+ * is not implemented.
+ */
+int
+preadv2_check(int fd)
+{
+	int ret;
+	struct iovec iov[] = {};
+
+	/* 0 length read; just check iof the syscall is there.
+         *
+         * - 0 length iovec
+         * - Position is -1 (eg. use current position)
+         */
+	ret = preadv2(fd, iov, 0, -1, 0);
+
+	if (ret < 0) {
+		if (errno == ENOSYS || errno == EINVAL)
+			return 1;
+	}
+
+	return 0;
+}
+
+void
+usage(char *prog)
+{
+	fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
+	fprintf(stderr, "General arguments:\n");
+	fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
+	fprintf(stderr, "\n");
+	fprintf(stderr, "Open arguments:\n");
+	fprintf(stderr, "  -c Open file with O_CREAT flag\n");
+	fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
+	fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
+	fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
+	fprintf(stderr, "\n");
+	fprintf(stderr, "preadv2 arguments:\n");
+	fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
+	fprintf(stderr, "  -p POS offset file to read at\n");
+	fprintf(stderr, "  -l LEN length of file data to read\n");
+	fprintf(stderr, "\n");
+	fflush(stderr);
+}
+
+int
+main(int argc, char **argv)
+{
+	int fd;
+	int ret;
+	int opt;
+	off_t pos = -1;
+	struct iovec iov = { NULL, 0 };
+	int o_flags = 0;
+	int r_flags = 0;
+	char *filename;
+
+	while ((opt = getopt(argc, argv, "vctdwnp:l:")) != -1) {
+		switch (opt) {
+		case 'v':
+			/*
+			 * See if we were called to check for availability of
+			 * sys_preadv2. STDIN is okay, since we do a zero
+			 * length read (see man 2 read).
+			 */
+			ret = preadv2_check(STDIN_FILENO);
+			exit(ret);
+		case 'c':
+			o_flags |= O_CREAT;
+			break;
+		case 't':
+			o_flags |= O_TRUNC;
+			break;
+		case 'd':
+			o_flags |= O_DIRECT;
+			break;
+		case 'w':
+			o_flags |= O_RDWR;
+			break;
+		case 'n':
+			r_flags |= RWF_NONBLOCK;
+			break;
+		case 'p':
+			pos = atoll(optarg);
+			break;
+		case 'l':
+			iov.iov_len = atoll(optarg);
+			break;
+		default:
+			fprintf(stderr, "invalid option: %c\n", opt);
+			usage(argv[0]);
+			exit(1);
+		}
+	}
+
+	if (optind >= argc) {
+		usage(argv[0]);
+		exit(1);
+	}
+
+	if ((o_flags & O_RDWR) != O_RDWR)
+		o_flags |= O_RDONLY;
+
+	if ((iov.iov_base = malloc(iov.iov_len)) == NULL) {
+		perror("malloc");
+		exit(1);
+	}
+
+	filename = argv[optind];
+	fd = open(filename, o_flags);
+
+	if (fd < 0) {
+		perror("open");
+		exit(1);
+	}
+
+	if ((ret = preadv2(fd, &iov, 1, pos, r_flags)) == -1) {
+		perror("preadv2");
+		exit(ret);
+	}
+
+	free(iov.iov_base);
+	exit(0);
+}
diff --git a/tests/generic/067 b/tests/generic/067
new file mode 100755
index 0000000..4cc58f8
--- /dev/null
+++ b/tests/generic/067
@@ -0,0 +1,85 @@
+#! /bin/bash
+# FS QA Test No. 067
+#
+# Test for the preadv2 syscall
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2015 Milosz Tanski <mtanski-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    cd /
+    rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_test
+
+# test file we'll be using
+file=$SCRATCH_MNT/067.preadv2.$$
+
+# Create a file:
+# two regions of data and a hole in the middle
+# use O_DIRECT so it's not in the page cache
+echo "create file"
+$XFS_IO_PROG -t -f -d \
+	-c "pwrite 0 1024" \
+	-c "pwrite 2048 1024" \
+	$file > /dev/null
+
+# Make sure it returns EAGAIN on uncached data
+echo "uncached"
+$here/src/preadv2 -n -p 0 -l 1024 $file
+
+# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
+echo "cached"
+$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
+$here/src/preadv2 -n -p 0 -l 1024 $file
+
+# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
+echo "O_DIRECT"
+$here/src/preadv2 -d -n -p 0 -l 1024 $file
+
+# Holes do not block
+echo "holes"
+$here/src/preadv2 -n -p 2048 -l 1024 $file
+
+# EOF behavior (no EAGAIN)
+echo "EOF"
+$here/src/preadv2 -n -p 3072 -l 1 $file
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/067.out b/tests/generic/067.out
new file mode 100644
index 0000000..6e3740f
--- /dev/null
+++ b/tests/generic/067.out
@@ -0,0 +1,9 @@
+QA output created by 067
+create file
+uncached
+preadv2: Resource temporarily unavailable
+cached
+O_DIRECT
+preadv2: Resource temporarily unavailable
+holes
+EOF
diff --git a/tests/generic/group b/tests/generic/group
index e5db772..91c5870 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -69,6 +69,7 @@
 064 auto quick prealloc
 065 metadata auto quick
 066 metadata auto quick
+067 auto quick rw
 068 other auto freeze dangerous stress
 069 rw udf auto quick
 070 attr udf auto quick stress
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 1/5] vfs: Prepare for adding a new preadv/pwritev with user flags.
  2015-03-16 18:27   ` Milosz Tanski
@ 2015-03-16 21:05     ` Andreas Dilger
  -1 siblings, 0 replies; 94+ messages in thread
From: Andreas Dilger @ 2015-03-16 21:05 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton

On Mar 16, 2015, at 12:27 PM, Milosz Tanski <milosz@adfin.com> wrote:
> 
> Plumbing the flags argument through the vfs code so they can be passed
> down to __generic_file_(read/write)_iter function that do the acctual work.
> 
> Signed-off-by: Milosz Tanski <milosz@adfin.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 8e1b687..b53bb59 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -711,7 +711,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
> EXPORT_SYMBOL(iov_shorten);
> 
> static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
> -		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
> +		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
> +		int flags)

Using "int flags" as an argument is too generic IMHO.  We have sooo many
different "int flags" arguments, but there is no easy way to figure out
which flags are being used.  A better solution is to declare a named enum:

enum iov_iter {
	RWF_NONBLOCK = 0x00000001,	/* only access pages in cache */
};

and use "enum iov_iter flags" as the function argument (or even "iter_flags"
if you wanted to make it that much easier to understand).  That makes
it immediately clear to the reader and the compiler what the valid flag
values are here, and it works with tags, etc.

Thoughts?

Cheers, Andreas


> {
> 	struct kiocb kiocb;
> 	struct iov_iter iter;
> @@ -720,6 +721,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
> 	init_sync_kiocb(&kiocb, filp);
> 	kiocb.ki_pos = *ppos;
> 	kiocb.ki_nbytes = len;
> +	kiocb.ki_rwflags = flags;
> 
> 	iov_iter_init(&iter, rw, iov, nr_segs, len);
> 	ret = fn(&kiocb, &iter);
> @@ -858,7 +860,8 @@ out:
> 
> static ssize_t do_readv_writev(int type, struct file *file,
> 			       const struct iovec __user * uvector,
> -			       unsigned long nr_segs, loff_t *pos)
> +			       unsigned long nr_segs, loff_t *pos,
> +			       int flags)
> {
> 	size_t tot_len;
> 	struct iovec iovstack[UIO_FASTIOV];
> @@ -892,7 +895,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
> 
> 	if (iter_fn)
> 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> -						pos, iter_fn);
> +						pos, iter_fn, flags);
> 	else if (fnv)
> 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> 						pos, fnv);
> @@ -915,27 +918,27 @@ out:
> }
> 
> ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
> -		  unsigned long vlen, loff_t *pos)
> +		  unsigned long vlen, loff_t *pos, int flags)
> {
> 	if (!(file->f_mode & FMODE_READ))
> 		return -EBADF;
> 	if (!(file->f_mode & FMODE_CAN_READ))
> 		return -EINVAL;
> 
> -	return do_readv_writev(READ, file, vec, vlen, pos);
> +	return do_readv_writev(READ, file, vec, vlen, pos, flags);
> }
> 
> EXPORT_SYMBOL(vfs_readv);
> 
> ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
> -		   unsigned long vlen, loff_t *pos)
> +		   unsigned long vlen, loff_t *pos, int flags)
> {
> 	if (!(file->f_mode & FMODE_WRITE))
> 		return -EBADF;
> 	if (!(file->f_mode & FMODE_CAN_WRITE))
> 		return -EINVAL;
> 
> -	return do_readv_writev(WRITE, file, vec, vlen, pos);
> +	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
> }
> 
> EXPORT_SYMBOL(vfs_writev);
> @@ -948,7 +951,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> 
> 	if (f.file) {
> 		loff_t pos = file_pos_read(f.file);
> -		ret = vfs_readv(f.file, vec, vlen, &pos);
> +		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> 		if (ret >= 0)
> 			file_pos_write(f.file, pos);
> 		fdput_pos(f);
> @@ -968,7 +971,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> 
> 	if (f.file) {
> 		loff_t pos = file_pos_read(f.file);
> -		ret = vfs_writev(f.file, vec, vlen, &pos);
> +		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> 		if (ret >= 0)
> 			file_pos_write(f.file, pos);
> 		fdput_pos(f);
> @@ -1000,7 +1003,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> 	if (f.file) {
> 		ret = -ESPIPE;
> 		if (f.file->f_mode & FMODE_PREAD)
> -			ret = vfs_readv(f.file, vec, vlen, &pos);
> +			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> 		fdput(f);
> 	}
> 
> @@ -1024,7 +1027,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> 	if (f.file) {
> 		ret = -ESPIPE;
> 		if (f.file->f_mode & FMODE_PWRITE)
> -			ret = vfs_writev(f.file, vec, vlen, &pos);
> +			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> 		fdput(f);
> 	}
> 
> @@ -1072,7 +1075,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
> 
> 	if (iter_fn)
> 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> -						pos, iter_fn);
> +						pos, iter_fn, 0);
> 	else if (fnv)
> 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> 						pos, fnv);
> diff --git a/fs/splice.c b/fs/splice.c
> index 7968da9..ee3fd4c 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
> 	old_fs = get_fs();
> 	set_fs(get_ds());
> 	/* The cast to a user pointer is valid due to the set_fs() */
> -	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
> +	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
> 	set_fs(old_fs);
> 
> 	return res;
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index d9c92da..9c1d499 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -52,6 +52,8 @@ struct kiocb {
> 	 * this is the underlying eventfd context to deliver events to.
> 	 */
> 	struct eventfd_ctx	*ki_eventfd;
> +
> +	int			ki_rwflags;
> };
> 
> static inline bool is_sync_kiocb(struct kiocb *kiocb)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b4d71b5..c018335 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1619,9 +1619,9 @@ extern ssize_t __vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
> extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> -		unsigned long, loff_t *);
> +		unsigned long, loff_t *, int);
> extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
> -		unsigned long, loff_t *);
> +		unsigned long, loff_t *, int);
> 
> struct super_operations {
>    	struct inode *(*alloc_inode)(struct super_block *sb);
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 1/5] vfs: Prepare for adding a new preadv/pwritev with user flags.
@ 2015-03-16 21:05     ` Andreas Dilger
  0 siblings, 0 replies; 94+ messages in thread
From: Andreas Dilger @ 2015-03-16 21:05 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton

On Mar 16, 2015, at 12:27 PM, Milosz Tanski <milosz@adfin.com> wrote:
> 
> Plumbing the flags argument through the vfs code so they can be passed
> down to __generic_file_(read/write)_iter function that do the acctual work.
> 
> Signed-off-by: Milosz Tanski <milosz@adfin.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 8e1b687..b53bb59 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -711,7 +711,8 @@ unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
> EXPORT_SYMBOL(iov_shorten);
> 
> static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iovec *iov,
> -		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn)
> +		unsigned long nr_segs, size_t len, loff_t *ppos, iter_fn_t fn,
> +		int flags)

Using "int flags" as an argument is too generic IMHO.  We have sooo many
different "int flags" arguments, but there is no easy way to figure out
which flags are being used.  A better solution is to declare a named enum:

enum iov_iter {
	RWF_NONBLOCK = 0x00000001,	/* only access pages in cache */
};

and use "enum iov_iter flags" as the function argument (or even "iter_flags"
if you wanted to make it that much easier to understand).  That makes
it immediately clear to the reader and the compiler what the valid flag
values are here, and it works with tags, etc.

Thoughts?

Cheers, Andreas


> {
> 	struct kiocb kiocb;
> 	struct iov_iter iter;
> @@ -720,6 +721,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, int rw, const struct iove
> 	init_sync_kiocb(&kiocb, filp);
> 	kiocb.ki_pos = *ppos;
> 	kiocb.ki_nbytes = len;
> +	kiocb.ki_rwflags = flags;
> 
> 	iov_iter_init(&iter, rw, iov, nr_segs, len);
> 	ret = fn(&kiocb, &iter);
> @@ -858,7 +860,8 @@ out:
> 
> static ssize_t do_readv_writev(int type, struct file *file,
> 			       const struct iovec __user * uvector,
> -			       unsigned long nr_segs, loff_t *pos)
> +			       unsigned long nr_segs, loff_t *pos,
> +			       int flags)
> {
> 	size_t tot_len;
> 	struct iovec iovstack[UIO_FASTIOV];
> @@ -892,7 +895,7 @@ static ssize_t do_readv_writev(int type, struct file *file,
> 
> 	if (iter_fn)
> 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> -						pos, iter_fn);
> +						pos, iter_fn, flags);
> 	else if (fnv)
> 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> 						pos, fnv);
> @@ -915,27 +918,27 @@ out:
> }
> 
> ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
> -		  unsigned long vlen, loff_t *pos)
> +		  unsigned long vlen, loff_t *pos, int flags)
> {
> 	if (!(file->f_mode & FMODE_READ))
> 		return -EBADF;
> 	if (!(file->f_mode & FMODE_CAN_READ))
> 		return -EINVAL;
> 
> -	return do_readv_writev(READ, file, vec, vlen, pos);
> +	return do_readv_writev(READ, file, vec, vlen, pos, flags);
> }
> 
> EXPORT_SYMBOL(vfs_readv);
> 
> ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
> -		   unsigned long vlen, loff_t *pos)
> +		   unsigned long vlen, loff_t *pos, int flags)
> {
> 	if (!(file->f_mode & FMODE_WRITE))
> 		return -EBADF;
> 	if (!(file->f_mode & FMODE_CAN_WRITE))
> 		return -EINVAL;
> 
> -	return do_readv_writev(WRITE, file, vec, vlen, pos);
> +	return do_readv_writev(WRITE, file, vec, vlen, pos, flags);
> }
> 
> EXPORT_SYMBOL(vfs_writev);
> @@ -948,7 +951,7 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
> 
> 	if (f.file) {
> 		loff_t pos = file_pos_read(f.file);
> -		ret = vfs_readv(f.file, vec, vlen, &pos);
> +		ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> 		if (ret >= 0)
> 			file_pos_write(f.file, pos);
> 		fdput_pos(f);
> @@ -968,7 +971,7 @@ SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
> 
> 	if (f.file) {
> 		loff_t pos = file_pos_read(f.file);
> -		ret = vfs_writev(f.file, vec, vlen, &pos);
> +		ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> 		if (ret >= 0)
> 			file_pos_write(f.file, pos);
> 		fdput_pos(f);
> @@ -1000,7 +1003,7 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
> 	if (f.file) {
> 		ret = -ESPIPE;
> 		if (f.file->f_mode & FMODE_PREAD)
> -			ret = vfs_readv(f.file, vec, vlen, &pos);
> +			ret = vfs_readv(f.file, vec, vlen, &pos, 0);
> 		fdput(f);
> 	}
> 
> @@ -1024,7 +1027,7 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
> 	if (f.file) {
> 		ret = -ESPIPE;
> 		if (f.file->f_mode & FMODE_PWRITE)
> -			ret = vfs_writev(f.file, vec, vlen, &pos);
> +			ret = vfs_writev(f.file, vec, vlen, &pos, 0);
> 		fdput(f);
> 	}
> 
> @@ -1072,7 +1075,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file,
> 
> 	if (iter_fn)
> 		ret = do_iter_readv_writev(file, type, iov, nr_segs, tot_len,
> -						pos, iter_fn);
> +						pos, iter_fn, 0);
> 	else if (fnv)
> 		ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
> 						pos, fnv);
> diff --git a/fs/splice.c b/fs/splice.c
> index 7968da9..ee3fd4c 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -576,7 +576,7 @@ static ssize_t kernel_readv(struct file *file, const struct iovec *vec,
> 	old_fs = get_fs();
> 	set_fs(get_ds());
> 	/* The cast to a user pointer is valid due to the set_fs() */
> -	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos);
> +	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
> 	set_fs(old_fs);
> 
> 	return res;
> diff --git a/include/linux/aio.h b/include/linux/aio.h
> index d9c92da..9c1d499 100644
> --- a/include/linux/aio.h
> +++ b/include/linux/aio.h
> @@ -52,6 +52,8 @@ struct kiocb {
> 	 * this is the underlying eventfd context to deliver events to.
> 	 */
> 	struct eventfd_ctx	*ki_eventfd;
> +
> +	int			ki_rwflags;
> };
> 
> static inline bool is_sync_kiocb(struct kiocb *kiocb)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b4d71b5..c018335 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1619,9 +1619,9 @@ extern ssize_t __vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
> extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
> extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> -		unsigned long, loff_t *);
> +		unsigned long, loff_t *, int);
> extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
> -		unsigned long, loff_t *);
> +		unsigned long, loff_t *, int);
> 
> struct super_operations {
>    	struct inode *(*alloc_inode)(struct super_block *sb);
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
  2015-03-16 18:34   ` Milosz Tanski
@ 2015-03-16 21:07     ` Andreas Dilger
  -1 siblings, 0 replies; 94+ messages in thread
From: Andreas Dilger @ 2015-03-16 21:07 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton


> On Mar 16, 2015, at 12:34 PM, Milosz Tanski <milosz@adfin.com> wrote:
> 
> preadv2 is a new syscall introduced that is like preadv2 but with flag

Sorry, "preadv2 ... is like preadv2"?

> argument. The first use case of this is to let us add a flag to perform a
> non-blocking file using the page cache.

This is also missing a Signed-off-by: line.

Cheers, Andreas
> ---
> src/Makefile           |   2 +-
> src/preadv2-pwritev2.h |  52 +++++++++++++++++
> src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
> tests/generic/067      |  85 ++++++++++++++++++++++++++++
> tests/generic/067.out  |   9 +++
> tests/generic/group    |   1 +
> 6 files changed, 298 insertions(+), 1 deletion(-)
> create mode 100644 src/preadv2-pwritev2.h
> create mode 100644 src/preadv2.c
> create mode 100755 tests/generic/067
> create mode 100644 tests/generic/067.out
> 
> diff --git a/src/Makefile b/src/Makefile
> index 4781736..f7d3681 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -19,7 +19,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
> 	bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
> 	stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
> 	seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
> -	renameat2 t_getcwd e4compact
> +	renameat2 t_getcwd e4compact preadv2
> 
> SUBDIRS =
> 
> diff --git a/src/preadv2-pwritev2.h b/src/preadv2-pwritev2.h
> new file mode 100644
> index 0000000..786e524
> --- /dev/null
> +++ b/src/preadv2-pwritev2.h
> @@ -0,0 +1,52 @@
> +#ifndef PREADV2_PWRITEV2_H
> +#define PREADV2_PWRITEV2_H
> +
> +#include "global.h"
> +
> +#ifndef HAVE_PREADV2
> +#include <sys/syscall.h>
> +
> +#if !defined(SYS_preadv2) && defined(__x86_64__)
> +#define SYS_preadv2 323
> +#define SYS_pwritev2 324
> +#endif
> +
> +#if !defined (SYS_preadv2) && defined(__i386__)
> +#define SYS_preadv2 359
> +#define SYS_pwritev2 360
> +#endif
> +
> +/* LO_HI_LONG taken from glibc */
> +#define LO_HI_LONG(val)							\
> +  (off_t) val,                                                          \
> +  (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
> +
> +static inline ssize_t
> +preadv2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> +{
> +#ifdef SYS_preadv2
> +        return syscall(SYS_preadv2, fd, iov, iovcnt, LO_HI_LONG(offset),
> +		       flags);
> +#else
> +	errno = ENOSYS;
> +	return -1;
> +#endif
> +}
> +
> +static inline ssize_t
> +pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> +{
> +#ifdef SYS_pwritev2
> +        return syscall(SYS_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
> +		       flags);
> +#else
> +	errno = ENOSYS;
> +	return -1;
> +#endif
> +}
> +
> +#define RWF_NONBLOCK	0x00000001
> +#define RWF_DSYNC	0x00000002
> +
> +#endif /* HAVE_PREADV2 */
> +#endif /* PREADV2_PWRITEV2_H */
> diff --git a/src/preadv2.c b/src/preadv2.c
> new file mode 100644
> index 0000000..a4f89b5
> --- /dev/null
> +++ b/src/preadv2.c
> @@ -0,0 +1,150 @@
> +/*
> + * Copyright 2014 Red Hat, Inc.  All rights reserved.
> + * Copyright 2015 Milosz Tanski
> + *
> + * License: GPLv2
> + *
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <getopt.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <linux/fs.h> /* for RWF_NONBLOCK */
> +
> +/*
> + * Once preadv2 is part of the upstream kernel and there is glibc support for
> + * it. We'll add support for preadv2 to xfs_io and this will be unnecessary.
> + */
> +#include "preadv2-pwritev2.h"
> +
> +/*
> + * Test to see if the system call is implemented.  If -EINVAL or -ENOSYS
> + * are returned, consider the call unimplemented.  All other errors are
> + * considered success.
> + *
> + * Returns: 0 if the system call is implemented, 1 if the system call
> + * is not implemented.
> + */
> +int
> +preadv2_check(int fd)
> +{
> +	int ret;
> +	struct iovec iov[] = {};
> +
> +	/* 0 length read; just check iof the syscall is there.
> +         *
> +         * - 0 length iovec
> +         * - Position is -1 (eg. use current position)
> +         */
> +	ret = preadv2(fd, iov, 0, -1, 0);
> +
> +	if (ret < 0) {
> +		if (errno == ENOSYS || errno == EINVAL)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +usage(char *prog)
> +{
> +	fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
> +	fprintf(stderr, "General arguments:\n");
> +	fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "Open arguments:\n");
> +	fprintf(stderr, "  -c Open file with O_CREAT flag\n");
> +	fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
> +	fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
> +	fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "preadv2 arguments:\n");
> +	fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
> +	fprintf(stderr, "  -p POS offset file to read at\n");
> +	fprintf(stderr, "  -l LEN length of file data to read\n");
> +	fprintf(stderr, "\n");
> +	fflush(stderr);
> +}
> +
> +int
> +main(int argc, char **argv)
> +{
> +	int fd;
> +	int ret;
> +	int opt;
> +	off_t pos = -1;
> +	struct iovec iov = { NULL, 0 };
> +	int o_flags = 0;
> +	int r_flags = 0;
> +	char *filename;
> +
> +	while ((opt = getopt(argc, argv, "vctdwnp:l:")) != -1) {
> +		switch (opt) {
> +		case 'v':
> +			/*
> +			 * See if we were called to check for availability of
> +			 * sys_preadv2. STDIN is okay, since we do a zero
> +			 * length read (see man 2 read).
> +			 */
> +			ret = preadv2_check(STDIN_FILENO);
> +			exit(ret);
> +		case 'c':
> +			o_flags |= O_CREAT;
> +			break;
> +		case 't':
> +			o_flags |= O_TRUNC;
> +			break;
> +		case 'd':
> +			o_flags |= O_DIRECT;
> +			break;
> +		case 'w':
> +			o_flags |= O_RDWR;
> +			break;
> +		case 'n':
> +			r_flags |= RWF_NONBLOCK;
> +			break;
> +		case 'p':
> +			pos = atoll(optarg);
> +			break;
> +		case 'l':
> +			iov.iov_len = atoll(optarg);
> +			break;
> +		default:
> +			fprintf(stderr, "invalid option: %c\n", opt);
> +			usage(argv[0]);
> +			exit(1);
> +		}
> +	}
> +
> +	if (optind >= argc) {
> +		usage(argv[0]);
> +		exit(1);
> +	}
> +
> +	if ((o_flags & O_RDWR) != O_RDWR)
> +		o_flags |= O_RDONLY;
> +
> +	if ((iov.iov_base = malloc(iov.iov_len)) == NULL) {
> +		perror("malloc");
> +		exit(1);
> +	}
> +
> +	filename = argv[optind];
> +	fd = open(filename, o_flags);
> +
> +	if (fd < 0) {
> +		perror("open");
> +		exit(1);
> +	}
> +
> +	if ((ret = preadv2(fd, &iov, 1, pos, r_flags)) == -1) {
> +		perror("preadv2");
> +		exit(ret);
> +	}
> +
> +	free(iov.iov_base);
> +	exit(0);
> +}
> diff --git a/tests/generic/067 b/tests/generic/067
> new file mode 100755
> index 0000000..4cc58f8
> --- /dev/null
> +++ b/tests/generic/067
> @@ -0,0 +1,85 @@
> +#! /bin/bash
> +# FS QA Test No. 067
> +#
> +# Test for the preadv2 syscall
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2015 Milosz Tanski <mtanski@gmail.com>.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    cd /
> +    rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_test
> +
> +# test file we'll be using
> +file=$SCRATCH_MNT/067.preadv2.$$
> +
> +# Create a file:
> +# two regions of data and a hole in the middle
> +# use O_DIRECT so it's not in the page cache
> +echo "create file"
> +$XFS_IO_PROG -t -f -d \
> +	-c "pwrite 0 1024" \
> +	-c "pwrite 2048 1024" \
> +	$file > /dev/null
> +
> +# Make sure it returns EAGAIN on uncached data
> +echo "uncached"
> +$here/src/preadv2 -n -p 0 -l 1024 $file
> +
> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
> +echo "cached"
> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
> +$here/src/preadv2 -n -p 0 -l 1024 $file
> +
> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
> +echo "O_DIRECT"
> +$here/src/preadv2 -d -n -p 0 -l 1024 $file
> +
> +# Holes do not block
> +echo "holes"
> +$here/src/preadv2 -n -p 2048 -l 1024 $file
> +
> +# EOF behavior (no EAGAIN)
> +echo "EOF"
> +$here/src/preadv2 -n -p 3072 -l 1 $file
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/067.out b/tests/generic/067.out
> new file mode 100644
> index 0000000..6e3740f
> --- /dev/null
> +++ b/tests/generic/067.out
> @@ -0,0 +1,9 @@
> +QA output created by 067
> +create file
> +uncached
> +preadv2: Resource temporarily unavailable
> +cached
> +O_DIRECT
> +preadv2: Resource temporarily unavailable
> +holes
> +EOF
> diff --git a/tests/generic/group b/tests/generic/group
> index e5db772..91c5870 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -69,6 +69,7 @@
> 064 auto quick prealloc
> 065 metadata auto quick
> 066 metadata auto quick
> +067 auto quick rw
> 068 other auto freeze dangerous stress
> 069 rw udf auto quick
> 070 attr udf auto quick stress
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
@ 2015-03-16 21:07     ` Andreas Dilger
  0 siblings, 0 replies; 94+ messages in thread
From: Andreas Dilger @ 2015-03-16 21:07 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton


> On Mar 16, 2015, at 12:34 PM, Milosz Tanski <milosz@adfin.com> wrote:
> 
> preadv2 is a new syscall introduced that is like preadv2 but with flag

Sorry, "preadv2 ... is like preadv2"?

> argument. The first use case of this is to let us add a flag to perform a
> non-blocking file using the page cache.

This is also missing a Signed-off-by: line.

Cheers, Andreas
> ---
> src/Makefile           |   2 +-
> src/preadv2-pwritev2.h |  52 +++++++++++++++++
> src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
> tests/generic/067      |  85 ++++++++++++++++++++++++++++
> tests/generic/067.out  |   9 +++
> tests/generic/group    |   1 +
> 6 files changed, 298 insertions(+), 1 deletion(-)
> create mode 100644 src/preadv2-pwritev2.h
> create mode 100644 src/preadv2.c
> create mode 100755 tests/generic/067
> create mode 100644 tests/generic/067.out
> 
> diff --git a/src/Makefile b/src/Makefile
> index 4781736..f7d3681 100644
> --- a/src/Makefile
> +++ b/src/Makefile
> @@ -19,7 +19,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
> 	bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
> 	stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
> 	seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
> -	renameat2 t_getcwd e4compact
> +	renameat2 t_getcwd e4compact preadv2
> 
> SUBDIRS =
> 
> diff --git a/src/preadv2-pwritev2.h b/src/preadv2-pwritev2.h
> new file mode 100644
> index 0000000..786e524
> --- /dev/null
> +++ b/src/preadv2-pwritev2.h
> @@ -0,0 +1,52 @@
> +#ifndef PREADV2_PWRITEV2_H
> +#define PREADV2_PWRITEV2_H
> +
> +#include "global.h"
> +
> +#ifndef HAVE_PREADV2
> +#include <sys/syscall.h>
> +
> +#if !defined(SYS_preadv2) && defined(__x86_64__)
> +#define SYS_preadv2 323
> +#define SYS_pwritev2 324
> +#endif
> +
> +#if !defined (SYS_preadv2) && defined(__i386__)
> +#define SYS_preadv2 359
> +#define SYS_pwritev2 360
> +#endif
> +
> +/* LO_HI_LONG taken from glibc */
> +#define LO_HI_LONG(val)							\
> +  (off_t) val,                                                          \
> +  (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
> +
> +static inline ssize_t
> +preadv2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> +{
> +#ifdef SYS_preadv2
> +        return syscall(SYS_preadv2, fd, iov, iovcnt, LO_HI_LONG(offset),
> +		       flags);
> +#else
> +	errno = ENOSYS;
> +	return -1;
> +#endif
> +}
> +
> +static inline ssize_t
> +pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
> +{
> +#ifdef SYS_pwritev2
> +        return syscall(SYS_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
> +		       flags);
> +#else
> +	errno = ENOSYS;
> +	return -1;
> +#endif
> +}
> +
> +#define RWF_NONBLOCK	0x00000001
> +#define RWF_DSYNC	0x00000002
> +
> +#endif /* HAVE_PREADV2 */
> +#endif /* PREADV2_PWRITEV2_H */
> diff --git a/src/preadv2.c b/src/preadv2.c
> new file mode 100644
> index 0000000..a4f89b5
> --- /dev/null
> +++ b/src/preadv2.c
> @@ -0,0 +1,150 @@
> +/*
> + * Copyright 2014 Red Hat, Inc.  All rights reserved.
> + * Copyright 2015 Milosz Tanski
> + *
> + * License: GPLv2
> + *
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <getopt.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <errno.h>
> +#include <linux/fs.h> /* for RWF_NONBLOCK */
> +
> +/*
> + * Once preadv2 is part of the upstream kernel and there is glibc support for
> + * it. We'll add support for preadv2 to xfs_io and this will be unnecessary.
> + */
> +#include "preadv2-pwritev2.h"
> +
> +/*
> + * Test to see if the system call is implemented.  If -EINVAL or -ENOSYS
> + * are returned, consider the call unimplemented.  All other errors are
> + * considered success.
> + *
> + * Returns: 0 if the system call is implemented, 1 if the system call
> + * is not implemented.
> + */
> +int
> +preadv2_check(int fd)
> +{
> +	int ret;
> +	struct iovec iov[] = {};
> +
> +	/* 0 length read; just check iof the syscall is there.
> +         *
> +         * - 0 length iovec
> +         * - Position is -1 (eg. use current position)
> +         */
> +	ret = preadv2(fd, iov, 0, -1, 0);
> +
> +	if (ret < 0) {
> +		if (errno == ENOSYS || errno == EINVAL)
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +usage(char *prog)
> +{
> +	fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
> +	fprintf(stderr, "General arguments:\n");
> +	fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "Open arguments:\n");
> +	fprintf(stderr, "  -c Open file with O_CREAT flag\n");
> +	fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
> +	fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
> +	fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "preadv2 arguments:\n");
> +	fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
> +	fprintf(stderr, "  -p POS offset file to read at\n");
> +	fprintf(stderr, "  -l LEN length of file data to read\n");
> +	fprintf(stderr, "\n");
> +	fflush(stderr);
> +}
> +
> +int
> +main(int argc, char **argv)
> +{
> +	int fd;
> +	int ret;
> +	int opt;
> +	off_t pos = -1;
> +	struct iovec iov = { NULL, 0 };
> +	int o_flags = 0;
> +	int r_flags = 0;
> +	char *filename;
> +
> +	while ((opt = getopt(argc, argv, "vctdwnp:l:")) != -1) {
> +		switch (opt) {
> +		case 'v':
> +			/*
> +			 * See if we were called to check for availability of
> +			 * sys_preadv2. STDIN is okay, since we do a zero
> +			 * length read (see man 2 read).
> +			 */
> +			ret = preadv2_check(STDIN_FILENO);
> +			exit(ret);
> +		case 'c':
> +			o_flags |= O_CREAT;
> +			break;
> +		case 't':
> +			o_flags |= O_TRUNC;
> +			break;
> +		case 'd':
> +			o_flags |= O_DIRECT;
> +			break;
> +		case 'w':
> +			o_flags |= O_RDWR;
> +			break;
> +		case 'n':
> +			r_flags |= RWF_NONBLOCK;
> +			break;
> +		case 'p':
> +			pos = atoll(optarg);
> +			break;
> +		case 'l':
> +			iov.iov_len = atoll(optarg);
> +			break;
> +		default:
> +			fprintf(stderr, "invalid option: %c\n", opt);
> +			usage(argv[0]);
> +			exit(1);
> +		}
> +	}
> +
> +	if (optind >= argc) {
> +		usage(argv[0]);
> +		exit(1);
> +	}
> +
> +	if ((o_flags & O_RDWR) != O_RDWR)
> +		o_flags |= O_RDONLY;
> +
> +	if ((iov.iov_base = malloc(iov.iov_len)) == NULL) {
> +		perror("malloc");
> +		exit(1);
> +	}
> +
> +	filename = argv[optind];
> +	fd = open(filename, o_flags);
> +
> +	if (fd < 0) {
> +		perror("open");
> +		exit(1);
> +	}
> +
> +	if ((ret = preadv2(fd, &iov, 1, pos, r_flags)) == -1) {
> +		perror("preadv2");
> +		exit(ret);
> +	}
> +
> +	free(iov.iov_base);
> +	exit(0);
> +}
> diff --git a/tests/generic/067 b/tests/generic/067
> new file mode 100755
> index 0000000..4cc58f8
> --- /dev/null
> +++ b/tests/generic/067
> @@ -0,0 +1,85 @@
> +#! /bin/bash
> +# FS QA Test No. 067
> +#
> +# Test for the preadv2 syscall
> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2015 Milosz Tanski <mtanski@gmail.com>.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    cd /
> +    rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_test
> +
> +# test file we'll be using
> +file=$SCRATCH_MNT/067.preadv2.$$
> +
> +# Create a file:
> +# two regions of data and a hole in the middle
> +# use O_DIRECT so it's not in the page cache
> +echo "create file"
> +$XFS_IO_PROG -t -f -d \
> +	-c "pwrite 0 1024" \
> +	-c "pwrite 2048 1024" \
> +	$file > /dev/null
> +
> +# Make sure it returns EAGAIN on uncached data
> +echo "uncached"
> +$here/src/preadv2 -n -p 0 -l 1024 $file
> +
> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
> +echo "cached"
> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
> +$here/src/preadv2 -n -p 0 -l 1024 $file
> +
> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
> +echo "O_DIRECT"
> +$here/src/preadv2 -d -n -p 0 -l 1024 $file
> +
> +# Holes do not block
> +echo "holes"
> +$here/src/preadv2 -n -p 2048 -l 1024 $file
> +
> +# EOF behavior (no EAGAIN)
> +echo "EOF"
> +$here/src/preadv2 -n -p 3072 -l 1 $file
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/067.out b/tests/generic/067.out
> new file mode 100644
> index 0000000..6e3740f
> --- /dev/null
> +++ b/tests/generic/067.out
> @@ -0,0 +1,9 @@
> +QA output created by 067
> +create file
> +uncached
> +preadv2: Resource temporarily unavailable
> +cached
> +O_DIRECT
> +preadv2: Resource temporarily unavailable
> +holes
> +EOF
> diff --git a/tests/generic/group b/tests/generic/group
> index e5db772..91c5870 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -69,6 +69,7 @@
> 064 auto quick prealloc
> 065 metadata auto quick
> 066 metadata auto quick
> +067 auto quick rw
> 068 other auto freeze dangerous stress
> 069 rw udf auto quick
> 070 attr udf auto quick stress
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Cheers, Andreas





--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
  2015-03-16 18:34   ` Milosz Tanski
@ 2015-03-16 22:02     ` Dave Chinner
  -1 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2015-03-16 22:02 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Andrew Morton

On Mon, Mar 16, 2015 at 02:34:22PM -0400, Milosz Tanski wrote:
> preadv2 is a new syscall introduced that is like preadv2 but with flag
> argument. The first use case of this is to let us add a flag to perform a
> non-blocking file using the page cache.
> ---
>  src/Makefile           |   2 +-
>  src/preadv2-pwritev2.h |  52 +++++++++++++++++
>  src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++

You should add this syscall to support to xfs_io (in the xfsprogs
package) rather than write a new helper for it. Mainly because:

> +void
> +usage(char *prog)
> +{
> +	fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
> +	fprintf(stderr, "General arguments:\n");
> +	fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "Open arguments:\n");
> +	fprintf(stderr, "  -c Open file with O_CREAT flag\n");
> +	fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
> +	fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
> +	fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "preadv2 arguments:\n");
> +	fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
> +	fprintf(stderr, "  -p POS offset file to read at\n");
> +	fprintf(stderr, "  -l LEN length of file data to read\n");

The xfs_io pread command already supports all of these functions
except for the RWF_NONBLOCK flag, and anyone testing bleeding edge
functionality is also using a bleeding edge xfs_io binary.

Then you test for whether the functionality is available via
_require_xfs_io_command "preadv -n"

.....
> +# test file we'll be using
> +file=$SCRATCH_MNT/067.preadv2.$$
> +
> +# Create a file:
> +# two regions of data and a hole in the middle
> +# use O_DIRECT so it's not in the page cache
> +echo "create file"
> +$XFS_IO_PROG -t -f -d \
> +	-c "pwrite 0 1024" \
> +	-c "pwrite 2048 1024" \
> +	$file > /dev/null

This does not create holes on most filesystems. You'll need to leave
holes of up 64k so that 64k block size filesystem end up with single
block holes in them.

> +# Make sure it returns EAGAIN on uncached data
> +echo "uncached"
> +$here/src/preadv2 -n -p 0 -l 1024 $file

$XFS_IO_PROG -c "pread -n 0 1024" $file | _filter_xfs_io

> +
> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
> +echo "cached"
> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
> +$here/src/preadv2 -n -p 0 -l 1024 $file

$XFS_IO_PROG -c "pread 0 4096" -c "pread -n 0 1024" $file | _filter_xfs_io

> +
> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
> +echo "O_DIRECT"
> +$here/src/preadv2 -d -n -p 0 -l 1024 $file

$XFS_IO_PROG -d -c "pread -n 0 1024" $file | _filter_xfs_io

And so on....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
@ 2015-03-16 22:02     ` Dave Chinner
  0 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2015-03-16 22:02 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Andrew Morton

On Mon, Mar 16, 2015 at 02:34:22PM -0400, Milosz Tanski wrote:
> preadv2 is a new syscall introduced that is like preadv2 but with flag
> argument. The first use case of this is to let us add a flag to perform a
> non-blocking file using the page cache.
> ---
>  src/Makefile           |   2 +-
>  src/preadv2-pwritev2.h |  52 +++++++++++++++++
>  src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++

You should add this syscall to support to xfs_io (in the xfsprogs
package) rather than write a new helper for it. Mainly because:

> +void
> +usage(char *prog)
> +{
> +	fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
> +	fprintf(stderr, "General arguments:\n");
> +	fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "Open arguments:\n");
> +	fprintf(stderr, "  -c Open file with O_CREAT flag\n");
> +	fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
> +	fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
> +	fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
> +	fprintf(stderr, "\n");
> +	fprintf(stderr, "preadv2 arguments:\n");
> +	fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
> +	fprintf(stderr, "  -p POS offset file to read at\n");
> +	fprintf(stderr, "  -l LEN length of file data to read\n");

The xfs_io pread command already supports all of these functions
except for the RWF_NONBLOCK flag, and anyone testing bleeding edge
functionality is also using a bleeding edge xfs_io binary.

Then you test for whether the functionality is available via
_require_xfs_io_command "preadv -n"

.....
> +# test file we'll be using
> +file=$SCRATCH_MNT/067.preadv2.$$
> +
> +# Create a file:
> +# two regions of data and a hole in the middle
> +# use O_DIRECT so it's not in the page cache
> +echo "create file"
> +$XFS_IO_PROG -t -f -d \
> +	-c "pwrite 0 1024" \
> +	-c "pwrite 2048 1024" \
> +	$file > /dev/null

This does not create holes on most filesystems. You'll need to leave
holes of up 64k so that 64k block size filesystem end up with single
block holes in them.

> +# Make sure it returns EAGAIN on uncached data
> +echo "uncached"
> +$here/src/preadv2 -n -p 0 -l 1024 $file

$XFS_IO_PROG -c "pread -n 0 1024" $file | _filter_xfs_io

> +
> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
> +echo "cached"
> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
> +$here/src/preadv2 -n -p 0 -l 1024 $file

$XFS_IO_PROG -c "pread 0 4096" -c "pread -n 0 1024" $file | _filter_xfs_io

> +
> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
> +echo "O_DIRECT"
> +$here/src/preadv2 -d -n -p 0 -l 1024 $file

$XFS_IO_PROG -d -c "pread -n 0 1024" $file | _filter_xfs_io

And so on....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
  2015-03-16 21:07     ` Andreas Dilger
  (?)
@ 2015-03-16 22:03     ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 22:03 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner,
	Andrew Morton

On Mon, Mar 16, 2015 at 5:07 PM, Andreas Dilger <adilger@dilger.ca> wrote:
>
>> On Mar 16, 2015, at 12:34 PM, Milosz Tanski <milosz@adfin.com> wrote:
>>
>> preadv2 is a new syscall introduced that is like preadv2 but with flag
>
> Sorry, "preadv2 ... is like preadv2"?

I already have a fix for in my branch. Robert Elliott was the first
one to notice that (via private email).

>
>> argument. The first use case of this is to let us add a flag to perform a
>> non-blocking file using the page cache.
>
> This is also missing a Signed-off-by: line.

Good catch. I'm going to fix the above to issues, add a pre-test check
for preadv2 (I just noticed it's missing) and I'm going to resend this
patch.

>
> Cheers, Andreas
>> ---
>> src/Makefile           |   2 +-
>> src/preadv2-pwritev2.h |  52 +++++++++++++++++
>> src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
>> tests/generic/067      |  85 ++++++++++++++++++++++++++++
>> tests/generic/067.out  |   9 +++
>> tests/generic/group    |   1 +
>> 6 files changed, 298 insertions(+), 1 deletion(-)
>> create mode 100644 src/preadv2-pwritev2.h
>> create mode 100644 src/preadv2.c
>> create mode 100755 tests/generic/067
>> create mode 100644 tests/generic/067.out
>>
>> diff --git a/src/Makefile b/src/Makefile
>> index 4781736..f7d3681 100644
>> --- a/src/Makefile
>> +++ b/src/Makefile
>> @@ -19,7 +19,7 @@ LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
>>       bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
>>       stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
>>       seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
>> -     renameat2 t_getcwd e4compact
>> +     renameat2 t_getcwd e4compact preadv2
>>
>> SUBDIRS =
>>
>> diff --git a/src/preadv2-pwritev2.h b/src/preadv2-pwritev2.h
>> new file mode 100644
>> index 0000000..786e524
>> --- /dev/null
>> +++ b/src/preadv2-pwritev2.h
>> @@ -0,0 +1,52 @@
>> +#ifndef PREADV2_PWRITEV2_H
>> +#define PREADV2_PWRITEV2_H
>> +
>> +#include "global.h"
>> +
>> +#ifndef HAVE_PREADV2
>> +#include <sys/syscall.h>
>> +
>> +#if !defined(SYS_preadv2) && defined(__x86_64__)
>> +#define SYS_preadv2 323
>> +#define SYS_pwritev2 324
>> +#endif
>> +
>> +#if !defined (SYS_preadv2) && defined(__i386__)
>> +#define SYS_preadv2 359
>> +#define SYS_pwritev2 360
>> +#endif
>> +
>> +/* LO_HI_LONG taken from glibc */
>> +#define LO_HI_LONG(val)                                                      \
>> +  (off_t) val,                                                          \
>> +  (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >> (sizeof (long) * 4))
>> +
>> +static inline ssize_t
>> +preadv2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
>> +{
>> +#ifdef SYS_preadv2
>> +        return syscall(SYS_preadv2, fd, iov, iovcnt, LO_HI_LONG(offset),
>> +                    flags);
>> +#else
>> +     errno = ENOSYS;
>> +     return -1;
>> +#endif
>> +}
>> +
>> +static inline ssize_t
>> +pwritev2(int fd, const struct iovec *iov, int iovcnt, off_t offset, int flags)
>> +{
>> +#ifdef SYS_pwritev2
>> +        return syscall(SYS_pwritev2, fd, iov, iovcnt, LO_HI_LONG(offset),
>> +                    flags);
>> +#else
>> +     errno = ENOSYS;
>> +     return -1;
>> +#endif
>> +}
>> +
>> +#define RWF_NONBLOCK 0x00000001
>> +#define RWF_DSYNC    0x00000002
>> +
>> +#endif /* HAVE_PREADV2 */
>> +#endif /* PREADV2_PWRITEV2_H */
>> diff --git a/src/preadv2.c b/src/preadv2.c
>> new file mode 100644
>> index 0000000..a4f89b5
>> --- /dev/null
>> +++ b/src/preadv2.c
>> @@ -0,0 +1,150 @@
>> +/*
>> + * Copyright 2014 Red Hat, Inc.  All rights reserved.
>> + * Copyright 2015 Milosz Tanski
>> + *
>> + * License: GPLv2
>> + *
>> + */
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <getopt.h>
>> +#include <string.h>
>> +#include <unistd.h>
>> +#include <errno.h>
>> +#include <linux/fs.h> /* for RWF_NONBLOCK */
>> +
>> +/*
>> + * Once preadv2 is part of the upstream kernel and there is glibc support for
>> + * it. We'll add support for preadv2 to xfs_io and this will be unnecessary.
>> + */
>> +#include "preadv2-pwritev2.h"
>> +
>> +/*
>> + * Test to see if the system call is implemented.  If -EINVAL or -ENOSYS
>> + * are returned, consider the call unimplemented.  All other errors are
>> + * considered success.
>> + *
>> + * Returns: 0 if the system call is implemented, 1 if the system call
>> + * is not implemented.
>> + */
>> +int
>> +preadv2_check(int fd)
>> +{
>> +     int ret;
>> +     struct iovec iov[] = {};
>> +
>> +     /* 0 length read; just check iof the syscall is there.
>> +         *
>> +         * - 0 length iovec
>> +         * - Position is -1 (eg. use current position)
>> +         */
>> +     ret = preadv2(fd, iov, 0, -1, 0);
>> +
>> +     if (ret < 0) {
>> +             if (errno == ENOSYS || errno == EINVAL)
>> +                     return 1;
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>> +void
>> +usage(char *prog)
>> +{
>> +     fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
>> +     fprintf(stderr, "General arguments:\n");
>> +     fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
>> +     fprintf(stderr, "\n");
>> +     fprintf(stderr, "Open arguments:\n");
>> +     fprintf(stderr, "  -c Open file with O_CREAT flag\n");
>> +     fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
>> +     fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
>> +     fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
>> +     fprintf(stderr, "\n");
>> +     fprintf(stderr, "preadv2 arguments:\n");
>> +     fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
>> +     fprintf(stderr, "  -p POS offset file to read at\n");
>> +     fprintf(stderr, "  -l LEN length of file data to read\n");
>> +     fprintf(stderr, "\n");
>> +     fflush(stderr);
>> +}
>> +
>> +int
>> +main(int argc, char **argv)
>> +{
>> +     int fd;
>> +     int ret;
>> +     int opt;
>> +     off_t pos = -1;
>> +     struct iovec iov = { NULL, 0 };
>> +     int o_flags = 0;
>> +     int r_flags = 0;
>> +     char *filename;
>> +
>> +     while ((opt = getopt(argc, argv, "vctdwnp:l:")) != -1) {
>> +             switch (opt) {
>> +             case 'v':
>> +                     /*
>> +                      * See if we were called to check for availability of
>> +                      * sys_preadv2. STDIN is okay, since we do a zero
>> +                      * length read (see man 2 read).
>> +                      */
>> +                     ret = preadv2_check(STDIN_FILENO);
>> +                     exit(ret);
>> +             case 'c':
>> +                     o_flags |= O_CREAT;
>> +                     break;
>> +             case 't':
>> +                     o_flags |= O_TRUNC;
>> +                     break;
>> +             case 'd':
>> +                     o_flags |= O_DIRECT;
>> +                     break;
>> +             case 'w':
>> +                     o_flags |= O_RDWR;
>> +                     break;
>> +             case 'n':
>> +                     r_flags |= RWF_NONBLOCK;
>> +                     break;
>> +             case 'p':
>> +                     pos = atoll(optarg);
>> +                     break;
>> +             case 'l':
>> +                     iov.iov_len = atoll(optarg);
>> +                     break;
>> +             default:
>> +                     fprintf(stderr, "invalid option: %c\n", opt);
>> +                     usage(argv[0]);
>> +                     exit(1);
>> +             }
>> +     }
>> +
>> +     if (optind >= argc) {
>> +             usage(argv[0]);
>> +             exit(1);
>> +     }
>> +
>> +     if ((o_flags & O_RDWR) != O_RDWR)
>> +             o_flags |= O_RDONLY;
>> +
>> +     if ((iov.iov_base = malloc(iov.iov_len)) == NULL) {
>> +             perror("malloc");
>> +             exit(1);
>> +     }
>> +
>> +     filename = argv[optind];
>> +     fd = open(filename, o_flags);
>> +
>> +     if (fd < 0) {
>> +             perror("open");
>> +             exit(1);
>> +     }
>> +
>> +     if ((ret = preadv2(fd, &iov, 1, pos, r_flags)) == -1) {
>> +             perror("preadv2");
>> +             exit(ret);
>> +     }
>> +
>> +     free(iov.iov_base);
>> +     exit(0);
>> +}
>> diff --git a/tests/generic/067 b/tests/generic/067
>> new file mode 100755
>> index 0000000..4cc58f8
>> --- /dev/null
>> +++ b/tests/generic/067
>> @@ -0,0 +1,85 @@
>> +#! /bin/bash
>> +# FS QA Test No. 067
>> +#
>> +# Test for the preadv2 syscall
>> +#
>> +#-----------------------------------------------------------------------
>> +# Copyright (c) 2015 Milosz Tanski <mtanski@gmail.com>.  All Rights Reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
>> +#-----------------------------------------------------------------------
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1     # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +    cd /
>> +    rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
>> +_supported_fs generic
>> +_supported_os Linux
>> +_require_test
>> +
>> +# test file we'll be using
>> +file=$SCRATCH_MNT/067.preadv2.$$
>> +
>> +# Create a file:
>> +# two regions of data and a hole in the middle
>> +# use O_DIRECT so it's not in the page cache
>> +echo "create file"
>> +$XFS_IO_PROG -t -f -d \
>> +     -c "pwrite 0 1024" \
>> +     -c "pwrite 2048 1024" \
>> +     $file > /dev/null
>> +
>> +# Make sure it returns EAGAIN on uncached data
>> +echo "uncached"
>> +$here/src/preadv2 -n -p 0 -l 1024 $file
>> +
>> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
>> +echo "cached"
>> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
>> +$here/src/preadv2 -n -p 0 -l 1024 $file
>> +
>> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
>> +echo "O_DIRECT"
>> +$here/src/preadv2 -d -n -p 0 -l 1024 $file
>> +
>> +# Holes do not block
>> +echo "holes"
>> +$here/src/preadv2 -n -p 2048 -l 1024 $file
>> +
>> +# EOF behavior (no EAGAIN)
>> +echo "EOF"
>> +$here/src/preadv2 -n -p 3072 -l 1 $file
>> +
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/generic/067.out b/tests/generic/067.out
>> new file mode 100644
>> index 0000000..6e3740f
>> --- /dev/null
>> +++ b/tests/generic/067.out
>> @@ -0,0 +1,9 @@
>> +QA output created by 067
>> +create file
>> +uncached
>> +preadv2: Resource temporarily unavailable
>> +cached
>> +O_DIRECT
>> +preadv2: Resource temporarily unavailable
>> +holes
>> +EOF
>> diff --git a/tests/generic/group b/tests/generic/group
>> index e5db772..91c5870 100644
>> --- a/tests/generic/group
>> +++ b/tests/generic/group
>> @@ -69,6 +69,7 @@
>> 064 auto quick prealloc
>> 065 metadata auto quick
>> 066 metadata auto quick
>> +067 auto quick rw
>> 068 other auto freeze dangerous stress
>> 069 rw udf auto quick
>> 070 attr udf auto quick stress
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> Cheers, Andreas
>
>
>
>
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 5/5] xfs: add RWF_NONBLOCK support
  2015-03-16 18:27   ` Milosz Tanski
  (?)
@ 2015-03-16 22:04   ` Dave Chinner
  -1 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2015-03-16 22:04 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Andrew Morton

On Mon, Mar 16, 2015 at 02:27:15PM -0400, Milosz Tanski wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Add support for non-blocking reads.  The guts are handled by the generic
> code, the only addition is a non-blocking variant of xfs_rw_ilock.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Change looks good, but I haven't tested it, so:

Acked-by: Dave Chinner <david@fromorbit.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
  2015-03-16 22:02     ` Dave Chinner
  (?)
@ 2015-03-16 22:11     ` Milosz Tanski
  2015-03-16 22:56         ` Dave Chinner
  -1 siblings, 1 reply; 94+ messages in thread
From: Milosz Tanski @ 2015-03-16 22:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Andrew Morton

On Mon, Mar 16, 2015 at 6:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Mar 16, 2015 at 02:34:22PM -0400, Milosz Tanski wrote:
>> preadv2 is a new syscall introduced that is like preadv2 but with flag
>> argument. The first use case of this is to let us add a flag to perform a
>> non-blocking file using the page cache.
>> ---
>>  src/Makefile           |   2 +-
>>  src/preadv2-pwritev2.h |  52 +++++++++++++++++
>>  src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
>
> You should add this syscall to support to xfs_io (in the xfsprogs
> package) rather than write a new helper for it. Mainly because:
>
>> +void
>> +usage(char *prog)
>> +{
>> +     fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
>> +     fprintf(stderr, "General arguments:\n");
>> +     fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
>> +     fprintf(stderr, "\n");
>> +     fprintf(stderr, "Open arguments:\n");
>> +     fprintf(stderr, "  -c Open file with O_CREAT flag\n");
>> +     fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
>> +     fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
>> +     fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
>> +     fprintf(stderr, "\n");
>> +     fprintf(stderr, "preadv2 arguments:\n");
>> +     fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
>> +     fprintf(stderr, "  -p POS offset file to read at\n");
>> +     fprintf(stderr, "  -l LEN length of file data to read\n");
>
> The xfs_io pread command already supports all of these functions
> except for the RWF_NONBLOCK flag, and anyone testing bleeding edge
> functionality is also using a bleeding edge xfs_io binary.
>
> Then you test for whether the functionality is available via
> _require_xfs_io_command "preadv -n"
>
> .....
>> +# test file we'll be using
>> +file=$SCRATCH_MNT/067.preadv2.$$
>> +
>> +# Create a file:
>> +# two regions of data and a hole in the middle
>> +# use O_DIRECT so it's not in the page cache
>> +echo "create file"
>> +$XFS_IO_PROG -t -f -d \
>> +     -c "pwrite 0 1024" \
>> +     -c "pwrite 2048 1024" \
>> +     $file > /dev/null
>
> This does not create holes on most filesystems. You'll need to leave
> holes of up 64k so that 64k block size filesystem end up with single
> block holes in them.

Noted and I shall fix this in the next round.

>
>> +# Make sure it returns EAGAIN on uncached data
>> +echo "uncached"
>> +$here/src/preadv2 -n -p 0 -l 1024 $file
>
> $XFS_IO_PROG -c "pread -n 0 1024" $file | _filter_xfs_io
>
>> +
>> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
>> +echo "cached"
>> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
>> +$here/src/preadv2 -n -p 0 -l 1024 $file
>
> $XFS_IO_PROG -c "pread 0 4096" -c "pread -n 0 1024" $file | _filter_xfs_io
>
>> +
>> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
>> +echo "O_DIRECT"
>> +$here/src/preadv2 -d -n -p 0 -l 1024 $file
>
> $XFS_IO_PROG -d -c "pread -n 0 1024" $file | _filter_xfs_io
>
> And so on....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

Dave,

My plan is/was to wait till the main patch makes it into the upstream
linux kernel with the syscall numbers are set in stone. Possibly after
till glibc adds support for them. After that I was going remove my
preadv2 application from xfs_tests and add that functionality to
xfs_io.

With xfs_io living in separate repository I wanted to the case when/if
syscall numbers change (there's a bunch of new syscalls queued around
epoll) of having somebody test against xfs_io that has preadv2 but bad
ids.

As a side note, I did add mlock / munlock support to xfs_io that I'll
send in another patch.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
@ 2015-03-16 22:56         ` Dave Chinner
  0 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2015-03-16 22:56 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Andrew Morton

On Mon, Mar 16, 2015 at 06:11:19PM -0400, Milosz Tanski wrote:
> On Mon, Mar 16, 2015 at 6:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Mar 16, 2015 at 02:34:22PM -0400, Milosz Tanski wrote:
> >> preadv2 is a new syscall introduced that is like preadv2 but with flag
> >> argument. The first use case of this is to let us add a flag to perform a
> >> non-blocking file using the page cache.
> >> ---
> >>  src/Makefile           |   2 +-
> >>  src/preadv2-pwritev2.h |  52 +++++++++++++++++
> >>  src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > You should add this syscall to support to xfs_io (in the xfsprogs
> > package) rather than write a new helper for it. Mainly because:
> >
> >> +void
> >> +usage(char *prog)
> >> +{
> >> +     fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
> >> +     fprintf(stderr, "General arguments:\n");
> >> +     fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
> >> +     fprintf(stderr, "\n");
> >> +     fprintf(stderr, "Open arguments:\n");
> >> +     fprintf(stderr, "  -c Open file with O_CREAT flag\n");
> >> +     fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
> >> +     fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
> >> +     fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
> >> +     fprintf(stderr, "\n");
> >> +     fprintf(stderr, "preadv2 arguments:\n");
> >> +     fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
> >> +     fprintf(stderr, "  -p POS offset file to read at\n");
> >> +     fprintf(stderr, "  -l LEN length of file data to read\n");
> >
> > The xfs_io pread command already supports all of these functions
> > except for the RWF_NONBLOCK flag, and anyone testing bleeding edge
> > functionality is also using a bleeding edge xfs_io binary.
> >
> > Then you test for whether the functionality is available via
> > _require_xfs_io_command "preadv -n"
> >
> > .....
> >> +# test file we'll be using
> >> +file=$SCRATCH_MNT/067.preadv2.$$
> >> +
> >> +# Create a file:
> >> +# two regions of data and a hole in the middle
> >> +# use O_DIRECT so it's not in the page cache
> >> +echo "create file"
> >> +$XFS_IO_PROG -t -f -d \
> >> +     -c "pwrite 0 1024" \
> >> +     -c "pwrite 2048 1024" \
> >> +     $file > /dev/null
> >
> > This does not create holes on most filesystems. You'll need to leave
> > holes of up 64k so that 64k block size filesystem end up with single
> > block holes in them.
> 
> Noted and I shall fix this in the next round.
> 
> >
> >> +# Make sure it returns EAGAIN on uncached data
> >> +echo "uncached"
> >> +$here/src/preadv2 -n -p 0 -l 1024 $file
> >
> > $XFS_IO_PROG -c "pread -n 0 1024" $file | _filter_xfs_io
> >
> >> +
> >> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
> >> +echo "cached"
> >> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
> >> +$here/src/preadv2 -n -p 0 -l 1024 $file
> >
> > $XFS_IO_PROG -c "pread 0 4096" -c "pread -n 0 1024" $file | _filter_xfs_io
> >
> >> +
> >> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
> >> +echo "O_DIRECT"
> >> +$here/src/preadv2 -d -n -p 0 -l 1024 $file
> >
> > $XFS_IO_PROG -d -c "pread -n 0 1024" $file | _filter_xfs_io
> >
> > And so on....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> 
> Dave,
> 
> My plan is/was to wait till the main patch makes it into the upstream
> linux kernel with the syscall numbers are set in stone. Possibly after
> till glibc adds support for them. After that I was going remove my
> preadv2 application from xfs_tests and add that functionality to
> xfs_io.

I wouldn't worry too much about glibc - once the syscall numbers are
defined we can test for them directly, I think, like we do for
supporting various ioctls.

> With xfs_io living in separate repository I wanted to the case when/if
> syscall numbers change (there's a bunch of new syscalls queued around
> epoll) of having somebody test against xfs_io that has preadv2 but bad
> ids.

If the syscall numbers change, it's simple to patch, and as long as
we have set the syscall numbers in stone before a xfsprogs release
is done then there won't be any problems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] fstests: generic test for preadv2 behavior on linux
@ 2015-03-16 22:56         ` Dave Chinner
  0 siblings, 0 replies; 94+ messages in thread
From: Dave Chinner @ 2015-03-16 22:56 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton

On Mon, Mar 16, 2015 at 06:11:19PM -0400, Milosz Tanski wrote:
> On Mon, Mar 16, 2015 at 6:02 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > On Mon, Mar 16, 2015 at 02:34:22PM -0400, Milosz Tanski wrote:
> >> preadv2 is a new syscall introduced that is like preadv2 but with flag
> >> argument. The first use case of this is to let us add a flag to perform a
> >> non-blocking file using the page cache.
> >> ---
> >>  src/Makefile           |   2 +-
> >>  src/preadv2-pwritev2.h |  52 +++++++++++++++++
> >>  src/preadv2.c          | 150 +++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > You should add this syscall to support to xfs_io (in the xfsprogs
> > package) rather than write a new helper for it. Mainly because:
> >
> >> +void
> >> +usage(char *prog)
> >> +{
> >> +     fprintf(stderr, "Usage: %s [-v] [-ctdw] [-n] -p POS -l LEN <filename>\n\n", prog);
> >> +     fprintf(stderr, "General arguments:\n");
> >> +     fprintf(stderr, "  -v Verify that the syscall is supported and quit:\n");
> >> +     fprintf(stderr, "\n");
> >> +     fprintf(stderr, "Open arguments:\n");
> >> +     fprintf(stderr, "  -c Open file with O_CREAT flag\n");
> >> +     fprintf(stderr, "  -t Open file with O_TRUNC flag\n");
> >> +     fprintf(stderr, "  -d Open file with O_DIRECT flag\n");
> >> +     fprintf(stderr, "  -w Open file with O_RDWR flag vs O_RDONLY (default)\n");
> >> +     fprintf(stderr, "\n");
> >> +     fprintf(stderr, "preadv2 arguments:\n");
> >> +     fprintf(stderr, "  -n use RWF_NONBLOCK when performing read\n");
> >> +     fprintf(stderr, "  -p POS offset file to read at\n");
> >> +     fprintf(stderr, "  -l LEN length of file data to read\n");
> >
> > The xfs_io pread command already supports all of these functions
> > except for the RWF_NONBLOCK flag, and anyone testing bleeding edge
> > functionality is also using a bleeding edge xfs_io binary.
> >
> > Then you test for whether the functionality is available via
> > _require_xfs_io_command "preadv -n"
> >
> > .....
> >> +# test file we'll be using
> >> +file=$SCRATCH_MNT/067.preadv2.$$
> >> +
> >> +# Create a file:
> >> +# two regions of data and a hole in the middle
> >> +# use O_DIRECT so it's not in the page cache
> >> +echo "create file"
> >> +$XFS_IO_PROG -t -f -d \
> >> +     -c "pwrite 0 1024" \
> >> +     -c "pwrite 2048 1024" \
> >> +     $file > /dev/null
> >
> > This does not create holes on most filesystems. You'll need to leave
> > holes of up 64k so that 64k block size filesystem end up with single
> > block holes in them.
> 
> Noted and I shall fix this in the next round.
> 
> >
> >> +# Make sure it returns EAGAIN on uncached data
> >> +echo "uncached"
> >> +$here/src/preadv2 -n -p 0 -l 1024 $file
> >
> > $XFS_IO_PROG -c "pread -n 0 1024" $file | _filter_xfs_io
> >
> >> +
> >> +# Make sure we read in the whole file, after that RWF_NONBLOCK should return us all the data
> >> +echo "cached"
> >> +$XFS_IO_PROG -f $file -c "pread 0 4096" $file > /dev/null
> >> +$here/src/preadv2 -n -p 0 -l 1024 $file
> >
> > $XFS_IO_PROG -c "pread 0 4096" -c "pread -n 0 1024" $file | _filter_xfs_io
> >
> >> +
> >> +# O_DIRECT and RWF_NONBLOCK should return EAGAIN always
> >> +echo "O_DIRECT"
> >> +$here/src/preadv2 -d -n -p 0 -l 1024 $file
> >
> > $XFS_IO_PROG -d -c "pread -n 0 1024" $file | _filter_xfs_io
> >
> > And so on....
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org
> 
> Dave,
> 
> My plan is/was to wait till the main patch makes it into the upstream
> linux kernel with the syscall numbers are set in stone. Possibly after
> till glibc adds support for them. After that I was going remove my
> preadv2 application from xfs_tests and add that functionality to
> xfs_io.

I wouldn't worry too much about glibc - once the syscall numbers are
defined we can test for them directly, I think, like we do for
supporting various ioctls.

> With xfs_io living in separate repository I wanted to the case when/if
> syscall numbers change (there's a bunch of new syscalls queued around
> epoll) of having somebody test against xfs_io that has preadv2 but bad
> ids.

If the syscall numbers change, it's simple to patch, and as long as
we have set the syscall numbers in stone before a xfsprogs release
is done then there won't be any problems...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
@ 2015-03-26 11:55   ` Christoph Hellwig
  2015-03-16 18:27   ` Milosz Tanski
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-26 11:55 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton

On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> same syscalls as preadv and pwrite but with a flag argument. Additionally,
> preadv2 implements an extra RWF_NONBLOCK flag. 

There was some arugment that we just don't wait and don't have the
classic unix "blocking" semantics.  Maybe it's time to bite the bullet
and rename it to RWF_DONTWAIT? (I personally dont really care).

Second this probably needs to be on top of Al's for-next tree:

	https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next

Note that this has a flags field in struct kiocb, so we could just use
that for the flags.

Otherwise this version look fine to me, let's get it merged!

Al, are yo ready to pick this up?  I'd hate to miss another merge
window.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-26 11:55   ` Christoph Hellwig
  0 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-26 11:55 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner, Andrew Morton

On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> same syscalls as preadv and pwrite but with a flag argument. Additionally,
> preadv2 implements an extra RWF_NONBLOCK flag. 

There was some arugment that we just don't wait and don't have the
classic unix "blocking" semantics.  Maybe it's time to bite the bullet
and rename it to RWF_DONTWAIT? (I personally dont really care).

Second this probably needs to be on top of Al's for-next tree:

	https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next

Note that this has a flags field in struct kiocb, so we could just use
that for the flags.

Otherwise this version look fine to me, let's get it merged!

Al, are yo ready to pick this up?  I'd hate to miss another merge
window.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-26 11:55   ` Christoph Hellwig
@ 2015-03-26 19:12     ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-26 19:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch, Dave Chinner, Andrew Morton

On Thu, Mar 26, 2015 at 7:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
> > This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> > same syscalls as preadv and pwrite but with a flag argument. Additionally,
> > preadv2 implements an extra RWF_NONBLOCK flag.
>
> There was some arugment that we just don't wait and don't have the
> classic unix "blocking" semantics.  Maybe it's time to bite the bullet
> and rename it to RWF_DONTWAIT? (I personally dont really care).


Sure. It is in line with the MSG_DONTWAIT flag for sendmsg() which
this whole idea is based on anyways.

>
>
> Second this probably needs to be on top of Al's for-next tree:
>
>         https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next
>
> Note that this has a flags field in struct kiocb, so we could just use
> that for the flags.


Okay I started rebasing the patches on-top of that. I did see that
there is a new flags field "ki_flags". I see that there's already a
IOCB_EVENTFD flag.

Did you see you Andres' question about making the flags a enum? I
usually wouldn't do that, because C++ is stronger when it comes enums,
you can't combine them then assignment them to an enum without a cast.
But in the kernel this doesn't matter.

>
>
> Otherwise this version look fine to me, let's get it merged!
>
> Al, are yo ready to pick this up?  I'd hate to miss another merge
> window.


Just got back from vacation today, I'll try to get this out before the
end of the week. I also have to make some adjustments to xfs-tests (to
push preadv2 into xfs_io).


-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-26 19:12     ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-26 19:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch, Dave Chinner, Andrew Morton

On Thu, Mar 26, 2015 at 7:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
> > This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> > same syscalls as preadv and pwrite but with a flag argument. Additionally,
> > preadv2 implements an extra RWF_NONBLOCK flag.
>
> There was some arugment that we just don't wait and don't have the
> classic unix "blocking" semantics.  Maybe it's time to bite the bullet
> and rename it to RWF_DONTWAIT? (I personally dont really care).


Sure. It is in line with the MSG_DONTWAIT flag for sendmsg() which
this whole idea is based on anyways.

>
>
> Second this probably needs to be on top of Al's for-next tree:
>
>         https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next
>
> Note that this has a flags field in struct kiocb, so we could just use
> that for the flags.


Okay I started rebasing the patches on-top of that. I did see that
there is a new flags field "ki_flags". I see that there's already a
IOCB_EVENTFD flag.

Did you see you Andres' question about making the flags a enum? I
usually wouldn't do that, because C++ is stronger when it comes enums,
you can't combine them then assignment them to an enum without a cast.
But in the kernel this doesn't matter.

>
>
> Otherwise this version look fine to me, let's get it merged!
>
> Al, are yo ready to pick this up?  I'd hate to miss another merge
> window.


Just got back from vacation today, I'll try to get this out before the
end of the week. I also have to make some adjustments to xfs-tests (to
push preadv2 into xfs_io).


-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-26 19:12     ` Milosz Tanski
  (?)
@ 2015-03-27  2:26     ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27  2:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch, Dave Chinner, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 2355 bytes --]

On Thu, Mar 26, 2015 at 3:12 PM, Milosz Tanski <milosz@adfin.com> wrote:

> On Thu, Mar 26, 2015 at 7:55 AM, Christoph Hellwig <hch@infradead.org>
> wrote:
> >
> > On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
> > > This patchset introduces two new syscalls preadv2 and pwritev2. They
> are the
> > > same syscalls as preadv and pwrite but with a flag argument.
> Additionally,
> > > preadv2 implements an extra RWF_NONBLOCK flag.
> >
> > There was some arugment that we just don't wait and don't have the
> > classic unix "blocking" semantics.  Maybe it's time to bite the bullet
> > and rename it to RWF_DONTWAIT? (I personally dont really care).
>
>
> Sure. It is in line with the MSG_DONTWAIT flag for sendmsg() which
> this whole idea is based on anyways.
>

Okay, It's RWF_DONTWAIT now. I also split it into two RWF_DONTWAIT for the
userspace API and IOCB_DONTWAIT that goes into kiocb->ki_flags.


>
> >
> >
> > Second this probably needs to be on top of Al's for-next tree:
> >
> >
> https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next
> >
> > Note that this has a flags field in struct kiocb, so we could just use
> > that for the flags.
>
>
> Okay I started rebasing the patches on-top of that. I did see that
> there is a new flags field "ki_flags". I see that there's already a
> IOCB_EVENTFD flag.
>
> Did you see you Andres' question about making the flags a enum? I
> usually wouldn't do that, because C++ is stronger when it comes enums,
> you can't combine them then assignment them to an enum without a cast.
> But in the kernel this doesn't matter.
>
> >
> >
> > Otherwise this version look fine to me, let's get it merged!
> >
> > Al, are yo ready to pick this up?  I'd hate to miss another merge
> > window.
>
>
> Just got back from vacation today, I'll try to get this out before the
> end of the week. I also have to make some adjustments to xfs-tests (to
> push preadv2 into xfs_io).
>

You'll see the patches and the pull request against  viro/for-next
branch tomorrow. The rebase was a bit of work since you guys changed a fair
amount of stuff.


>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

[-- Attachment #2: Type: text/html, Size: 3992 bytes --]

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-26 19:12     ` Milosz Tanski
@ 2015-03-27  2:29       ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27  2:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch, Dave Chinner, Andrew Morton

On Thu, Mar 26, 2015 at 3:12 PM, Milosz Tanski <milosz@adfin.com> wrote:
> On Thu, Mar 26, 2015 at 7:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
>> > This patchset introduces two new syscalls preadv2 and pwritev2. They are the
>> > same syscalls as preadv and pwrite but with a flag argument. Additionally,
>> > preadv2 implements an extra RWF_NONBLOCK flag.
>>
>> There was some arugment that we just don't wait and don't have the
>> classic unix "blocking" semantics.  Maybe it's time to bite the bullet
>> and rename it to RWF_DONTWAIT? (I personally dont really care).
>
>
> Sure. It is in line with the MSG_DONTWAIT flag for sendmsg() which
> this whole idea is based on anyways.

Okay, It's RWF_DONTWAIT now. I also split it into two RWF_DONTWAIT for
the userspace API and IOCB_DONTWAIT that goes into kiocb->ki_flags.

>
>>
>>
>> Second this probably needs to be on top of Al's for-next tree:
>>
>>         https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next
>>
>> Note that this has a flags field in struct kiocb, so we could just use
>> that for the flags.
>
>
> Okay I started rebasing the patches on-top of that. I did see that
> there is a new flags field "ki_flags". I see that there's already a
> IOCB_EVENTFD flag.
>
> Did you see you Andres' question about making the flags a enum? I
> usually wouldn't do that, because C++ is stronger when it comes enums,
> you can't combine them then assignment them to an enum without a cast.
> But in the kernel this doesn't matter.
>
>>
>>
>> Otherwise this version look fine to me, let's get it merged!
>>
>> Al, are yo ready to pick this up?  I'd hate to miss another merge
>> window.
>
>
> Just got back from vacation today, I'll try to get this out before the
> end of the week. I also have to make some adjustments to xfs-tests (to
> push preadv2 into xfs_io).

You'll see the patches and the pull request against  viro/for-next
branch tomorrow. The rebase was a bit of work since you guys changed a
fair amount of stuff.

>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com

P.S: Sorry for the double email / HTML spam. I'm not a my primary
computer and apparently gmail randomly forgets that a thread is plain
text only. Thanks gmail.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  2:29       ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27  2:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: LKML, linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch, Dave Chinner, Andrew Morton

On Thu, Mar 26, 2015 at 3:12 PM, Milosz Tanski <milosz@adfin.com> wrote:
> On Thu, Mar 26, 2015 at 7:55 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>> On Mon, Mar 16, 2015 at 02:27:10PM -0400, Milosz Tanski wrote:
>> > This patchset introduces two new syscalls preadv2 and pwritev2. They are the
>> > same syscalls as preadv and pwrite but with a flag argument. Additionally,
>> > preadv2 implements an extra RWF_NONBLOCK flag.
>>
>> There was some arugment that we just don't wait and don't have the
>> classic unix "blocking" semantics.  Maybe it's time to bite the bullet
>> and rename it to RWF_DONTWAIT? (I personally dont really care).
>
>
> Sure. It is in line with the MSG_DONTWAIT flag for sendmsg() which
> this whole idea is based on anyways.

Okay, It's RWF_DONTWAIT now. I also split it into two RWF_DONTWAIT for
the userspace API and IOCB_DONTWAIT that goes into kiocb->ki_flags.

>
>>
>>
>> Second this probably needs to be on top of Al's for-next tree:
>>
>>         https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next
>>
>> Note that this has a flags field in struct kiocb, so we could just use
>> that for the flags.
>
>
> Okay I started rebasing the patches on-top of that. I did see that
> there is a new flags field "ki_flags". I see that there's already a
> IOCB_EVENTFD flag.
>
> Did you see you Andres' question about making the flags a enum? I
> usually wouldn't do that, because C++ is stronger when it comes enums,
> you can't combine them then assignment them to an enum without a cast.
> But in the kernel this doesn't matter.
>
>>
>>
>> Otherwise this version look fine to me, let's get it merged!
>>
>> Al, are yo ready to pick this up?  I'd hate to miss another merge
>> window.
>
>
> Just got back from vacation today, I'll try to get this out before the
> end of the week. I also have to make some adjustments to xfs-tests (to
> push preadv2 into xfs_io).

You'll see the patches and the pull request against  viro/for-next
branch tomorrow. The rebase was a bit of work since you guys changed a
fair amount of stuff.

>
>
> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: milosz@adfin.com

P.S: Sorry for the double email / HTML spam. I'm not a my primary
computer and apparently gmail randomly forgets that a thread is plain
text only. Thanks gmail.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  3:28   ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  3:28 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 16 Mar 2015 14:27:10 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> same syscalls as preadv and pwrite but with a flag argument. Additionally,
> preadv2 implements an extra RWF_NONBLOCK flag. 

I still don't understand why pwritev() exists.  We discussed this last
time but it seems nothing has changed.  I'm not seeing here an adequate
description of why it exists nor a justification for its addition.

Also, why are we adding new syscalls instead of using O_NONBLOCK?  I
think this might have been discussed before, but the changelogs haven't
been updated to reflect it - please do so.

> The RWF_NONBLOCK flag in preadv2 introduces an ability to perform a
> non-blocking read from regular files in buffered IO mode. This works by only
> for those filesystems that have data in the page cache.
> 
> We discussed these changes at this year's LSF/MM summit in Boston. More details
> on the Samba use case, the numbers, and presentation is available at this link:
> https://lists.samba.org/archive/samba-technical/2015-March/106290.html

https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
talks about "sync" but I can't find a description of what this actually
is.  It appears to perform better than anything else?


> Background:
> 
>  Using a threadpool to emulate non-blocking operations on regular buffered
>  files is a common pattern today (samba, libuv, etc...) Applications split the
>  work between network bound threads (epoll) and IO threadpool. Not every
>  application can use sendfile syscall (TLS / post-processing).
> 
>  This common pattern leads to increased request latency. Latency can be due to
>  additional synchronization between the threads or fast (cached data) request
>  stuck behind slow request (large / uncached data).
> 
>  The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
>  enqueuing operation in the threadpool if it's already available in the
>  pagecache.

A thing which bugs me about pread2() is that it is specifically
tailored to applications which are able to use a partial read result. 
ie, by sending it over the network.

But it is not very useful for the class of applications which require
that the entire read be completed before they can proceed with using
the data.  Such applications will have to run pread2(), see the short
result, save away the partial data, perform some IO then fetch the
remaining data then proceed.  By this time, the original partially read
data may have fallen out of CPU cache (or we're on a different CPU) and
the data will need to be fetched into cache a second time.

Such applications would be better served if they were able to query for
pagecache presence _before_ doing the big copy_to_user(), so they can
ensure that all the data is in pagecache before copying it in.  ie:
fincore(), perhaps supported by a synchronous POSIX_FADV_WILLNEED.

And of course fincore could be used by Samba etc to avoid blocking on
reads.  It wouldn't perform quite as well as pread2(), but I bet it's
good enough.

Bottom line: with pread2() there's still a need for fincore(), but with
fincore() there probably isn't a need for pread2().

And (again) we've discussed this before, but the patchset gets resent
as if nothing had happened.


And I'm doubtful about claims that it absolutely has to be non-blocking
100% of the time.  I bet that 99.99% is good enough.  A fincore()
option to run mark_page_accessed() against present pages would help
with the race-with-reclaim situation.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  3:28   ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  3:28 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Mon, 16 Mar 2015 14:27:10 -0400 Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> wrote:

> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> same syscalls as preadv and pwrite but with a flag argument. Additionally,
> preadv2 implements an extra RWF_NONBLOCK flag. 

I still don't understand why pwritev() exists.  We discussed this last
time but it seems nothing has changed.  I'm not seeing here an adequate
description of why it exists nor a justification for its addition.

Also, why are we adding new syscalls instead of using O_NONBLOCK?  I
think this might have been discussed before, but the changelogs haven't
been updated to reflect it - please do so.

> The RWF_NONBLOCK flag in preadv2 introduces an ability to perform a
> non-blocking read from regular files in buffered IO mode. This works by only
> for those filesystems that have data in the page cache.
> 
> We discussed these changes at this year's LSF/MM summit in Boston. More details
> on the Samba use case, the numbers, and presentation is available at this link:
> https://lists.samba.org/archive/samba-technical/2015-March/106290.html

https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
talks about "sync" but I can't find a description of what this actually
is.  It appears to perform better than anything else?


> Background:
> 
>  Using a threadpool to emulate non-blocking operations on regular buffered
>  files is a common pattern today (samba, libuv, etc...) Applications split the
>  work between network bound threads (epoll) and IO threadpool. Not every
>  application can use sendfile syscall (TLS / post-processing).
> 
>  This common pattern leads to increased request latency. Latency can be due to
>  additional synchronization between the threads or fast (cached data) request
>  stuck behind slow request (large / uncached data).
> 
>  The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
>  enqueuing operation in the threadpool if it's already available in the
>  pagecache.

A thing which bugs me about pread2() is that it is specifically
tailored to applications which are able to use a partial read result. 
ie, by sending it over the network.

But it is not very useful for the class of applications which require
that the entire read be completed before they can proceed with using
the data.  Such applications will have to run pread2(), see the short
result, save away the partial data, perform some IO then fetch the
remaining data then proceed.  By this time, the original partially read
data may have fallen out of CPU cache (or we're on a different CPU) and
the data will need to be fetched into cache a second time.

Such applications would be better served if they were able to query for
pagecache presence _before_ doing the big copy_to_user(), so they can
ensure that all the data is in pagecache before copying it in.  ie:
fincore(), perhaps supported by a synchronous POSIX_FADV_WILLNEED.

And of course fincore could be used by Samba etc to avoid blocking on
reads.  It wouldn't perform quite as well as pread2(), but I bet it's
good enough.

Bottom line: with pread2() there's still a need for fincore(), but with
fincore() there probably isn't a need for pread2().

And (again) we've discussed this before, but the patchset gets resent
as if nothing had happened.


And I'm doubtful about claims that it absolutely has to be non-blocking
100% of the time.  I bet that 99.99% is good enough.  A fincore()
option to run mark_page_accessed() against present pages would help
with the race-with-reclaim situation.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  3:28   ` Andrew Morton
@ 2015-03-27  5:41     ` Volker Lendecke
  -1 siblings, 0 replies; 94+ messages in thread
From: Volker Lendecke @ 2015-03-27  5:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> A thing which bugs me about pread2() is that it is specifically
> tailored to applications which are able to use a partial read result. 
> ie, by sending it over the network.

Can you explain what you mean by this? Samba gets a pread
request from a client for some bytes. The client will be
confused when we send less than requested although the file
is long enough to satisfy all.

> And of course fincore could be used by Samba etc to avoid blocking on
> reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> good enough.
> 
> Bottom line: with pread2() there's still a need for fincore(), but with
> fincore() there probably isn't a need for pread2().

fincore would be a second syscall per pread, and it is not
atomic. I've had discussions with MIPS based vendors who
are worried about every single syscall. This is the #1
hottest code path in Samba.

> And I'm doubtful about claims that it absolutely has to be non-blocking
> 100% of the time.  I bet that 99.99% is good enough.  A fincore()
> option to run mark_page_accessed() against present pages would help
> with the race-with-reclaim situation.

If you can make sure that after an fincore the pages remain
in memory for x milliseconds the atomicity concern might go
away.

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  5:41     ` Volker Lendecke
  0 siblings, 0 replies; 94+ messages in thread
From: Volker Lendecke @ 2015-03-27  5:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> A thing which bugs me about pread2() is that it is specifically
> tailored to applications which are able to use a partial read result. 
> ie, by sending it over the network.

Can you explain what you mean by this? Samba gets a pread
request from a client for some bytes. The client will be
confused when we send less than requested although the file
is long enough to satisfy all.

> And of course fincore could be used by Samba etc to avoid blocking on
> reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> good enough.
> 
> Bottom line: with pread2() there's still a need for fincore(), but with
> fincore() there probably isn't a need for pread2().

fincore would be a second syscall per pread, and it is not
atomic. I've had discussions with MIPS based vendors who
are worried about every single syscall. This is the #1
hottest code path in Samba.

> And I'm doubtful about claims that it absolutely has to be non-blocking
> 100% of the time.  I bet that 99.99% is good enough.  A fincore()
> option to run mark_page_accessed() against present pages would help
> with the race-with-reclaim situation.

If you can make sure that after an fincore the pages remain
in memory for x milliseconds the atomicity concern might go
away.

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  6:08       ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  6:08 UTC (permalink / raw)
  To: Volker.Lendecke
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner

On Fri, 27 Mar 2015 06:41:25 +0100 Volker Lendecke <Volker.Lendecke@sernet.de> wrote:

> On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> > A thing which bugs me about pread2() is that it is specifically
> > tailored to applications which are able to use a partial read result. 
> > ie, by sending it over the network.
> 
> Can you explain what you mean by this? Samba gets a pread
> request from a client for some bytes. The client will be
> confused when we send less than requested although the file
> is long enough to satisfy all.

Well it was my assumption that samba would be able to do something
useful with a partial read - pread() is allowed to return less than requested.

If it isn't the case that samba can use the partial read result then
what does it do?  It has to save the partial data, then do the
additional IO?  That's pretty clunky compared to

	if (it's all in cache)
		read it all now
	else
		ask a worker thread to read it all

> > And of course fincore could be used by Samba etc to avoid blocking on
> > reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> > good enough.
> > 
> > Bottom line: with pread2() there's still a need for fincore(), but with
> > fincore() there probably isn't a need for pread2().
> 
> fincore would be a second syscall per pread, and it is not
> atomic. I've had discussions with MIPS based vendors who
> are worried about every single syscall. This is the #1
> hottest code path in Samba.

Bear in mind that these operations involve physical IO and large
memcpy's.  Yes, a fincore() approach will consume more CPU but the
additional overhead will be relatively small.

Tradeoffs are involved, and it may turn out that choosing a more
flexible and powerful interface which is somewhat more CPU intensive is
a better decision.  It's hard to say until this is quantified (ie:
measured).

> > And I'm doubtful about claims that it absolutely has to be non-blocking
> > 100% of the time.  I bet that 99.99% is good enough.  A fincore()
> > option to run mark_page_accessed() against present pages would help
> > with the race-with-reclaim situation.
> 
> If you can make sure that after an fincore the pages remain
> in memory for x milliseconds the atomicity concern might go
> away.

It won't be guaranteed that the fincore()+pread() will be
non-blocking.  But blocking will be very rare.  I don't know whether
the additional expense of activating the pages within fincore() is
justified - needs runtime testing.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  6:08       ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  6:08 UTC (permalink / raw)
  To: Volker.Lendecke-3ekOc4rQMZmzQB+pC5nmwQ
  Cc: Milosz Tanski, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, 27 Mar 2015 06:41:25 +0100 Volker Lendecke <Volker.Lendecke-3ekOc4rQMZmzQB+pC5nmwQ@public.gmane.org> wrote:

> On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> > A thing which bugs me about pread2() is that it is specifically
> > tailored to applications which are able to use a partial read result. 
> > ie, by sending it over the network.
> 
> Can you explain what you mean by this? Samba gets a pread
> request from a client for some bytes. The client will be
> confused when we send less than requested although the file
> is long enough to satisfy all.

Well it was my assumption that samba would be able to do something
useful with a partial read - pread() is allowed to return less than requested.

If it isn't the case that samba can use the partial read result then
what does it do?  It has to save the partial data, then do the
additional IO?  That's pretty clunky compared to

	if (it's all in cache)
		read it all now
	else
		ask a worker thread to read it all

> > And of course fincore could be used by Samba etc to avoid blocking on
> > reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> > good enough.
> > 
> > Bottom line: with pread2() there's still a need for fincore(), but with
> > fincore() there probably isn't a need for pread2().
> 
> fincore would be a second syscall per pread, and it is not
> atomic. I've had discussions with MIPS based vendors who
> are worried about every single syscall. This is the #1
> hottest code path in Samba.

Bear in mind that these operations involve physical IO and large
memcpy's.  Yes, a fincore() approach will consume more CPU but the
additional overhead will be relatively small.

Tradeoffs are involved, and it may turn out that choosing a more
flexible and powerful interface which is somewhat more CPU intensive is
a better decision.  It's hard to say until this is quantified (ie:
measured).

> > And I'm doubtful about claims that it absolutely has to be non-blocking
> > 100% of the time.  I bet that 99.99% is good enough.  A fincore()
> > option to run mark_page_accessed() against present pages would help
> > with the race-with-reclaim situation.
> 
> If you can make sure that after an fincore the pages remain
> in memory for x milliseconds the atomicity concern might go
> away.

It won't be guaranteed that the fincore()+pread() will be
non-blocking.  But blocking will be very rare.  I don't know whether
the additional expense of activating the pages within fincore() is
justified - needs runtime testing.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  8:02         ` Volker Lendecke
  0 siblings, 0 replies; 94+ messages in thread
From: Volker Lendecke @ 2015-03-27  8:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 11:08:33PM -0700, Andrew Morton wrote:
> On Fri, 27 Mar 2015 06:41:25 +0100 Volker Lendecke <Volker.Lendecke@sernet.de> wrote:
> 
> > On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> > > A thing which bugs me about pread2() is that it is specifically
> > > tailored to applications which are able to use a partial read result. 
> > > ie, by sending it over the network.
> > 
> > Can you explain what you mean by this? Samba gets a pread
> > request from a client for some bytes. The client will be
> > confused when we send less than requested although the file
> > is long enough to satisfy all.
> 
> Well it was my assumption that samba would be able to do something
> useful with a partial read - pread() is allowed to return less than requested.

No, this is not the case. Maybe my whole understanding of
pread is wrong: I always thought that it won't return short
if the file spans the pread range. EINTR nonwithstanding.

> 	if (it's all in cache)

I know I'm repeating myself: We have a race condition here.
A small one, but it is racy. I've seen loaded systems where
we spend seconds between becoming re-scheduled. In these
systems, it will be the norm to block in later reads. And we
don't have a good way to detect this situation afterwards
and turn to threads as a precaution next time.

> 		read it all now
> 	else
> 		ask a worker thread to read it all
> 

> Bear in mind that these operations involve physical IO and large
> memcpy's.  Yes, a fincore() approach will consume more CPU but the
> additional overhead will be relatively small.

We have to pay this price for every single chunk. Without
oplocks we get 10-byte read requests. This is hard to
swallow for many vendors with small CPUs.

With best regards,

Volker Lendecke

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  8:02         ` Volker Lendecke
  0 siblings, 0 replies; 94+ messages in thread
From: Volker Lendecke @ 2015-03-27  8:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Thu, Mar 26, 2015 at 11:08:33PM -0700, Andrew Morton wrote:
> On Fri, 27 Mar 2015 06:41:25 +0100 Volker Lendecke <Volker.Lendecke@sernet.de> wrote:
> 
> > On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> > > A thing which bugs me about pread2() is that it is specifically
> > > tailored to applications which are able to use a partial read result. 
> > > ie, by sending it over the network.
> > 
> > Can you explain what you mean by this? Samba gets a pread
> > request from a client for some bytes. The client will be
> > confused when we send less than requested although the file
> > is long enough to satisfy all.
> 
> Well it was my assumption that samba would be able to do something
> useful with a partial read - pread() is allowed to return less than requested.

No, this is not the case. Maybe my whole understanding of
pread is wrong: I always thought that it won't return short
if the file spans the pread range. EINTR nonwithstanding.

> 	if (it's all in cache)

I know I'm repeating myself: We have a race condition here.
A small one, but it is racy. I've seen loaded systems where
we spend seconds between becoming re-scheduled. In these
systems, it will be the norm to block in later reads. And we
don't have a good way to detect this situation afterwards
and turn to threads as a precaution next time.

> 		read it all now
> 	else
> 		ask a worker thread to read it all
> 

> Bear in mind that these operations involve physical IO and large
> memcpy's.  Yes, a fincore() approach will consume more CPU but the
> additional overhead will be relatively small.

We have to pay this price for every single chunk. Without
oplocks we get 10-byte read requests. This is hard to
swallow for many vendors with small CPUs.

With best regards,

Volker Lendecke

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt-3ekOc4rQMZmzQB+pC5nmwQ@public.gmane.org

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  8:02         ` Volker Lendecke
  (?)
@ 2015-03-27  8:12         ` Christoph Hellwig
  -1 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-27  8:12 UTC (permalink / raw)
  To: Volker Lendecke
  Cc: Andrew Morton, Milosz Tanski, linux-kernel, Christoph Hellwig,
	linux-fsdevel, linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 09:02:51AM +0100, Volker Lendecke wrote:
> No, this is not the case. Maybe my whole understanding of
> pread is wrong: I always thought that it won't return short
> if the file spans the pread range. EINTR nonwithstanding.

Per Posix it could, however if we do it for regular file reads
hell will break lose because no one expects it.  So in practice
it can't.

> 
> > 	if (it's all in cache)
> 
> I know I'm repeating myself: We have a race condition here.
> A small one, but it is racy. I've seen loaded systems where
> we spend seconds between becoming re-scheduled. In these
> systems, it will be the norm to block in later reads. And we
> don't have a good way to detect this situation afterwards
> and turn to threads as a precaution next time.

Exactly.  One that matters in real life, too.  Nevermind that
preadv2 is a way cleaner interface than fincore.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  3:28   ` Andrew Morton
@ 2015-03-27  8:18     ` Christoph Hellwig
  -1 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-27  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> I still don't understand why pwritev() exists.  We discussed this last
> time but it seems nothing has changed.  I'm not seeing here an adequate
> description of why it exists nor a justification for its addition.

pwritev2?  I have patches to support per-I/O O_DSYNC with it, lots of
folks including Samba and SCSI targets want this because their protocols
support it.  The patches were posted with earlier versions of Miklos
series.

It's cleaner to add the two system calls in go when we plan using them
anyway and have symmetric infrastructure, and I did not hear any
disagreement with that on LSF.  Did you skip this session?

> And (again) we've discussed this before, but the patchset gets resent
> as if nothing had happened.

We had long discussiosn about it both here and at LSF.  We had everyone
agree and nod there, and only your repeated argument here, so maybe it's
not Miklos who is disonnected but you?

Also that whole fincore argument is rather hypothetic - it's only been
pushed in to ugly to live multiplexers that also expose things like
pfns,  while with preadv2 we have a trivial and easy to use API read to
merge, and various consumerms just waiting for it.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  8:18     ` Christoph Hellwig
  0 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-27  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> I still don't understand why pwritev() exists.  We discussed this last
> time but it seems nothing has changed.  I'm not seeing here an adequate
> description of why it exists nor a justification for its addition.

pwritev2?  I have patches to support per-I/O O_DSYNC with it, lots of
folks including Samba and SCSI targets want this because their protocols
support it.  The patches were posted with earlier versions of Miklos
series.

It's cleaner to add the two system calls in go when we plan using them
anyway and have symmetric infrastructure, and I did not hear any
disagreement with that on LSF.  Did you skip this session?

> And (again) we've discussed this before, but the patchset gets resent
> as if nothing had happened.

We had long discussiosn about it both here and at LSF.  We had everyone
agree and nod there, and only your repeated argument here, so maybe it's
not Miklos who is disonnected but you?

Also that whole fincore argument is rather hypothetic - it's only been
pushed in to ugly to live multiplexers that also expose things like
pfns,  while with preadv2 we have a trivial and easy to use API read to
merge, and various consumerms just waiting for it.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  8:35       ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  8:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, 27 Mar 2015 01:18:22 -0700 Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> > I still don't understand why pwritev() exists.  We discussed this last
> > time but it seems nothing has changed.  I'm not seeing here an adequate
> > description of why it exists nor a justification for its addition.
> 
> pwritev2?  I have patches to support per-I/O O_DSYNC with it, lots of
> folks including Samba and SCSI targets want this because their protocols
> support it.  The patches were posted with earlier versions of Miklos
> series.
> 
> It's cleaner to add the two system calls in go when we plan using them
> anyway and have symmetric infrastructure, and I did not hear any
> disagreement with that on LSF.  Did you skip this session?

Put it in the changelogs.  All of it.  A conference discussion
is no use to people who weren't there.

> > And (again) we've discussed this before, but the patchset gets resent
> > as if nothing had happened.
> 
> We had long discussiosn about it both here and at LSF.  We had everyone
> agree and nod there, and only your repeated argument here, so maybe it's
> not Miklos who is disonnected but you?

I don't find conferences to be a good place to conduct code and design
review.

> Also that whole fincore argument is rather hypothetic - it's only been
> pushed in to ugly to live multiplexers that also expose things like
> pfns,  while with preadv2 we have a trivial and easy to use API read to
> merge, and various consumerms just waiting for it.

fincore() doesn't have to be ugly.  Please address the design issues I
raised.  How is pread2() useful to the class of applications which
cannot proceed until all data is available?



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  8:35       ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  8:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, 27 Mar 2015 01:18:22 -0700 Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:

> On Thu, Mar 26, 2015 at 08:28:24PM -0700, Andrew Morton wrote:
> > I still don't understand why pwritev() exists.  We discussed this last
> > time but it seems nothing has changed.  I'm not seeing here an adequate
> > description of why it exists nor a justification for its addition.
> 
> pwritev2?  I have patches to support per-I/O O_DSYNC with it, lots of
> folks including Samba and SCSI targets want this because their protocols
> support it.  The patches were posted with earlier versions of Miklos
> series.
> 
> It's cleaner to add the two system calls in go when we plan using them
> anyway and have symmetric infrastructure, and I did not hear any
> disagreement with that on LSF.  Did you skip this session?

Put it in the changelogs.  All of it.  A conference discussion
is no use to people who weren't there.

> > And (again) we've discussed this before, but the patchset gets resent
> > as if nothing had happened.
> 
> We had long discussiosn about it both here and at LSF.  We had everyone
> agree and nod there, and only your repeated argument here, so maybe it's
> not Miklos who is disonnected but you?

I don't find conferences to be a good place to conduct code and design
review.

> Also that whole fincore argument is rather hypothetic - it's only been
> pushed in to ugly to live multiplexers that also expose things like
> pfns,  while with preadv2 we have a trivial and easy to use API read to
> merge, and various consumerms just waiting for it.

fincore() doesn't have to be ugly.  Please address the design issues I
raised.  How is pread2() useful to the class of applications which
cannot proceed until all data is available?

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  8:35       ` Andrew Morton
  (?)
@ 2015-03-27  8:48       ` Christoph Hellwig
  2015-03-27  9:01           ` Andrew Morton
  -1 siblings, 1 reply; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-27  8:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> fincore() doesn't have to be ugly.  Please address the design issues I
> raised.  How is pread2() useful to the class of applications which
> cannot proceed until all data is available?

It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
return -EGAIN, which causes them to bounce to the threadpool where
they call preadv(...).

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  9:01           ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  9:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > fincore() doesn't have to be ugly.  Please address the design issues I
> > raised.  How is pread2() useful to the class of applications which
> > cannot proceed until all data is available?
> 
> It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> return -EGAIN, which causes them to bounce to the threadpool where
> they call preadv(...).

(I assume you mean RWF_NONBLOCK)

That isn't how pread2() works.  If the leading one or more pages are
uptodate, pread2() will return a partial read.  Now what?  Either the
application reads the same data a second time via the worker thread
(dumb, but it will usually be a rare case) or it reads the remainder of
the data in the worker thread and splices the data back together. 
Which, as I said, will often result in a second load of the initial
read result into CPU cache.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27  9:01           ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27  9:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:

> On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > fincore() doesn't have to be ugly.  Please address the design issues I
> > raised.  How is pread2() useful to the class of applications which
> > cannot proceed until all data is available?
> 
> It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> return -EGAIN, which causes them to bounce to the threadpool where
> they call preadv(...).

(I assume you mean RWF_NONBLOCK)

That isn't how pread2() works.  If the leading one or more pages are
uptodate, pread2() will return a partial read.  Now what?  Either the
application reads the same data a second time via the worker thread
(dumb, but it will usually be a rare case) or it reads the remainder of
the data in the worker thread and splices the data back together. 
Which, as I said, will often result in a second load of the initial
read result into CPU cache.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  9:01           ` Andrew Morton
  (?)
@ 2015-03-27  9:44           ` Volker Lendecke
  -1 siblings, 0 replies; 94+ messages in thread
From: Volker Lendecke @ 2015-03-27  9:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, linux-api, Michael Kerrisk, linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > raised.  How is pread2() useful to the class of applications which
> > > cannot proceed until all data is available?
> > 
> > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > return -EGAIN, which causes them to bounce to the threadpool where
> > they call preadv(...).
> 
> (I assume you mean RWF_NONBLOCK)
> 
> That isn't how pread2() works.  If the leading one or more pages are
> uptodate, pread2() will return a partial read.  Now what?  Either the
> application reads the same data a second time via the worker thread
> (dumb, but it will usually be a rare case) or it reads the remainder of
> the data in the worker thread and splices the data back together. 
> Which, as I said, will often result in a second load of the initial
> read result into CPU cache.

Sorry, but I don't have a good picture how we are supposed
to use that. I'm fine with two syscalls, but I need a way to
tell the kernel to either block or not. Or do you want Samba
to do repeated pread calls for ever shorter blocks? Right
now I don't see a way to tell pread to either give me a
short result or really block. To me that's the core of
preadv2. I'm perfectly find for a syscall to give me a short
read instead of a global EWOULDBLOCK. I need a way to tell
the kernel which behaviour I want.

Volker

-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
http://www.sernet.de, mailto:kontakt@sernet.de

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  3:28   ` Andrew Morton
@ 2015-03-27 15:21     ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 11:28 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 16 Mar 2015 14:27:10 -0400 Milosz Tanski <milosz@adfin.com> wrote:
>
>> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
>> same syscalls as preadv and pwrite but with a flag argument. Additionally,
>> preadv2 implements an extra RWF_NONBLOCK flag.
>
> I still don't understand why pwritev() exists.  We discussed this last
> time but it seems nothing has changed.  I'm not seeing here an adequate
> description of why it exists nor a justification for its addition.

In the "Forward Looking" section there's a description of why we want
pwritev2 and what we're doing to do with it in the future. The goal is
to have two additional flags for those calls RWF_DSYNC and
RWF_NONBLOCK. As Christop mentioned modern network filesystem
protocols have per operation sync flags. And there's use cases for
guaranteeing of write dirtying pages without triggering a writeout.

The consensus from our discussion at LSF fs tack was 1) that both
preadv and pwritev should have flags to begin with, inline with the
API syscall design guidelines 2) if we're adding preadv2 we should add
a matching pwritev2 3) especially that we plan on introducing further
flags to preadv in the near future.

>
> Also, why are we adding new syscalls instead of using O_NONBLOCK?  I
> think this might have been discussed before, but the changelogs haven't
> been updated to reflect it - please do so.

In a much earlier patch series we already had the discussion on why we
can't use O_NONBLOCK for regular files. It comes down to that it
breaks some userspace applications. Link for further reference to the
thread:

https://lkml.org/lkml/2014/9/22/294
http://thread.gmane.org/gmane.linux.kernel.aio.general/4242

I will include the background in the next patchset.

>
>> The RWF_NONBLOCK flag in preadv2 introduces an ability to perform a
>> non-blocking read from regular files in buffered IO mode. This works by only
>> for those filesystems that have data in the page cache.
>>
>> We discussed these changes at this year's LSF/MM summit in Boston. More details
>> on the Samba use case, the numbers, and presentation is available at this link:
>> https://lists.samba.org/archive/samba-technical/2015-March/106290.html
>
> https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
> talks about "sync" but I can't find a description of what this actually
> is.  It appears to perform better than anything else?

Sync is the samba mode where we do not use threadpool just service the
IO request in the network thread. In a single client case if
everything is in the page cache we are aiming to be as close in
latency as sync. The reason it isn't is because the threadpool path in
samba has some additional over head. I did bring it up to the Samba
folks on their technical mailing list, they can investigate it further
if they want it.

It's impractical to use Sync anywhere we have modern SMB3 clients that
can multiplex > 100 operations over a single connection. Head-of-line
blocking would kill performance, why we need the threadpool. With the
threadpool we increase the mean (and tail) latency even if the data is
handy and we can answer it right away.

The cifs FIO engine that I wrote
https://github.com/mtanski/fio/commits/samba does not let us multiplex
multiple SMB3 request. That's not exposed in the samba client
libraries.

>
>
>> Background:
>>
>>  Using a threadpool to emulate non-blocking operations on regular buffered
>>  files is a common pattern today (samba, libuv, etc...) Applications split the
>>  work between network bound threads (epoll) and IO threadpool. Not every
>>  application can use sendfile syscall (TLS / post-processing).
>>
>>  This common pattern leads to increased request latency. Latency can be due to
>>  additional synchronization between the threads or fast (cached data) request
>>  stuck behind slow request (large / uncached data).
>>
>>  The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
>>  enqueuing operation in the threadpool if it's already available in the
>>  pagecache.
>
> A thing which bugs me about pread2() is that it is specifically
> tailored to applications which are able to use a partial read result.
> ie, by sending it over the network.
>
> But it is not very useful for the class of applications which require
> that the entire read be completed before they can proceed with using
> the data.  Such applications will have to run pread2(), see the short
> result, save away the partial data, perform some IO then fetch the
> remaining data then proceed.  By this time, the original partially read
> data may have fallen out of CPU cache (or we're on a different CPU) and
> the data will need to be fetched into cache a second time.
>
> Such applications would be better served if they were able to query for
> pagecache presence _before_ doing the big copy_to_user(), so they can
> ensure that all the data is in pagecache before copying it in.  ie:
> fincore(), perhaps supported by a synchronous POSIX_FADV_WILLNEED.
>
> And of course fincore could be used by Samba etc to avoid blocking on
> reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> good enough.

The RWF_NONBLOCK is aimed primarily at network applications. Some of
them can send a partial result down the network, and then they can
enqueue the rest in the threadpool. For applications that need the
whole value, they clearly have to wait to read in the rest, but it's
behavior that are opting into.

>
> Bottom line: with pread2() there's still a need for fincore(), but with
> fincore() there probably isn't a need for pread2().

I see fincore() and preadv2() with RWF_NONBLOCK as tangential
syscalls. You can implement a poor man's RWF_NONBLOCK in userspace
with fincore() but not all of us are fine with it's racy nature or
requiring 2 syscalls in the best case.

>
> And (again) we've discussed this before, but the patchset gets resent
> as if nothing had happened.
>
>
> And I'm doubtful about claims that it absolutely has to be non-blocking
> 100% of the time.  I bet that 99.99% is good enough.  A fincore()
> option to run mark_page_accessed() against present pages would help
> with the race-with-reclaim situation.



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 15:21     ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner

On Thu, Mar 26, 2015 at 11:28 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 16 Mar 2015 14:27:10 -0400 Milosz Tanski <milosz@adfin.com> wrote:
>
>> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
>> same syscalls as preadv and pwrite but with a flag argument. Additionally,
>> preadv2 implements an extra RWF_NONBLOCK flag.
>
> I still don't understand why pwritev() exists.  We discussed this last
> time but it seems nothing has changed.  I'm not seeing here an adequate
> description of why it exists nor a justification for its addition.

In the "Forward Looking" section there's a description of why we want
pwritev2 and what we're doing to do with it in the future. The goal is
to have two additional flags for those calls RWF_DSYNC and
RWF_NONBLOCK. As Christop mentioned modern network filesystem
protocols have per operation sync flags. And there's use cases for
guaranteeing of write dirtying pages without triggering a writeout.

The consensus from our discussion at LSF fs tack was 1) that both
preadv and pwritev should have flags to begin with, inline with the
API syscall design guidelines 2) if we're adding preadv2 we should add
a matching pwritev2 3) especially that we plan on introducing further
flags to preadv in the near future.

>
> Also, why are we adding new syscalls instead of using O_NONBLOCK?  I
> think this might have been discussed before, but the changelogs haven't
> been updated to reflect it - please do so.

In a much earlier patch series we already had the discussion on why we
can't use O_NONBLOCK for regular files. It comes down to that it
breaks some userspace applications. Link for further reference to the
thread:

https://lkml.org/lkml/2014/9/22/294
http://thread.gmane.org/gmane.linux.kernel.aio.general/4242

I will include the background in the next patchset.

>
>> The RWF_NONBLOCK flag in preadv2 introduces an ability to perform a
>> non-blocking read from regular files in buffered IO mode. This works by only
>> for those filesystems that have data in the page cache.
>>
>> We discussed these changes at this year's LSF/MM summit in Boston. More details
>> on the Samba use case, the numbers, and presentation is available at this link:
>> https://lists.samba.org/archive/samba-technical/2015-March/106290.html
>
> https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
> talks about "sync" but I can't find a description of what this actually
> is.  It appears to perform better than anything else?

Sync is the samba mode where we do not use threadpool just service the
IO request in the network thread. In a single client case if
everything is in the page cache we are aiming to be as close in
latency as sync. The reason it isn't is because the threadpool path in
samba has some additional over head. I did bring it up to the Samba
folks on their technical mailing list, they can investigate it further
if they want it.

It's impractical to use Sync anywhere we have modern SMB3 clients that
can multiplex > 100 operations over a single connection. Head-of-line
blocking would kill performance, why we need the threadpool. With the
threadpool we increase the mean (and tail) latency even if the data is
handy and we can answer it right away.

The cifs FIO engine that I wrote
https://github.com/mtanski/fio/commits/samba does not let us multiplex
multiple SMB3 request. That's not exposed in the samba client
libraries.

>
>
>> Background:
>>
>>  Using a threadpool to emulate non-blocking operations on regular buffered
>>  files is a common pattern today (samba, libuv, etc...) Applications split the
>>  work between network bound threads (epoll) and IO threadpool. Not every
>>  application can use sendfile syscall (TLS / post-processing).
>>
>>  This common pattern leads to increased request latency. Latency can be due to
>>  additional synchronization between the threads or fast (cached data) request
>>  stuck behind slow request (large / uncached data).
>>
>>  The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
>>  enqueuing operation in the threadpool if it's already available in the
>>  pagecache.
>
> A thing which bugs me about pread2() is that it is specifically
> tailored to applications which are able to use a partial read result.
> ie, by sending it over the network.
>
> But it is not very useful for the class of applications which require
> that the entire read be completed before they can proceed with using
> the data.  Such applications will have to run pread2(), see the short
> result, save away the partial data, perform some IO then fetch the
> remaining data then proceed.  By this time, the original partially read
> data may have fallen out of CPU cache (or we're on a different CPU) and
> the data will need to be fetched into cache a second time.
>
> Such applications would be better served if they were able to query for
> pagecache presence _before_ doing the big copy_to_user(), so they can
> ensure that all the data is in pagecache before copying it in.  ie:
> fincore(), perhaps supported by a synchronous POSIX_FADV_WILLNEED.
>
> And of course fincore could be used by Samba etc to avoid blocking on
> reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> good enough.

The RWF_NONBLOCK is aimed primarily at network applications. Some of
them can send a partial result down the network, and then they can
enqueue the rest in the threadpool. For applications that need the
whole value, they clearly have to wait to read in the rest, but it's
behavior that are opting into.

>
> Bottom line: with pread2() there's still a need for fincore(), but with
> fincore() there probably isn't a need for pread2().

I see fincore() and preadv2() with RWF_NONBLOCK as tangential
syscalls. You can implement a poor man's RWF_NONBLOCK in userspace
with fincore() but not all of us are fine with it's racy nature or
requiring 2 syscalls in the best case.

>
> And (again) we've discussed this before, but the patchset gets resent
> as if nothing had happened.
>
>
> And I'm doubtful about claims that it absolutely has to be non-blocking
> 100% of the time.  I bet that 99.99% is good enough.  A fincore()
> option to run mark_page_accessed() against present pages would help
> with the race-with-reclaim situation.



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27  9:01           ` Andrew Morton
@ 2015-03-27 15:58             ` Jeremy Allison
  -1 siblings, 0 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-27 15:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > raised.  How is pread2() useful to the class of applications which
> > > cannot proceed until all data is available?
> > 
> > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > return -EGAIN, which causes them to bounce to the threadpool where
> > they call preadv(...).
> 
> (I assume you mean RWF_NONBLOCK)
> 
> That isn't how pread2() works.  If the leading one or more pages are
> uptodate, pread2() will return a partial read.  Now what?  Either the
> application reads the same data a second time via the worker thread
> (dumb, but it will usually be a rare case)

The problem with the above is that we can't tell the difference
between pread2() returning a short read because the pages are not
in cache, or because someone truncated the file. So we need some
way to differentiate this.

My preference from userspace would be for pread2() to return
EAGAIN if *all* the data requested is not available (where
'all' can be less than the size requested if the file has
been truncated in the meantime).

So:

ret = pread2(fd, buf, size_wanted, RWF_NONBLOCK)

if (ret == -1) {
	if (errno == EAGAIN) {
		goto threadpool...
	}
	.. real error..
}

if (ret == size_wanted) {
	.. normal read, file not truncated...
}

if (ret < size_wanted) {
	.. file was truncated..
}

The thing I want to avoid is the case where
ret < size_wanted means only part of the file
is in cache.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 15:58             ` Jeremy Allison
  0 siblings, 0 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-27 15:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > raised.  How is pread2() useful to the class of applications which
> > > cannot proceed until all data is available?
> > 
> > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > return -EGAIN, which causes them to bounce to the threadpool where
> > they call preadv(...).
> 
> (I assume you mean RWF_NONBLOCK)
> 
> That isn't how pread2() works.  If the leading one or more pages are
> uptodate, pread2() will return a partial read.  Now what?  Either the
> application reads the same data a second time via the worker thread
> (dumb, but it will usually be a rare case)

The problem with the above is that we can't tell the difference
between pread2() returning a short read because the pages are not
in cache, or because someone truncated the file. So we need some
way to differentiate this.

My preference from userspace would be for pread2() to return
EAGAIN if *all* the data requested is not available (where
'all' can be less than the size requested if the file has
been truncated in the meantime).

So:

ret = pread2(fd, buf, size_wanted, RWF_NONBLOCK)

if (ret == -1) {
	if (errno == EAGAIN) {
		goto threadpool...
	}
	.. real error..
}

if (ret == size_wanted) {
	.. normal read, file not truncated...
}

if (ret < size_wanted) {
	.. file was truncated..
}

The thing I want to avoid is the case where
ret < size_wanted means only part of the file
is in cache.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 15:58             ` Jeremy Allison
  (?)
  (?)
@ 2015-03-27 16:30               ` Andrew Morton
  -1 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 16:30 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, 27 Mar 2015 08:58:54 -0700 Jeremy Allison <jra@samba.org> wrote:

> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> > On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > > raised.  How is pread2() useful to the class of applications which
> > > > cannot proceed until all data is available?
> > > 
> > > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > > return -EGAIN, which causes them to bounce to the threadpool where
> > > they call preadv(...).
> > 
> > (I assume you mean RWF_NONBLOCK)
> > 
> > That isn't how pread2() works.  If the leading one or more pages are
> > uptodate, pread2() will return a partial read.  Now what?  Either the
> > application reads the same data a second time via the worker thread
> > (dumb, but it will usually be a rare case)
> 
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.
> 
> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).
> 
> ...
> 
> The thing I want to avoid is the case where
> ret < size_wanted means only part of the file
> is in cache.

>From my reading of the code, pread2() will return -EAGAIN only when it
copied zero bytes to userspace.  ie, the very first page wasn't in
cache.  If pread2() does copy some data to userspace then it will
return the amount of data copied.  This is traditional read()
behaviour.

Maybe there's some other code somewhere in the patch which converts
that short read into -EAGAIN, dunno - the changelogs don't appear to
mention it and the manpage update is ambiguous about this.

But from an interface perspective the behaviour you're asking for is
insane, frankly - if the kernel copied out 8k of data then pread2()
should return 8k.  Otherwise there's no way for userspace to know that
the 8k copy actually happened and we have just wasted a great pile of
CPU doing a pointless memcpy.

I expect that this situation (first part in cache, latter part not in
cache) is rare - for reasonably small requests the common cases will be
"all cached" and "nothing cached".  So perhaps the best approach here
is for samba to add special handling for the short read, to work out
the reason for its occurrence.

Alternatively we could add another flag to pread2() to select this
"throw away my data and return -EAGAIN" behaviour.  Presumably
implemented with an i_size check, but it's gonna be racy.



I take it from your comments that nobody has actually wired up pread2()
into samba yet?  That's a bit disturbing, because if we later want to
go and change something like this short-read behaviour, we're screwed -
it's a non back-compat userspace-visible change.


And a note on cosmetics: why are we using EAGAIN here rather than
EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
better name - EAGAIN says "run it again", but that won't work.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 16:30               ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 16:30 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Milosz Tanski,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, 27 Mar 2015 08:58:54 -0700 Jeremy Allison <jra-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org> wrote:

> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> > On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> > 
> > > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > > raised.  How is pread2() useful to the class of applications which
> > > > cannot proceed until all data is available?
> > > 
> > > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > > return -EGAIN, which causes them to bounce to the threadpool where
> > > they call preadv(...).
> > 
> > (I assume you mean RWF_NONBLOCK)
> > 
> > That isn't how pread2() works.  If the leading one or more pages are
> > uptodate, pread2() will return a partial read.  Now what?  Either the
> > application reads the same data a second time via the worker thread
> > (dumb, but it will usually be a rare case)
> 
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.
> 
> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).
> 
> ...
> 
> The thing I want to avoid is the case where
> ret < size_wanted means only part of the file
> is in cache.

>From my reading of the code, pread2() will return -EAGAIN only when it
copied zero bytes to userspace.  ie, the very first page wasn't in
cache.  If pread2() does copy some data to userspace then it will
return the amount of data copied.  This is traditional read()
behaviour.

Maybe there's some other code somewhere in the patch which converts
that short read into -EAGAIN, dunno - the changelogs don't appear to
mention it and the manpage update is ambiguous about this.

But from an interface perspective the behaviour you're asking for is
insane, frankly - if the kernel copied out 8k of data then pread2()
should return 8k.  Otherwise there's no way for userspace to know that
the 8k copy actually happened and we have just wasted a great pile of
CPU doing a pointless memcpy.

I expect that this situation (first part in cache, latter part not in
cache) is rare - for reasonably small requests the common cases will be
"all cached" and "nothing cached".  So perhaps the best approach here
is for samba to add special handling for the short read, to work out
the reason for its occurrence.

Alternatively we could add another flag to pread2() to select this
"throw away my data and return -EAGAIN" behaviour.  Presumably
implemented with an i_size check, but it's gonna be racy.



I take it from your comments that nobody has actually wired up pread2()
into samba yet?  That's a bit disturbing, because if we later want to
go and change something like this short-read behaviour, we're screwed -
it's a non back-compat userspace-visible change.


And a note on cosmetics: why are we using EAGAIN here rather than
EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
better name - EAGAIN says "run it again", but that won't work.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 16:30               ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 16:30 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Milosz Tanski,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, 27 Mar 2015 08:58:54 -0700 Jeremy Allison <jra-eUNUBHrolfbYtjvyW6yDsg@public.gmane.org> wrote:

> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> > On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> > 
> > > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > > raised.  How is pread2() useful to the class of applications which
> > > > cannot proceed until all data is available?
> > > 
> > > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > > return -EGAIN, which causes them to bounce to the threadpool where
> > > they call preadv(...).
> > 
> > (I assume you mean RWF_NONBLOCK)
> > 
> > That isn't how pread2() works.  If the leading one or more pages are
> > uptodate, pread2() will return a partial read.  Now what?  Either the
> > application reads the same data a second time via the worker thread
> > (dumb, but it will usually be a rare case)
> 
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.
> 
> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).
> 
> ...
> 
> The thing I want to avoid is the case where
> ret < size_wanted means only part of the file
> is in cache.

From my reading of the code, pread2() will return -EAGAIN only when it
copied zero bytes to userspace.  ie, the very first page wasn't in
cache.  If pread2() does copy some data to userspace then it will
return the amount of data copied.  This is traditional read()
behaviour.

Maybe there's some other code somewhere in the patch which converts
that short read into -EAGAIN, dunno - the changelogs don't appear to
mention it and the manpage update is ambiguous about this.

But from an interface perspective the behaviour you're asking for is
insane, frankly - if the kernel copied out 8k of data then pread2()
should return 8k.  Otherwise there's no way for userspace to know that
the 8k copy actually happened and we have just wasted a great pile of
CPU doing a pointless memcpy.

I expect that this situation (first part in cache, latter part not in
cache) is rare - for reasonably small requests the common cases will be
"all cached" and "nothing cached".  So perhaps the best approach here
is for samba to add special handling for the short read, to work out
the reason for its occurrence.

Alternatively we could add another flag to pread2() to select this
"throw away my data and return -EAGAIN" behaviour.  Presumably
implemented with an i_size check, but it's gonna be racy.



I take it from your comments that nobody has actually wired up pread2()
into samba yet?  That's a bit disturbing, because if we later want to
go and change something like this short-read behaviour, we're screwed -
it's a non back-compat userspace-visible change.


And a note on cosmetics: why are we using EAGAIN here rather than
EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
better name - EAGAIN says "run it again", but that won't work.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 16:30               ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 16:30 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, 27 Mar 2015 08:58:54 -0700 Jeremy Allison <jra@samba.org> wrote:

> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
> > On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> > 
> > > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
> > > > fincore() doesn't have to be ugly.  Please address the design issues I
> > > > raised.  How is pread2() useful to the class of applications which
> > > > cannot proceed until all data is available?
> > > 
> > > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
> > > return -EGAIN, which causes them to bounce to the threadpool where
> > > they call preadv(...).
> > 
> > (I assume you mean RWF_NONBLOCK)
> > 
> > That isn't how pread2() works.  If the leading one or more pages are
> > uptodate, pread2() will return a partial read.  Now what?  Either the
> > application reads the same data a second time via the worker thread
> > (dumb, but it will usually be a rare case)
> 
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.
> 
> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).
> 
> ...
> 
> The thing I want to avoid is the case where
> ret < size_wanted means only part of the file
> is in cache.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 15:58             ` Jeremy Allison
@ 2015-03-27 16:38               ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27 16:38 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Andrew Morton, Christoph Hellwig, LKML, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 11:58 AM, Jeremy Allison <jra@samba.org> wrote:
> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
>> On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>>
>> > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
>> > > fincore() doesn't have to be ugly.  Please address the design issues I
>> > > raised.  How is pread2() useful to the class of applications which
>> > > cannot proceed until all data is available?
>> >
>> > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
>> > return -EGAIN, which causes them to bounce to the threadpool where
>> > they call preadv(...).
>>
>> (I assume you mean RWF_NONBLOCK)
>>
>> That isn't how pread2() works.  If the leading one or more pages are
>> uptodate, pread2() will return a partial read.  Now what?  Either the
>> application reads the same data a second time via the worker thread
>> (dumb, but it will usually be a rare case)
>
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.
>
> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).
>
> So:
>
> ret = pread2(fd, buf, size_wanted, RWF_NONBLOCK)
>
> if (ret == -1) {
>         if (errno == EAGAIN) {
>                 goto threadpool...
>         }
>         .. real error..
> }
>
> if (ret == size_wanted) {
>         .. normal read, file not truncated...
> }
>
> if (ret < size_wanted) {
>         .. file was truncated..
> }
>
> The thing I want to avoid is the case where
> ret < size_wanted means only part of the file
> is in cache.

I very much like the short read behavior. It lets you overlap some CPU
work partial data (like TLS and then sticking it network output
buffer) with waiting for the test of the data (enequed in the thread
pool).

Short reads are the current behavior, if you call preadv2 a second
time around at EOF it'll return 0 instead of EWOULDBLOCK today. I
actually test for this in the preadv2 test in xfstest here:
https://github.com/mtanski/xfstests/commit/688db24c292999c81ee17caf2b61fe8cf7bb3cd6#diff-114416ea98ce29dde3b5b3d145afbd2bR81.

There's one caveat, that it's possible to get EWOULDBLOCK when reading
at end of file if the file metadata is not paged in.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 16:38               ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27 16:38 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Andrew Morton, Christoph Hellwig, LKML, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 11:58 AM, Jeremy Allison <jra@samba.org> wrote:
> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
>> On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>>
>> > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
>> > > fincore() doesn't have to be ugly.  Please address the design issues I
>> > > raised.  How is pread2() useful to the class of applications which
>> > > cannot proceed until all data is available?
>> >
>> > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
>> > return -EGAIN, which causes them to bounce to the threadpool where
>> > they call preadv(...).
>>
>> (I assume you mean RWF_NONBLOCK)
>>
>> That isn't how pread2() works.  If the leading one or more pages are
>> uptodate, pread2() will return a partial read.  Now what?  Either the
>> application reads the same data a second time via the worker thread
>> (dumb, but it will usually be a rare case)
>
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.
>
> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).
>
> So:
>
> ret = pread2(fd, buf, size_wanted, RWF_NONBLOCK)
>
> if (ret == -1) {
>         if (errno == EAGAIN) {
>                 goto threadpool...
>         }
>         .. real error..
> }
>
> if (ret == size_wanted) {
>         .. normal read, file not truncated...
> }
>
> if (ret < size_wanted) {
>         .. file was truncated..
> }
>
> The thing I want to avoid is the case where
> ret < size_wanted means only part of the file
> is in cache.

I very much like the short read behavior. It lets you overlap some CPU
work partial data (like TLS and then sticking it network output
buffer) with waiting for the test of the data (enequed in the thread
pool).

Short reads are the current behavior, if you call preadv2 a second
time around at EOF it'll return 0 instead of EWOULDBLOCK today. I
actually test for this in the preadv2 test in xfstest here:
https://github.com/mtanski/xfstests/commit/688db24c292999c81ee17caf2b61fe8cf7bb3cd6#diff-114416ea98ce29dde3b5b3d145afbd2bR81.

There's one caveat, that it's possible to get EWOULDBLOCK when reading
at end of file if the file metadata is not paged in.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 16:30               ` Andrew Morton
@ 2015-03-27 16:39                 ` Jeremy Allison
  -1 siblings, 0 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-27 16:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 09:30:46AM -0700, Andrew Morton wrote:
> 
> But from an interface perspective the behaviour you're asking for is
> insane, frankly - if the kernel copied out 8k of data then pread2()
> should return 8k.  Otherwise there's no way for userspace to know that
> the 8k copy actually happened and we have just wasted a great pile of
> CPU doing a pointless memcpy.

Why would it do the copy in the first place if we asked (for example)
for 16k, but only 8k was available ? Just return EAGAIN and have
done with it.

> I expect that this situation (first part in cache, latter part not in
> cache) is rare - for reasonably small requests the common cases will be
> "all cached" and "nothing cached".  So perhaps the best approach here
> is for samba to add special handling for the short read, to work out
> the reason for its occurrence.

We can do that, but as Volker says this is a very hot code path.

> I take it from your comments that nobody has actually wired up pread2()
> into samba yet?  That's a bit disturbing, because if we later want to
> go and change something like this short-read behaviour, we're screwed -
> it's a non back-compat userspace-visible change.

It's been done as a test, so the code exists and has run (and improved
perforamance as I recall). Not much point commiting it without kernel
support :-).

> And a note on cosmetics: why are we using EAGAIN here rather than
> EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
> better name - EAGAIN says "run it again", but that won't work.

Sounds good to me !

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-27 16:39                 ` Jeremy Allison
  0 siblings, 0 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-27 16:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 09:30:46AM -0700, Andrew Morton wrote:
> 
> But from an interface perspective the behaviour you're asking for is
> insane, frankly - if the kernel copied out 8k of data then pread2()
> should return 8k.  Otherwise there's no way for userspace to know that
> the 8k copy actually happened and we have just wasted a great pile of
> CPU doing a pointless memcpy.

Why would it do the copy in the first place if we asked (for example)
for 16k, but only 8k was available ? Just return EAGAIN and have
done with it.

> I expect that this situation (first part in cache, latter part not in
> cache) is rare - for reasonably small requests the common cases will be
> "all cached" and "nothing cached".  So perhaps the best approach here
> is for samba to add special handling for the short read, to work out
> the reason for its occurrence.

We can do that, but as Volker says this is a very hot code path.

> I take it from your comments that nobody has actually wired up pread2()
> into samba yet?  That's a bit disturbing, because if we later want to
> go and change something like this short-read behaviour, we're screwed -
> it's a non back-compat userspace-visible change.

It's been done as a test, so the code exists and has run (and improved
perforamance as I recall). Not much point commiting it without kernel
support :-).

> And a note on cosmetics: why are we using EAGAIN here rather than
> EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
> better name - EAGAIN says "run it again", but that won't work.

Sounds good to me !

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 16:30               ` Andrew Morton
                                 ` (3 preceding siblings ...)
  (?)
@ 2015-03-27 16:39               ` Andrew Morton
  -1 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 16:39 UTC (permalink / raw)
  To: Jeremy Allison, Christoph Hellwig, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Fri, 27 Mar 2015 09:30:46 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> I expect that this situation (first part in cache, latter part not in
> cache) is rare - for reasonably small requests the common cases will be
> "all cached" and "nothing cached".  So perhaps the best approach here
> is for samba to add special handling for the short read, to work out
> the reason for its occurrence.
> 
> Alternatively we could add another flag to pread2() to select this
> "throw away my data and return -EAGAIN" behaviour.  Presumably
> implemented with an i_size check, but it's gonna be racy.

Here's a better way:

	nr_read = pread2(buf, len);
	if (nr_read < len)
		nr_read += pread(buf + nr_read, len - nr_read);
	if (nr_read < len)
		we_hit_eof();

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 16:30               ` Andrew Morton
                                 ` (4 preceding siblings ...)
  (?)
@ 2015-03-27 16:45               ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-27 16:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 12:30 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 27 Mar 2015 08:58:54 -0700 Jeremy Allison <jra@samba.org> wrote:
>
>> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
>> > On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>> >
>> > > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
>> > > > fincore() doesn't have to be ugly.  Please address the design issues I
>> > > > raised.  How is pread2() useful to the class of applications which
>> > > > cannot proceed until all data is available?
>> > >
>> > > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
>> > > return -EGAIN, which causes them to bounce to the threadpool where
>> > > they call preadv(...).
>> >
>> > (I assume you mean RWF_NONBLOCK)
>> >
>> > That isn't how pread2() works.  If the leading one or more pages are
>> > uptodate, pread2() will return a partial read.  Now what?  Either the
>> > application reads the same data a second time via the worker thread
>> > (dumb, but it will usually be a rare case)
>>
>> The problem with the above is that we can't tell the difference
>> between pread2() returning a short read because the pages are not
>> in cache, or because someone truncated the file. So we need some
>> way to differentiate this.
>>
>> My preference from userspace would be for pread2() to return
>> EAGAIN if *all* the data requested is not available (where
>> 'all' can be less than the size requested if the file has
>> been truncated in the meantime).
>>
>> ...
>>
>> The thing I want to avoid is the case where
>> ret < size_wanted means only part of the file
>> is in cache.
>
> From my reading of the code, pread2() will return -EAGAIN only when it
> copied zero bytes to userspace.  ie, the very first page wasn't in
> cache.  If pread2() does copy some data to userspace then it will
> return the amount of data copied.  This is traditional read()
> behaviour.
>
> Maybe there's some other code somewhere in the patch which converts
> that short read into -EAGAIN, dunno - the changelogs don't appear to
> mention it and the manpage update is ambiguous about this.
>
> But from an interface perspective the behaviour you're asking for is
> insane, frankly - if the kernel copied out 8k of data then pread2()
> should return 8k.  Otherwise there's no way for userspace to know that
> the 8k copy actually happened and we have just wasted a great pile of
> CPU doing a pointless memcpy.
>
> I expect that this situation (first part in cache, latter part not in
> cache) is rare - for reasonably small requests the common cases will be
> "all cached" and "nothing cached".  So perhaps the best approach here
> is for samba to add special handling for the short read, to work out
> the reason for its occurrence.
>
> Alternatively we could add another flag to pread2() to select this
> "throw away my data and return -EAGAIN" behaviour.  Presumably
> implemented with an i_size check, but it's gonna be racy.
>
>
>
> I take it from your comments that nobody has actually wired up pread2()
> into samba yet?  That's a bit disturbing, because if we later want to
> go and change something like this short-read behaviour, we're screwed -
> it's a non back-compat userspace-visible change.

Volker and did wired so we can use Samba as a test / use case. The
change we made was quick and dirty 9 lines of code, if you exclude the
syscall boiler plate. In fact, right now it does the stupid thing of
throwing away the partial result and enqueing in the threadpool if it
doesn't get the whole block. Volker agreed that was as much as we need
to do to get the numbers and we'll make a proper patch once it's in
upstream.

Patch to samba at end of email for reference.

>
>
> And a note on cosmetics: why are we using EAGAIN here rather than
> EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
> better name - EAGAIN says "run it again", but that won't work.

You're right. I will fix this.

diff --git a/source3/modules/vfs_default.c b/source3/modules/vfs_default.c
index 5634cc0..90348d8 100644
--- a/source3/modules/vfs_default.c
+++ b/source3/modules/vfs_default.c
@@ -34,6 +34,29 @@
 #include "lib/util/tevent_ntstatus.h"
 #include "lib/sys_rw.h"

+#include <pthread.h>
+
+#define __NR_preadv2 322
+#define __NR_pwritev2 323
+#define RWF_NONBLOCK 1
+
+#define LO_HI_LONG(val) \
+       (off_t) val, \
+       (off_t) ((((uint64_t) (val)) >> (sizeof (long) * 4)) >>
(sizeof (long) * 4))
+
+static inline
+int preadv2(int fd, const struct iovec *iov, int iovcnt, off_t
offset, int flags)
+{
+       return syscall(__NR_preadv2, fd, iov, iovcnt,
LO_HI_LONG(offset), flags);
+}
+
+static inline
+int pread2(int fd, void *data, size_t len, off_t offset, int flags)
+{
+       struct iovec iov = { data, len };
+       return preadv2(fd, &iov, 1, offset, flags);
+}
+
 #undef DBGC_CLASS
 #define DBGC_CLASS DBGC_VFS

@@ -718,6 +741,7 @@ static struct tevent_req
*vfswrap_pread_send(struct vfs_handle_struct *handle,
        struct tevent_req *req;
        struct vfswrap_asys_state *state;
        int ret;
+       ssize_t nread;

        req = tevent_req_create(mem_ctx, &state, struct vfswrap_asys_state);
        if (req == NULL) {
@@ -730,6 +754,14 @@ static struct tevent_req
*vfswrap_pread_send(struct vfs_handle_struct *handle,
        state->asys_ctx = handle->conn->sconn->asys_ctx;
        state->req = req;

+       nread = pread2(fsp->fh->fd, data, n, offset, RWF_NONBLOCK);
+       // TODO: partial reads
+       if (nread == n) {
+               state->ret = nread;
+               tevent_req_done(req);
+               return tevent_req_post(req, ev);
+       }
+
        SMBPROFILE_BYTES_ASYNC_START(syscall_asys_pread, profile_p,
                                     state->profile_bytes, n);
        ret = asys_pread(state->asys_ctx, fsp->fh->fd, data, n, offset, req);

^ permalink raw reply related	[flat|nested] 94+ messages in thread

* Re: [PATCH] Add preadv2/pwritev2 documentation.
  2015-03-16 18:32 ` [PATCH] Add preadv2/pwritev2 documentation Milosz Tanski
@ 2015-03-27 16:49   ` Andrew Morton
  2015-03-30  7:33       ` Christoph Hellwig
  0 siblings, 1 reply; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 16:49 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: linux-kernel, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 16 Mar 2015 14:32:26 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> +.BR pwritev2 ()
>  can also fail for the same reasons as
>  .BR lseek (2).
> -Additionally, the following error is defined:
> +Additionally, the following errors are defined:
> +.TP
> +.B EAGAIN
> +The operation would block. This is possible if the file descriptor \fIfd\fP refers to a socket and has been marked nonblocking
> +.RB ( O_NONBLOCK ),
> +or the operation is a
> +.BR preadv2
> +and the \fIflags\fP argument is set to
> +.BR RWF_NONBLOCK.

Can you please expand on this to describe the circumstances in which
pread2() returns -EAGAIN?  Let's fix up the short-read confusion.

And let's have a think about EAGAIN vs EWOULDBLOCK.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 15:21     ` Milosz Tanski
  (?)
@ 2015-03-27 17:04     ` Andrew Morton
  2015-03-30  7:40         ` Christoph Hellwig
  -1 siblings, 1 reply; 94+ messages in thread
From: Andrew Morton @ 2015-03-27 17:04 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: LKML, Christoph Hellwig, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner

On Fri, 27 Mar 2015 11:21:26 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> On Thu, Mar 26, 2015 at 11:28 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Mon, 16 Mar 2015 14:27:10 -0400 Milosz Tanski <milosz@adfin.com> wrote:
> >
> >> This patchset introduces two new syscalls preadv2 and pwritev2. They are the
> >> same syscalls as preadv and pwrite but with a flag argument. Additionally,
> >> preadv2 implements an extra RWF_NONBLOCK flag.
> >
> > I still don't understand why pwritev() exists.  We discussed this last
> > time but it seems nothing has changed.  I'm not seeing here an adequate
> > description of why it exists nor a justification for its addition.
> 
> In the "Forward Looking" section there's a description of why we want
> pwritev2 and what we're doing to do with it in the future. The goal is
> to have two additional flags for those calls RWF_DSYNC and
> RWF_NONBLOCK. As Christop mentioned modern network filesystem
> protocols have per operation sync flags. And there's use cases for
> guaranteeing of write dirtying pages without triggering a writeout.
> 
> The consensus from our discussion at LSF fs tack was 1) that both
> preadv and pwritev should have flags to begin with, inline with the
> API syscall design guidelines 2) if we're adding preadv2 we should add
> a matching pwritev2 3) especially that we plan on introducing further
> flags to preadv in the near future.

mm...  I don't think we should be adding placeholders to the kernel API
to support code which hasn't been written, tested, reviewed, merged,
etc.  It's possible none of this will ever happen and we end up with a
syscall nobody needs or uses.  Plus it's always possible that during
this development we decide the pwrite2() interface needs alteration but
it's too late.

What would be the downside of deferring pwrite2() until it's all
implemented?

> >
> > Also, why are we adding new syscalls instead of using O_NONBLOCK?  I
> > think this might have been discussed before, but the changelogs haven't
> > been updated to reflect it - please do so.
> 
> In a much earlier patch series we already had the discussion on why we
> can't use O_NONBLOCK for regular files. It comes down to that it
> breaks some userspace applications. Link for further reference to the
> thread:
> 
> https://lkml.org/lkml/2014/9/22/294
> http://thread.gmane.org/gmane.linux.kernel.aio.general/4242
> 
> I will include the background in the next patchset.

Cool, thanks.

> >
> >> The RWF_NONBLOCK flag in preadv2 introduces an ability to perform a
> >> non-blocking read from regular files in buffered IO mode. This works by only
> >> for those filesystems that have data in the page cache.
> >>
> >> We discussed these changes at this year's LSF/MM summit in Boston. More details
> >> on the Samba use case, the numbers, and presentation is available at this link:
> >> https://lists.samba.org/archive/samba-technical/2015-March/106290.html
> >
> > https://drive.google.com/file/d/0B3maCn0jCvYncndGbXJKbGlhejQ/view?usp=sharing
> > talks about "sync" but I can't find a description of what this actually
> > is.  It appears to perform better than anything else?
> 
> Sync is the samba mode where we do not use threadpool just service the
> IO request in the network thread. In a single client case if
> everything is in the page cache we are aiming to be as close in
> latency as sync. The reason it isn't is because the threadpool path in
> samba has some additional over head. I did bring it up to the Samba
> folks on their technical mailing list, they can investigate it further
> if they want it.
> 
> It's impractical to use Sync anywhere we have modern SMB3 clients that
> can multiplex > 100 operations over a single connection. Head-of-line
> blocking would kill performance, why we need the threadpool. With the
> threadpool we increase the mean (and tail) latency even if the data is
> handy and we can answer it right away.

OK, all makes sense.

> >
> >> Background:
> >>
> >>  Using a threadpool to emulate non-blocking operations on regular buffered
> >>  files is a common pattern today (samba, libuv, etc...) Applications split the
> >>  work between network bound threads (epoll) and IO threadpool. Not every
> >>  application can use sendfile syscall (TLS / post-processing).
> >>
> >>  This common pattern leads to increased request latency. Latency can be due to
> >>  additional synchronization between the threads or fast (cached data) request
> >>  stuck behind slow request (large / uncached data).
> >>
> >>  The preadv2 syscall with RWF_NONBLOCK lets userspace applications bypass
> >>  enqueuing operation in the threadpool if it's already available in the
> >>  pagecache.
> >
> > A thing which bugs me about pread2() is that it is specifically
> > tailored to applications which are able to use a partial read result.
> > ie, by sending it over the network.
> >
> > But it is not very useful for the class of applications which require
> > that the entire read be completed before they can proceed with using
> > the data.  Such applications will have to run pread2(), see the short
> > result, save away the partial data, perform some IO then fetch the
> > remaining data then proceed.  By this time, the original partially read
> > data may have fallen out of CPU cache (or we're on a different CPU) and
> > the data will need to be fetched into cache a second time.
> >
> > Such applications would be better served if they were able to query for
> > pagecache presence _before_ doing the big copy_to_user(), so they can
> > ensure that all the data is in pagecache before copying it in.  ie:
> > fincore(), perhaps supported by a synchronous POSIX_FADV_WILLNEED.
> >
> > And of course fincore could be used by Samba etc to avoid blocking on
> > reads.  It wouldn't perform quite as well as pread2(), but I bet it's
> > good enough.
> 
> The RWF_NONBLOCK is aimed primarily at network applications. Some of
> them can send a partial result down the network, and then they can
> enqueue the rest in the threadpool. For applications that need the
> whole value, they clearly have to wait to read in the rest, but it's
> behavior that are opting into.

Right.  But these applications would prefer "all or nothing" behaviour.
They can get that with fincore()+pread() but they can't get it with
pread2().  Because pread2() takes the two separate concepts of "are
these pages in cache" and "copy these pages to me" and joins them
together in a single operation.

> >
> > Bottom line: with pread2() there's still a need for fincore(), but with
> > fincore() there probably isn't a need for pread2().
> 
> I see fincore() and preadv2() with RWF_NONBLOCK as tangential
> syscalls. You can implement a poor man's RWF_NONBLOCK in userspace
> with fincore() but not all of us are fine with it's racy nature or
> requiring 2 syscalls in the best case.

I find this to be too handwavy to be able to process it, sorry.

- WHY is the raciness unacceptable?  If 0.0001% of pread2() calls
  inadvertently block, what problem does this cause?  0.01%?  Details
  please, lots of them.

- Yes, 2 syscalls is more expensive.  But 

  a) we don't know *how* expensive.  It might be negligible.

  b) bear in mind that this patchset is adding minor additional overhead
     to every linux application that does read().  To benefit the teeny
     teeny minority of apps which use pread2().

btw, that overhead might be slightly reduced by doing

-	if (iocb->ki_rwflags & RWF_NONBLOCK)
+	if (unlikely(iocb->ki_rwflags & RWF_NONBLOCK))
		return -EAGAIN;

in the appropriate places.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] Add preadv2/pwritev2 documentation.
@ 2015-03-30  7:33       ` Christoph Hellwig
  0 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-30  7:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel, Christoph Hellwig, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 09:49:32AM -0700, Andrew Morton wrote:
> And let's have a think about EAGAIN vs EWOULDBLOCK.

Do you want a sentence saying they are always the same in Linux?

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH] Add preadv2/pwritev2 documentation.
@ 2015-03-30  7:33       ` Christoph Hellwig
  0 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-30  7:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Christoph Hellwig, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, Mar 27, 2015 at 09:49:32AM -0700, Andrew Morton wrote:
> And let's have a think about EAGAIN vs EWOULDBLOCK.

Do you want a sentence saying they are always the same in Linux?

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 15:58             ` Jeremy Allison
                               ` (2 preceding siblings ...)
  (?)
@ 2015-03-30  7:36             ` Christoph Hellwig
  2015-03-30 17:19                 ` Jeremy Allison
                                 ` (2 more replies)
  -1 siblings, 3 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-30  7:36 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Andrew Morton, Christoph Hellwig, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
> The problem with the above is that we can't tell the difference
> between pread2() returning a short read because the pages are not
> in cache, or because someone truncated the file. So we need some
> way to differentiate this.

Is a race vs truncate really that time critical that you can't
wait for the thread pool to do the second read to notice it?

> My preference from userspace would be for pread2() to return
> EAGAIN if *all* the data requested is not available (where
> 'all' can be less than the size requested if the file has
> been truncated in the meantime).

That is easily implementable, but I can see that for example web apps
would be happy to get as much as possible.  So if Samba can be ok
with short reads and only detecting the truncated case in the slow
path that would make life simpler.  Otherwise we might indeed need two
flags.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 17:04     ` Andrew Morton
@ 2015-03-30  7:40         ` Christoph Hellwig
  0 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-30  7:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, LKML, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 10:04:11AM -0700, Andrew Morton wrote:
> mm...  I don't think we should be adding placeholders to the kernel API
> to support code which hasn't been written, tested, reviewed, merged,
> etc.  It's possible none of this will ever happen and we end up with a
> syscall nobody needs or uses.  Plus it's always possible that during
> this development we decide the pwrite2() interface needs alteration but
> it's too late.
> 
> What would be the downside of deferring pwrite2() until it's all
> implemented?

It _is_ implemented.  I just decided to submit it separately as Miklos
already has to deal with enough bikeshedding for his feature that I
don't want to put the burden of dealing with the BS for the one I wrote
on him.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30  7:40         ` Christoph Hellwig
  0 siblings, 0 replies; 94+ messages in thread
From: Christoph Hellwig @ 2015-03-30  7:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Milosz Tanski, LKML, Christoph Hellwig, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 10:04:11AM -0700, Andrew Morton wrote:
> mm...  I don't think we should be adding placeholders to the kernel API
> to support code which hasn't been written, tested, reviewed, merged,
> etc.  It's possible none of this will ever happen and we end up with a
> syscall nobody needs or uses.  Plus it's always possible that during
> this development we decide the pwrite2() interface needs alteration but
> it's too late.
> 
> What would be the downside of deferring pwrite2() until it's all
> implemented?

It _is_ implemented.  I just decided to submit it separately as Miklos
already has to deal with enough bikeshedding for his feature that I
don't want to put the burden of dealing with the BS for the one I wrote
on him.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 17:19                 ` Jeremy Allison
  0 siblings, 0 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-30 17:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Andrew Morton, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 12:36:04AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
> > The problem with the above is that we can't tell the difference
> > between pread2() returning a short read because the pages are not
> > in cache, or because someone truncated the file. So we need some
> > way to differentiate this.
> 
> Is a race vs truncate really that time critical that you can't
> wait for the thread pool to do the second read to notice it?

Probably not, as this is the fallback path anyway.

> > My preference from userspace would be for pread2() to return
> > EAGAIN if *all* the data requested is not available (where
> > 'all' can be less than the size requested if the file has
> > been truncated in the meantime).
> 
> That is easily implementable, but I can see that for example web apps
> would be happy to get as much as possible.  So if Samba can be ok
> with short reads and only detecting the truncated case in the slow
> path that would make life simpler.  Otherwise we might indeed need two
> flags.

Simpler is better. I can live with the partial read+fallback.

Jeremy.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 17:19                 ` Jeremy Allison
  0 siblings, 0 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-30 17:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Andrew Morton, Milosz Tanski,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Michael Kerrisk,
	linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Mon, Mar 30, 2015 at 12:36:04AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
> > The problem with the above is that we can't tell the difference
> > between pread2() returning a short read because the pages are not
> > in cache, or because someone truncated the file. So we need some
> > way to differentiate this.
> 
> Is a race vs truncate really that time critical that you can't
> wait for the thread pool to do the second read to notice it?

Probably not, as this is the fallback path anyway.

> > My preference from userspace would be for pread2() to return
> > EAGAIN if *all* the data requested is not available (where
> > 'all' can be less than the size requested if the file has
> > been truncated in the meantime).
> 
> That is easily implementable, but I can see that for example web apps
> would be happy to get as much as possible.  So if Samba can be ok
> with short reads and only detecting the truncated case in the slow
> path that would make life simpler.  Otherwise we might indeed need two
> flags.

Simpler is better. I can live with the partial read+fallback.

Jeremy.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30  7:40         ` Christoph Hellwig
  (?)
@ 2015-03-30 18:54         ` Andrew Morton
  2015-03-30 22:40           ` Milosz Tanski
  -1 siblings, 1 reply; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 18:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Milosz Tanski, LKML, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner

On Mon, 30 Mar 2015 00:40:20 -0700 Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Mar 27, 2015 at 10:04:11AM -0700, Andrew Morton wrote:
> > mm...  I don't think we should be adding placeholders to the kernel API
> > to support code which hasn't been written, tested, reviewed, merged,
> > etc.  It's possible none of this will ever happen and we end up with a
> > syscall nobody needs or uses.  Plus it's always possible that during
> > this development we decide the pwrite2() interface needs alteration but
> > it's too late.
> > 
> > What would be the downside of deferring pwrite2() until it's all
> > implemented?
> 
> It _is_ implemented.  I just decided to submit it separately as Miklos
> already has to deal with enough bikeshedding for his feature that I
> don't want to put the burden of dealing with the BS for the one I wrote
> on him.

afacit the only difference between this pwritev2() and the existing
pwritev() is that pwritev2() interprets pos==-1 as "current position",
which duplicates writev()?

Unless I've missed something, there's no point in merging this
pwritev2() and it would be better to separate this syscall out into a
pwritev2() patchset which can be considered and merged separately.  For
the reasons described above.


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30  7:36             ` Christoph Hellwig
@ 2015-03-30 20:26                 ` Andrew Morton
  2015-03-30 20:26                 ` Andrew Morton
  2015-03-30 23:09               ` Milosz Tanski
  2 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 20:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 30 Mar 2015 00:36:04 -0700 Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
> > The problem with the above is that we can't tell the difference
> > between pread2() returning a short read because the pages are not
> > in cache, or because someone truncated the file. So we need some
> > way to differentiate this.
> 
> Is a race vs truncate really that time critical that you can't
> wait for the thread pool to do the second read to notice it?
> 
> > My preference from userspace would be for pread2() to return
> > EAGAIN if *all* the data requested is not available (where
> > 'all' can be less than the size requested if the file has
> > been truncated in the meantime).
> 
> That is easily implementable, but I can see that for example web apps
> would be happy to get as much as possible.  So if Samba can be ok
> with short reads and only detecting the truncated case in the slow
> path that would make life simpler.  Otherwise we might indeed need two
> flags.

The problem is that many applications (including samba!) want
all-or-nothing behaviour, and preadv2() cannot provide it.  By the time
preadv2() discovers a not-present page, it has already copied bulk data
out to userspace.

To fix this, preadv2() would need to take two passes across the pages,
pinning them in between and somehow blocking out truncate.  That's a
big change.

With the current preadv2(), applications would have to do

	nr_read = preadv2(..., offset, len, ...);
	if (nr_read == len)
		process data;
	else
		punt(offset + nr_read, len - nr_read);

and the worker thread will later have to splice together the initial
data and the later-arriving data, probably on another CPU, probably
after the initial data has gone cache-cold.

A cleaner solution is

	if (fincore(fd, NULL, offset, len) == len) {
		preadv(..., offset, len);
		process data;
	} else {
		punt(offset, len);
	}

This way all the data gets copied in a single hit and is cache-hot when
userspace processes it.

Comparing fincore()+pread() to preadv2():

pros:

a) fincore() may be used to provide both all-or-nothing and
   part-read-ok behaviour cleanly and with optimum cache behaviour.

b) fincore() doesn't add overhead, complexity and stack depth to
   core pagecache read() code.  Nor does it expand VFS data structures.

c) with a non-NULL second argument, fincore provides the
   mincore()-style page map.

cons:

d) fincore() is more expensive

e) fincore() will very occasionally block


Tradeoffs are involved.  To decide on the best path we should examine
d).  I expect that the overhead will be significant for small reads but
not significant for medium and large reads.  Needs quantifying.

And I don't believe that e) will be a problem in the real world.  It's
a significant increase in worst-case latency and a negligible increase
in average latency.  I've asked at least three times for someone to
explain why this is unacceptable and no explanation has been provided.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 20:26                 ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 20:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 30 Mar 2015 00:36:04 -0700 Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
> > The problem with the above is that we can't tell the difference
> > between pread2() returning a short read because the pages are not
> > in cache, or because someone truncated the file. So we need some
> > way to differentiate this.
> 
> Is a race vs truncate really that time critical that you can't
> wait for the thread pool to do the second read to notice it?
> 
> > My preference from userspace would be for pread2() to return
> > EAGAIN if *all* the data requested is not available (where
> > 'all' can be less than the size requested if the file has
> > been truncated in the meantime).
> 
> That is easily implementable, but I can see that for example web apps
> would be happy to get as much as possible.  So if Samba can be ok
> with short reads and only detecting the truncated case in the slow
> path that would make life simpler.  Otherwise we might indeed need two
> flags.

The problem is that many applications (including samba!) want
all-or-nothing behaviour, and preadv2() cannot provide it.  By the time
preadv2() discovers a not-present page, it has already copied bulk data
out to userspace.

To fix this, preadv2() would need to take two passes across the pages,
pinning them in between and somehow blocking out truncate.  That's a
big change.

With the current preadv2(), applications would have to do

	nr_read = preadv2(..., offset, len, ...);
	if (nr_read == len)
		process data;
	else
		punt(offset + nr_read, len - nr_read);

and the worker thread will later have to splice together the initial
data and the later-arriving data, probably on another CPU, probably
after the initial data has gone cache-cold.

A cleaner solution is

	if (fincore(fd, NULL, offset, len) == len) {
		preadv(..., offset, len);
		process data;
	} else {
		punt(offset, len);
	}

This way all the data gets copied in a single hit and is cache-hot when
userspace processes it.

Comparing fincore()+pread() to preadv2():

pros:

a) fincore() may be used to provide both all-or-nothing and
   part-read-ok behaviour cleanly and with optimum cache behaviour.

b) fincore() doesn't add overhead, complexity and stack depth to
   core pagecache read() code.  Nor does it expand VFS data structures.

c) with a non-NULL second argument, fincore provides the
   mincore()-style page map.

cons:

d) fincore() is more expensive

e) fincore() will very occasionally block


Tradeoffs are involved.  To decide on the best path we should examine
d).  I expect that the overhead will be significant for small reads but
not significant for medium and large reads.  Needs quantifying.

And I don't believe that e) will be a problem in the real world.  It's
a significant increase in worst-case latency and a negligible increase
in average latency.  I've asked at least three times for someone to
explain why this is unacceptable and no explanation has been provided.

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <a href=mailto:"aart@kvack.org">aart@kvack.org</a>

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:26                 ` Andrew Morton
  (?)
@ 2015-03-30 20:32                 ` Jeremy Allison
  2015-03-30 20:37                   ` Andrew Morton
  2015-03-30 22:49                   ` Milosz Tanski
  -1 siblings, 2 replies; 94+ messages in thread
From: Jeremy Allison @ 2015-03-30 20:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jeremy Allison, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 01:26:25PM -0700, Andrew Morton wrote:
> 
> cons:
> 
> d) fincore() is more expensive
> 
> e) fincore() will very occasionally block

The above is the killer for Samba. If fincore
returns true but when we schedule the pread
we block, we're hosed.

Once we block, we're done serving clients on the main
thread until this returns. That can cause unpredictable
response times which can cause client timeouts.

A fincore+pread solution that blocks is simply unsafe
to use for us. We'll have to stay with the threadpool :-(.

> And I don't believe that e) will be a problem in the real world.  It's
> a significant increase in worst-case latency and a negligible increase
> in average latency.  I've asked at least three times for someone to
> explain why this is unacceptable and no explanation has been provided.

See above.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:32                 ` Jeremy Allison
@ 2015-03-30 20:37                   ` Andrew Morton
  2015-03-30 20:49                     ` Jeremy Allison
  2015-03-30 22:35                     ` Milosz Tanski
  2015-03-30 22:49                   ` Milosz Tanski
  1 sibling, 2 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 20:37 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 30 Mar 2015 13:32:27 -0700 Jeremy Allison <jra@samba.org> wrote:

> On Mon, Mar 30, 2015 at 01:26:25PM -0700, Andrew Morton wrote:
> > 
> > cons:
> > 
> > d) fincore() is more expensive
> > 
> > e) fincore() will very occasionally block
> 
> The above is the killer for Samba. If fincore
> returns true but when we schedule the pread
> we block, we're hosed.
> 
> Once we block, we're done serving clients on the main
> thread until this returns. That can cause unpredictable
> response times which can cause client timeouts.
> 
> A fincore+pread solution that blocks is simply unsafe
> to use for us. We'll have to stay with the threadpool :-(.

Finally.  Thanks ;)

This implies that the samba main thread also has to avoid any memory
allocations both direct and within syscall and pagefault - those will
occasionally exhibit similar worse-case latency. Is this done now?



^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:37                   ` Andrew Morton
@ 2015-03-30 20:49                     ` Jeremy Allison
  2015-03-30 21:33                       ` Andrew Morton
  2015-03-30 22:35                     ` Milosz Tanski
  1 sibling, 1 reply; 94+ messages in thread
From: Jeremy Allison @ 2015-03-30 20:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 01:37:58PM -0700, Andrew Morton wrote:
> On Mon, 30 Mar 2015 13:32:27 -0700 Jeremy Allison <jra@samba.org> wrote:
> 
> > On Mon, Mar 30, 2015 at 01:26:25PM -0700, Andrew Morton wrote:
> > > 
> > > cons:
> > > 
> > > d) fincore() is more expensive
> > > 
> > > e) fincore() will very occasionally block
> > 
> > The above is the killer for Samba. If fincore
> > returns true but when we schedule the pread
> > we block, we're hosed.
> > 
> > Once we block, we're done serving clients on the main
> > thread until this returns. That can cause unpredictable
> > response times which can cause client timeouts.
> > 
> > A fincore+pread solution that blocks is simply unsafe
> > to use for us. We'll have to stay with the threadpool :-(.
> 
> Finally.  Thanks ;)
> 
> This implies that the samba main thread also has to avoid any memory
> allocations both direct and within syscall and pagefault - those will
> occasionally exhibit similar worse-case latency. Is this done now?

We don't do anything special around allocations in syscall.
For aio read we do talloc (internal memory allocator) the
return chunk before going into the pthread pread, so I
suppose this could block. Haven't seen this as a reported
problem though. I suppose you can say "well exactly the
same thing is true of fincore()" :-).

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:49                     ` Jeremy Allison
@ 2015-03-30 21:33                       ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 21:33 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Milosz Tanski, linux-kernel, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, linux-api, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 30 Mar 2015 13:49:37 -0700 Jeremy Allison <jra@samba.org> wrote:

> > This implies that the samba main thread also has to avoid any memory
> > allocations both direct and within syscall and pagefault - those will
> > occasionally exhibit similar worse-case latency. Is this done now?
> 
> We don't do anything special around allocations in syscall.
> For aio read we do talloc (internal memory allocator) the
> return chunk before going into the pthread pread, so I
> suppose this could block. Haven't seen this as a reported
> problem though. I suppose you can say "well exactly the
> same thing is true of fincore()" :-).

yup.  If we tickle the page's referenced bit in fincore() then the race
will only happen under the most withering memory loads, and it sounds
like the main thread will be suffering allocation stalls before that
point anyway.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:37                   ` Andrew Morton
  2015-03-30 20:49                     ` Jeremy Allison
@ 2015-03-30 22:35                     ` Milosz Tanski
  1 sibling, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 22:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 4:37 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 30 Mar 2015 13:32:27 -0700 Jeremy Allison <jra@samba.org> wrote:
>
>> On Mon, Mar 30, 2015 at 01:26:25PM -0700, Andrew Morton wrote:
>> >
>> > cons:
>> >
>> > d) fincore() is more expensive
>> >
>> > e) fincore() will very occasionally block
>>
>> The above is the killer for Samba. If fincore
>> returns true but when we schedule the pread
>> we block, we're hosed.
>>
>> Once we block, we're done serving clients on the main
>> thread until this returns. That can cause unpredictable
>> response times which can cause client timeouts.
>>
>> A fincore+pread solution that blocks is simply unsafe
>> to use for us. We'll have to stay with the threadpool :-(.
>
> Finally.  Thanks ;)
>
> This implies that the samba main thread also has to avoid any memory
> allocations both direct and within syscall and pagefault - those will
> occasionally exhibit similar worse-case latency. Is this done now?
>
>

It's entirely possible to have an application with a low / semi static
working set, and leave lots of free memory for the kernel especially
for the page cache. For example the Google want to minimize malloc().
So in tcmalloc() they grab large chunks and rarely release it to back
to the OS, in fact old version never shrank it. So you can entirely
avoid stalls in malloc() for many workloads.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 18:54         ` Andrew Morton
@ 2015-03-30 22:40           ` Milosz Tanski
  2015-03-30 22:50               ` Andrew Morton
  0 siblings, 1 reply; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, LKML, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 2:54 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 30 Mar 2015 00:40:20 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Fri, Mar 27, 2015 at 10:04:11AM -0700, Andrew Morton wrote:
>> > mm...  I don't think we should be adding placeholders to the kernel API
>> > to support code which hasn't been written, tested, reviewed, merged,
>> > etc.  It's possible none of this will ever happen and we end up with a
>> > syscall nobody needs or uses.  Plus it's always possible that during
>> > this development we decide the pwrite2() interface needs alteration but
>> > it's too late.
>> >
>> > What would be the downside of deferring pwrite2() until it's all
>> > implemented?
>>
>> It _is_ implemented.  I just decided to submit it separately as Miklos
>> already has to deal with enough bikeshedding for his feature that I
>> don't want to put the burden of dealing with the BS for the one I wrote
>> on him.
>
> afacit the only difference between this pwritev2() and the existing
> pwritev() is that pwritev2() interprets pos==-1 as "current position",
> which duplicates writev()?
>
> Unless I've missed something, there's no point in merging this
> pwritev2() and it would be better to separate this syscall out into a
> pwritev2() patchset which can be considered and merged separately.  For
> the reasons described above.
>

At the LSF/MM session, the agreement form the active participants
(James Bottomley, Ted Tso, Christoph, and I forget the last guy's
name) that we should ship both syscalls in the first patch. Personally
I don't care, but you're the only voice against it.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:32                 ` Jeremy Allison
  2015-03-30 20:37                   ` Andrew Morton
@ 2015-03-30 22:49                   ` Milosz Tanski
  2015-03-30 22:57                     ` Andrew Morton
  1 sibling, 1 reply; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 22:49 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Andrew Morton, Christoph Hellwig, LKML, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 4:32 PM, Jeremy Allison <jra@samba.org> wrote:
> On Mon, Mar 30, 2015 at 01:26:25PM -0700, Andrew Morton wrote:
>>
>> cons:
>>
>> d) fincore() is more expensive
>>
>> e) fincore() will very occasionally block
>
> The above is the killer for Samba. If fincore
> returns true but when we schedule the pread
> we block, we're hosed.
>
> Once we block, we're done serving clients on the main
> thread until this returns. That can cause unpredictable
> response times which can cause client timeouts.
>
> A fincore+pread solution that blocks is simply unsafe
> to use for us. We'll have to stay with the threadpool :-(.

We're getting data from a network filesystem Ceph in our case, but it
could be pNFS. In many cases those filesystems have some kind
hierarchy and it's not uncommon for us to se requests that take 20 to
25 milliseconds to complete. In this case the miss becomes very
expensive. And it's not just that one requests experiences the slow
down all the request being serviced by that (single) epoll thread
experience head-of-line blocking because of one stalled request.

10K request a second is a common load for many web services / video
servers servings chunks of data. If we experience one miss a second,
that 25 million stall will impact 250 other requests (all of them will
have a 25ms latency tacked on).

>
>> And I don't believe that e) will be a problem in the real world.  It's
>> a significant increase in worst-case latency and a negligible increase
>> in average latency.  I've asked at least three times for someone to
>> explain why this is unacceptable and no explanation has been provided.
>
> See above.



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 22:50               ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 22:50 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Christoph Hellwig, LKML, linux-fsdevel, linux-aio, Mel Gorman,
	Volker Lendecke, Tejun Heo, Jeff Moyer, Theodore Ts'o,
	Al Viro, Linux API, Michael Kerrisk, linux-arch, Dave Chinner

On Mon, 30 Mar 2015 18:40:16 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> On Mon, Mar 30, 2015 at 2:54 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> > On Mon, 30 Mar 2015 00:40:20 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> >
> >> On Fri, Mar 27, 2015 at 10:04:11AM -0700, Andrew Morton wrote:
> >> > mm...  I don't think we should be adding placeholders to the kernel API
> >> > to support code which hasn't been written, tested, reviewed, merged,
> >> > etc.  It's possible none of this will ever happen and we end up with a
> >> > syscall nobody needs or uses.  Plus it's always possible that during
> >> > this development we decide the pwrite2() interface needs alteration but
> >> > it's too late.
> >> >
> >> > What would be the downside of deferring pwrite2() until it's all
> >> > implemented?
> >>
> >> It _is_ implemented.  I just decided to submit it separately as Miklos
> >> already has to deal with enough bikeshedding for his feature that I
> >> don't want to put the burden of dealing with the BS for the one I wrote
> >> on him.
> >
> > afacit the only difference between this pwritev2() and the existing
> > pwritev() is that pwritev2() interprets pos==-1 as "current position",
> > which duplicates writev()?
> >
> > Unless I've missed something, there's no point in merging this
> > pwritev2() and it would be better to separate this syscall out into a
> > pwritev2() patchset which can be considered and merged separately.  For
> > the reasons described above.
> >
> 
> At the LSF/MM session, the agreement form the active participants
> (James Bottomley, Ted Tso, Christoph, and I forget the last guy's
> name) that we should ship both syscalls in the first patch.

I was over in the mm session and probably wouldn't have objected either
because because you can't sit down, think, carefully inspect code and
evaluate arguments in such a context.

I've explained my reasoning.  If there's something wrong with that
reasoning or if there are contradictory reasons which I'm not aware of
then let's hear them!

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 22:50               ` Andrew Morton
  0 siblings, 0 replies; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 22:50 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Christoph Hellwig, LKML, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Mon, 30 Mar 2015 18:40:16 -0400 Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> wrote:

> On Mon, Mar 30, 2015 at 2:54 PM, Andrew Morton
> <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > On Mon, 30 Mar 2015 00:40:20 -0700 Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org> wrote:
> >
> >> On Fri, Mar 27, 2015 at 10:04:11AM -0700, Andrew Morton wrote:
> >> > mm...  I don't think we should be adding placeholders to the kernel API
> >> > to support code which hasn't been written, tested, reviewed, merged,
> >> > etc.  It's possible none of this will ever happen and we end up with a
> >> > syscall nobody needs or uses.  Plus it's always possible that during
> >> > this development we decide the pwrite2() interface needs alteration but
> >> > it's too late.
> >> >
> >> > What would be the downside of deferring pwrite2() until it's all
> >> > implemented?
> >>
> >> It _is_ implemented.  I just decided to submit it separately as Miklos
> >> already has to deal with enough bikeshedding for his feature that I
> >> don't want to put the burden of dealing with the BS for the one I wrote
> >> on him.
> >
> > afacit the only difference between this pwritev2() and the existing
> > pwritev() is that pwritev2() interprets pos==-1 as "current position",
> > which duplicates writev()?
> >
> > Unless I've missed something, there's no point in merging this
> > pwritev2() and it would be better to separate this syscall out into a
> > pwritev2() patchset which can be considered and merged separately.  For
> > the reasons described above.
> >
> 
> At the LSF/MM session, the agreement form the active participants
> (James Bottomley, Ted Tso, Christoph, and I forget the last guy's
> name) that we should ship both syscalls in the first patch.

I was over in the mm session and probably wouldn't have objected either
because because you can't sit down, think, carefully inspect code and
evaluate arguments in such a context.

I've explained my reasoning.  If there's something wrong with that
reasoning or if there are contradictory reasons which I'm not aware of
then let's hear them!

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 17:19                 ` Jeremy Allison
  (?)
@ 2015-03-30 22:51                 ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 22:51 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Christoph Hellwig, Andrew Morton, LKML, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 1:19 PM, Jeremy Allison <jra@samba.org> wrote:
> On Mon, Mar 30, 2015 at 12:36:04AM -0700, Christoph Hellwig wrote:
>> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
>> > The problem with the above is that we can't tell the difference
>> > between pread2() returning a short read because the pages are not
>> > in cache, or because someone truncated the file. So we need some
>> > way to differentiate this.
>>
>> Is a race vs truncate really that time critical that you can't
>> wait for the thread pool to do the second read to notice it?
>
> Probably not, as this is the fallback path anyway.
>
>> > My preference from userspace would be for pread2() to return
>> > EAGAIN if *all* the data requested is not available (where
>> > 'all' can be less than the size requested if the file has
>> > been truncated in the meantime).
>>
>> That is easily implementable, but I can see that for example web apps
>> would be happy to get as much as possible.  So if Samba can be ok
>> with short reads and only detecting the truncated case in the slow
>> path that would make life simpler.  Otherwise we might indeed need two
>> flags.
>
> Simpler is better. I can live with the partial read+fallback.
>
> Jeremy.

The partial behavior is very useful for protocols like HTTP1, since
the client can start processing the response if we send it down the
wire while we process other connections. It becomes even more useful
over HTTP2 which provides it's own framing where we can send a partial
response frame and move onto other requests in this connection or
other connections.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 22:49                   ` Milosz Tanski
@ 2015-03-30 22:57                     ` Andrew Morton
  2015-03-30 23:06                         ` Milosz Tanski
  0 siblings, 1 reply; 94+ messages in thread
From: Andrew Morton @ 2015-03-30 22:57 UTC (permalink / raw)
  To: Milosz Tanski
  Cc: Jeremy Allison, Christoph Hellwig, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, 30 Mar 2015 18:49:06 -0400 Milosz Tanski <milosz@adfin.com> wrote:

> > A fincore+pread solution that blocks is simply unsafe
> > to use for us. We'll have to stay with the threadpool :-(.
> 
> We're getting data from a network filesystem Ceph in our case, but it
> could be pNFS. In many cases those filesystems have some kind
> hierarchy and it's not uncommon for us to se requests that take 20 to
> 25 milliseconds to complete. In this case the miss becomes very
> expensive. And it's not just that one requests experiences the slow
> down all the request being serviced by that (single) epoll thread
> experience head-of-line blocking because of one stalled request.
> 
> 10K request a second is a common load for many web services / video
> servers servings chunks of data. If we experience one miss a second,
> that 25 million stall will impact 250 other requests (all of them will
> have a 25ms latency tacked on).

I'd expect a fincore() which doesn't do SetPageReferenced() to be
orders of magnitude better than this.  A fincore() which does use
SetPageReferenced() will be in the "basically never happens" region -
it would take massive and artificial memory stress to trigger.

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 23:06                         ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 23:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 6:57 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 30 Mar 2015 18:49:06 -0400 Milosz Tanski <milosz@adfin.com> wrote:
>
>> > A fincore+pread solution that blocks is simply unsafe
>> > to use for us. We'll have to stay with the threadpool :-(.
>>
>> We're getting data from a network filesystem Ceph in our case, but it
>> could be pNFS. In many cases those filesystems have some kind
>> hierarchy and it's not uncommon for us to se requests that take 20 to
>> 25 milliseconds to complete. In this case the miss becomes very
>> expensive. And it's not just that one requests experiences the slow
>> down all the request being serviced by that (single) epoll thread
>> experience head-of-line blocking because of one stalled request.
>>
>> 10K request a second is a common load for many web services / video
>> servers servings chunks of data. If we experience one miss a second,
>> that 25 million stall will impact 250 other requests (all of them will
>> have a 25ms latency tacked on).
>
> I'd expect a fincore() which doesn't do SetPageReferenced() to be
> orders of magnitude better than this.  A fincore() which does use
> SetPageReferenced() will be in the "basically never happens" region -
> it would take massive and artificial memory stress to trigger.

I'm just responding to the upper bound you put out in an email a few
back of 0.0001% miss. And, people run web caches (like Apache Traffic
Server) at much higher rates than that.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-03-30 23:06                         ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 23:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, LKML,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Mon, Mar 30, 2015 at 6:57 PM, Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> On Mon, 30 Mar 2015 18:49:06 -0400 Milosz Tanski <milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org> wrote:
>
>> > A fincore+pread solution that blocks is simply unsafe
>> > to use for us. We'll have to stay with the threadpool :-(.
>>
>> We're getting data from a network filesystem Ceph in our case, but it
>> could be pNFS. In many cases those filesystems have some kind
>> hierarchy and it's not uncommon for us to se requests that take 20 to
>> 25 milliseconds to complete. In this case the miss becomes very
>> expensive. And it's not just that one requests experiences the slow
>> down all the request being serviced by that (single) epoll thread
>> experience head-of-line blocking because of one stalled request.
>>
>> 10K request a second is a common load for many web services / video
>> servers servings chunks of data. If we experience one miss a second,
>> that 25 million stall will impact 250 other requests (all of them will
>> have a 25ms latency tacked on).
>
> I'd expect a fincore() which doesn't do SetPageReferenced() to be
> orders of magnitude better than this.  A fincore() which does use
> SetPageReferenced() will be in the "basically never happens" region -
> it would take massive and artificial memory stress to trigger.

I'm just responding to the upper bound you put out in an email a few
back of 0.0001% miss. And, people run web caches (like Apache Traffic
Server) at much higher rates than that.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30  7:36             ` Christoph Hellwig
  2015-03-30 17:19                 ` Jeremy Allison
  2015-03-30 20:26                 ` Andrew Morton
@ 2015-03-30 23:09               ` Milosz Tanski
  2 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 23:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, Andrew Morton, LKML, linux-fsdevel, linux-aio,
	Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 3:36 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
>> The problem with the above is that we can't tell the difference
>> between pread2() returning a short read because the pages are not
>> in cache, or because someone truncated the file. So we need some
>> way to differentiate this.
>
> Is a race vs truncate really that time critical that you can't
> wait for the thread pool to do the second read to notice it?
>
>> My preference from userspace would be for pread2() to return
>> EAGAIN if *all* the data requested is not available (where
>> 'all' can be less than the size requested if the file has
>> been truncated in the meantime).
>
> That is easily implementable, but I can see that for example web apps
> would be happy to get as much as possible.  So if Samba can be ok
> with short reads and only detecting the truncated case in the slow
> path that would make life simpler.  Otherwise we might indeed need two
> flags.

I'm okay with an old or nothing flag. Although I think that would much
more useful with RWF_NONWAIT with pwritev, in applications that don't
want to block while logging (but it's okay to drop low level log
messages). That's a whole different use case in my mind.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:26                 ` Andrew Morton
  (?)
  (?)
@ 2015-03-30 23:25                 ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-30 23:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jeremy Allison, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Mon, Mar 30, 2015 at 4:26 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 30 Mar 2015 00:36:04 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Fri, Mar 27, 2015 at 08:58:54AM -0700, Jeremy Allison wrote:
>> > The problem with the above is that we can't tell the difference
>> > between pread2() returning a short read because the pages are not
>> > in cache, or because someone truncated the file. So we need some
>> > way to differentiate this.
>>
>> Is a race vs truncate really that time critical that you can't
>> wait for the thread pool to do the second read to notice it?
>>
>> > My preference from userspace would be for pread2() to return
>> > EAGAIN if *all* the data requested is not available (where
>> > 'all' can be less than the size requested if the file has
>> > been truncated in the meantime).
>>
>> That is easily implementable, but I can see that for example web apps
>> would be happy to get as much as possible.  So if Samba can be ok
>> with short reads and only detecting the truncated case in the slow
>> path that would make life simpler.  Otherwise we might indeed need two
>> flags.
>
> The problem is that many applications (including samba!) want
> all-or-nothing behaviour, and preadv2() cannot provide it.  By the time
> preadv2() discovers a not-present page, it has already copied bulk data
> out to userspace.
>
> To fix this, preadv2() would need to take two passes across the pages,
> pinning them in between and somehow blocking out truncate.  That's a
> big change.
>
> With the current preadv2(), applications would have to do
>
>         nr_read = preadv2(..., offset, len, ...);
>         if (nr_read == len)
>                 process data;
>         else
>                 punt(offset + nr_read, len - nr_read);
>
> and the worker thread will later have to splice together the initial
> data and the later-arriving data, probably on another CPU, probably
> after the initial data has gone cache-cold.
>
> A cleaner solution is
>
>         if (fincore(fd, NULL, offset, len) == len) {
>                 preadv(..., offset, len);
>                 process data;
>         } else {
>                 punt(offset, len);
>         }
>
> This way all the data gets copied in a single hit and is cache-hot when
> userspace processes it.
>
> Comparing fincore()+pread() to preadv2():
>
> pros:
>
> a) fincore() may be used to provide both all-or-nothing and
>    part-read-ok behaviour cleanly and with optimum cache behaviour.
>
> b) fincore() doesn't add overhead, complexity and stack depth to
>    core pagecache read() code.  Nor does it expand VFS data structures.

Actually, we're not expanding any VFS structures with the next
patchset. I've rebased the forthcoming patchset ontop of Al's
vfs/linux-next tree to keep track of the refactoring already done with
some of the code paths I touched. The refactoring work done there
already ads a flag argument to kiocb struct for other reasons.

>
> c) with a non-NULL second argument, fincore provides the
>    mincore()-style page map.
>
> cons:
>
> d) fincore() is more expensive
>
> e) fincore() will very occasionally block
>
>
> Tradeoffs are involved.  To decide on the best path we should examine
> d).  I expect that the overhead will be significant for small reads but
> not significant for medium and large reads.  Needs quantifying.
>
> And I don't believe that e) will be a problem in the real world.  It's
> a significant increase in worst-case latency and a negligible increase
> in average latency.  I've asked at least three times for someone to
> explain why this is unacceptable and no explanation has been provided.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-27 16:30               ` Andrew Morton
                                 ` (5 preceding siblings ...)
  (?)
@ 2015-03-31  1:27               ` Milosz Tanski
  -1 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-03-31  1:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jeremy Allison, Christoph Hellwig, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Mar 27, 2015 at 12:30 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Fri, 27 Mar 2015 08:58:54 -0700 Jeremy Allison <jra@samba.org> wrote:
>
>> On Fri, Mar 27, 2015 at 02:01:59AM -0700, Andrew Morton wrote:
>> > On Fri, 27 Mar 2015 01:48:33 -0700 Christoph Hellwig <hch@infradead.org> wrote:
>> >
>> > > On Fri, Mar 27, 2015 at 01:35:16AM -0700, Andrew Morton wrote:
>> > > > fincore() doesn't have to be ugly.  Please address the design issues I
>> > > > raised.  How is pread2() useful to the class of applications which
>> > > > cannot proceed until all data is available?
>> > >
>> > > It actually makes them work correctly?  preadv2( ..., DONTWAIT) will
>> > > return -EGAIN, which causes them to bounce to the threadpool where
>> > > they call preadv(...).
>> >
>> > (I assume you mean RWF_NONBLOCK)
>> >
>> > That isn't how pread2() works.  If the leading one or more pages are
>> > uptodate, pread2() will return a partial read.  Now what?  Either the
>> > application reads the same data a second time via the worker thread
>> > (dumb, but it will usually be a rare case)
>>
>> The problem with the above is that we can't tell the difference
>> between pread2() returning a short read because the pages are not
>> in cache, or because someone truncated the file. So we need some
>> way to differentiate this.
>>
>> My preference from userspace would be for pread2() to return
>> EAGAIN if *all* the data requested is not available (where
>> 'all' can be less than the size requested if the file has
>> been truncated in the meantime).
>>
>> ...
>>
>> The thing I want to avoid is the case where
>> ret < size_wanted means only part of the file
>> is in cache.
>
> From my reading of the code, pread2() will return -EAGAIN only when it
> copied zero bytes to userspace.  ie, the very first page wasn't in
> cache.  If pread2() does copy some data to userspace then it will
> return the amount of data copied.  This is traditional read()
> behaviour.
>
> Maybe there's some other code somewhere in the patch which converts
> that short read into -EAGAIN, dunno - the changelogs don't appear to
> mention it and the manpage update is ambiguous about this.
>
> But from an interface perspective the behaviour you're asking for is
> insane, frankly - if the kernel copied out 8k of data then pread2()
> should return 8k.  Otherwise there's no way for userspace to know that
> the 8k copy actually happened and we have just wasted a great pile of
> CPU doing a pointless memcpy.
>
> I expect that this situation (first part in cache, latter part not in
> cache) is rare - for reasonably small requests the common cases will be
> "all cached" and "nothing cached".  So perhaps the best approach here
> is for samba to add special handling for the short read, to work out
> the reason for its occurrence.
>
> Alternatively we could add another flag to pread2() to select this
> "throw away my data and return -EAGAIN" behaviour.  Presumably
> implemented with an i_size check, but it's gonna be racy.
>
>
>
> I take it from your comments that nobody has actually wired up pread2()
> into samba yet?  That's a bit disturbing, because if we later want to
> go and change something like this short-read behaviour, we're screwed -
> it's a non back-compat userspace-visible change.
>
>
> And a note on cosmetics: why are we using EAGAIN here rather than
> EWOULDBLOCK?  They have the same numerical value, but EWOULDBLOCK is a
> better name - EAGAIN says "run it again", but that won't work.
>

Per definition EWOULDBLOCK seams like a better fit. Like you said
above it won't stop blocking unless you do something. I also did a
search in the kernel source (excluding drivers / sound directories)
use of EAGAIN (even in network code) is like 2 magnitudes bigger then
EWOULDBLOCK. In fact some places that grep found check for both
(although I'm sure it's optimized out).

Does anybody feel strongly about it being EWOULDBLOCK instead of
EAGAIN? Esp. since they are same on Linux? The convention (by numbers)
seams to favor EAGAIN.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
  2015-03-30 20:26                 ` Andrew Morton
                                   ` (2 preceding siblings ...)
  (?)
@ 2015-04-04  3:42                 ` Andrew Morton
  2015-04-06  3:53                     ` Milosz Tanski
  -1 siblings, 1 reply; 94+ messages in thread
From: Andrew Morton @ 2015-04-04  3:42 UTC (permalink / raw)
  To: Christoph Hellwig, Jeremy Allison, Milosz Tanski, linux-kernel,
	linux-fsdevel, linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo,
	Jeff Moyer, Theodore Ts'o, Al Viro, linux-api,
	Michael Kerrisk, linux-arch, Dave Chinner

On Mon, 30 Mar 2015 13:26:25 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> d) fincore() is more expensive

Actually, I kinda take that back.  fincore() will be faster than
preadv2() in the case of a pagecache miss, and slower in the case of a
pagecache hit.

The breakpoint appears to be a hit rate of 30% - if fewer than 30% of
queries find the page in pagecache, fincore() will be faster than
preadv2().

This is because for a pagecache miss, fincore() will be about twice as
fast as preadv2().  For a pagecache hit, fincore()+pread() is 55%
slower than preadv2().  If there are lots of misses, fincore() is
faster overall.




Minimal fincore() implementation is below.  It doesn't implement the
page_map!=NULL mode at all and will be slow for large areas - it needs
to be taught about radix_tree_for_each_*().  But it's good enough for
testing.  

On a slow machine, in nanoseconds:

null syscall:		528
fincore (miss):		674
fincore (hit):		729
single byte pread:	1026
single byte preadv:	1134

pread() is a bit faster than preadv() and samba uses pread(), so the
implementations are:

	if (fincore(fd, NULL, offset, len) == len)
		pread();
	else
		punt();

	if (preadv2(fd, ..., offset, len) == len)
		...
	else
		punt();

fincore+pread, pagecache-hit:	1755ns
fincore+pread, pagecache-miss:	674ns
preadv():			1134ns (preadv2() will be a little faster for misses)



Now, a pagecache hit rate of 30% sounds high so one would think that
fincore+pread is clearly ahead.  But the pagecache hit rate in this
code will actually be quite high, because of readahead.

For a large linear read of a file which is perfectly laid out on disk
and is fully *uncached*, the hit rates will be as good as 99.8%,
because readahead is bringing in data in 2MB blobs.

In practice I expect that fincore()+pread() will be slower for linear
reads of medium to large files and faster for small files and seeky
accesses.

How much does all this matter?  Not much.  On a fast machine a
single-byte pread() takes 240ns.  So if your server thread is handling
25000 requests/sec, we're only talking 0.6% overhead.

Note that we can trivially monitor the hit rate with either preadv2()
or fincore()+pread(): just count how many times all the data is there
versus how many times it isn't.



Also, note that we can use *both* fincore() and preadv2() to detect the
problematic page-just-disappeared race:

	if (fincore(fd, NULL, offset, len) == len) {
		if (preadv2(fd, offset, len) != len)
			printf("race just happened");

It would be great if someone could apply the below, modify the
preadv2() callsite as above and determine under what conditions (if
any) the page-stealing race occurs.



 arch/x86/syscalls/syscall_64.tbl |    1 
 include/linux/syscalls.h         |    2 
 mm/Makefile                      |    2 
 mm/fincore.c                     |   65 +++++++++++++++++++++++++++++
 4 files changed, 69 insertions(+), 1 deletion(-)

diff -puN arch/x86/syscalls/syscall_64.tbl~fincore arch/x86/syscalls/syscall_64.tbl
--- a/arch/x86/syscalls/syscall_64.tbl~fincore
+++ a/arch/x86/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
 322	64	execveat		stub_execveat
 323	64	preadv2			sys_preadv2
 324	64	pwritev2		sys_pwritev2
+325	common	fincore			sys_fincore
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff -puN include/linux/syscalls.h~fincore include/linux/syscalls.h
--- a/include/linux/syscalls.h~fincore
+++ a/include/linux/syscalls.h
@@ -880,6 +880,8 @@ asmlinkage long sys_process_vm_writev(pi
 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
 			 unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_fincore(int fd, unsigned char __user *page_map,
+			    loff_t offset, size_t len);
 asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 			    const char __user *uargs);
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
diff -puN mm/Makefile~fincore mm/Makefile
--- a/mm/Makefile~fincore
+++ a/mm/Makefile
@@ -19,7 +19,7 @@ obj-y			:= filemap.o mempool.o oom_kill.
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o vmacache.o fincore.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o $(mmu-y)
 
diff -puN /dev/null mm/fincore.c
--- /dev/null
+++ a/mm/fincore.c
@@ -0,0 +1,65 @@
+#include <linux/syscalls.h>
+#include <linux/pagemap.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/hugetlb.h>
+
+SYSCALL_DEFINE4(fincore, int, fd, unsigned char __user *, page_map,
+		loff_t, offset, size_t, len)
+{
+	struct fd f;
+	struct address_space *mapping;
+	loff_t cur_off;
+	loff_t end;
+	pgoff_t pgoff;
+	long ret = 0;
+
+	if (offset < 0 || (ssize_t)len <= 0)
+		return -EINVAL;
+
+	f = fdget(fd);
+
+	if (!f.file)
+		return -EBADF;
+
+	if (is_file_hugepages(f.file)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!S_ISREG(file_inode(f.file)->i_mode)) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	end = min_t(loff_t, offset + len, i_size_read(file_inode(f.file)));
+	pgoff = offset >> PAGE_CACHE_SHIFT;
+	mapping = f.file->f_mapping;
+
+	/*
+	 * We probably need to do somethnig here to reduce the chance of the
+	 * pages being reclaimed between fincore() and read().  eg,
+	 * SetPageReferenced(page) or mark_page_accessed(page) or
+	 * activate_page(page).
+	 */
+	for (cur_off = offset; cur_off < end ; ) {
+		struct page *page;
+		loff_t end_of_coverage;
+
+		page = find_get_page(mapping, pgoff);
+		if (!page || !PageUptodate(page))
+			break;
+		page_cache_release(page);
+
+		pgoff++;
+		end_of_coverage = min_t(loff_t, pgoff << PAGE_CACHE_SHIFT, end);
+		ret += end_of_coverage - cur_off;
+		cur_off = (cur_off + PAGE_CACHE_SIZE) & PAGE_CACHE_MASK;
+	}
+
+out:
+	fdput(f);
+	return ret;
+}
_


^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-04-06  3:53                     ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-04-06  3:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jeremy Allison, LKML, linux-fsdevel,
	linux-aio, Mel Gorman, Volker Lendecke, Tejun Heo, Jeff Moyer,
	Theodore Ts'o, Al Viro, Linux API, Michael Kerrisk,
	linux-arch, Dave Chinner

On Fri, Apr 3, 2015 at 11:42 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 30 Mar 2015 13:26:25 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>
>> d) fincore() is more expensive
>
> Actually, I kinda take that back.  fincore() will be faster than
> preadv2() in the case of a pagecache miss, and slower in the case of a
> pagecache hit.
>
> The breakpoint appears to be a hit rate of 30% - if fewer than 30% of
> queries find the page in pagecache, fincore() will be faster than
> preadv2().

In my application (motivation for this patch), web-serving
applications (familiar to me), and Samba I'm going to that the
majority of requests are going to be cached. Only some small
percentage will be uncached (say 20%). I'll add to that: a small
percentage but of a large number.

A lot of IO falls into zipfan / sequential pattern. And that makes
send to me a small number of frequently access files and large
streaming data (with read ahead).

>
> This is because for a pagecache miss, fincore() will be about twice as
> fast as preadv2().  For a pagecache hit, fincore()+pread() is 55%
> slower than preadv2().  If there are lots of misses, fincore() is
> faster overall.
>


>
>
>
> Minimal fincore() implementation is below.  It doesn't implement the
> page_map!=NULL mode at all and will be slow for large areas - it needs
> to be taught about radix_tree_for_each_*().  But it's good enough for
> testing.

I'm glad you took the time to do this. It's simple, but your
implementation is much cleaner then the last round of fincore() from 3
years back.

>
> On a slow machine, in nanoseconds:
>
> null syscall:           528
> fincore (miss):         674
> fincore (hit):          729
> single byte pread:      1026
> single byte preadv:     1134

I'm not surprised, fincore() doesn't have to go through all the vfs /
fs machinery that pread or preadv do. By chance if you compare pread /
preadv with a larger read (say 4k) is the difference negligible.

>
> pread() is a bit faster than preadv() and samba uses pread(), so the
> implementations are:
>
>         if (fincore(fd, NULL, offset, len) == len)
>                 pread();
>         else
>                 punt();
>
>         if (preadv2(fd, ..., offset, len) == len)
>                 ...
>         else
>                 punt();
>
> fincore+pread, pagecache-hit:   1755ns
> fincore+pread, pagecache-miss:  674ns
> preadv():                       1134ns (preadv2() will be a little faster for misses)
>
>
>
> Now, a pagecache hit rate of 30% sounds high so one would think that
> fincore+pread is clearly ahead.  But the pagecache hit rate in this
> code will actually be quite high, because of readahead.
>
> For a large linear read of a file which is perfectly laid out on disk
> and is fully *uncached*, the hit rates will be as good as 99.8%,
> because readahead is bringing in data in 2MB blobs.
>
> In practice I expect that fincore()+pread() will be slower for linear
> reads of medium to large files and faster for small files and seeky
> accesses.
>
> How much does all this matter?  Not much.  On a fast machine a
> single-byte pread() takes 240ns.  So if your server thread is handling
> 25000 requests/sec, we're only talking 0.6% overhead.
>
> Note that we can trivially monitor the hit rate with either preadv2()
> or fincore()+pread(): just count how many times all the data is there
> versus how many times it isn't.
>
>
>
> Also, note that we can use *both* fincore() and preadv2() to detect the
> problematic page-just-disappeared race:
>
>         if (fincore(fd, NULL, offset, len) == len) {
>                 if (preadv2(fd, offset, len) != len)
>                         printf("race just happened");
>
> It would be great if someone could apply the below, modify the
> preadv2() callsite as above and determine under what conditions (if
> any) the page-stealing race occurs.
>
>

Let me see what I can do.

>
>  arch/x86/syscalls/syscall_64.tbl |    1
>  include/linux/syscalls.h         |    2
>  mm/Makefile                      |    2
>  mm/fincore.c                     |   65 +++++++++++++++++++++++++++++
>  4 files changed, 69 insertions(+), 1 deletion(-)
>
> diff -puN arch/x86/syscalls/syscall_64.tbl~fincore arch/x86/syscalls/syscall_64.tbl
> --- a/arch/x86/syscalls/syscall_64.tbl~fincore
> +++ a/arch/x86/syscalls/syscall_64.tbl
> @@ -331,6 +331,7 @@
>  322    64      execveat                stub_execveat
>  323    64      preadv2                 sys_preadv2
>  324    64      pwritev2                sys_pwritev2
> +325    common  fincore                 sys_fincore
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff -puN include/linux/syscalls.h~fincore include/linux/syscalls.h
> --- a/include/linux/syscalls.h~fincore
> +++ a/include/linux/syscalls.h
> @@ -880,6 +880,8 @@ asmlinkage long sys_process_vm_writev(pi
>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>                          unsigned long idx1, unsigned long idx2);
>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_fincore(int fd, unsigned char __user *page_map,
> +                           loff_t offset, size_t len);
>  asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>                             const char __user *uargs);
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
> diff -puN mm/Makefile~fincore mm/Makefile
> --- a/mm/Makefile~fincore
> +++ a/mm/Makefile
> @@ -19,7 +19,7 @@ obj-y                 := filemap.o mempool.o oom_kill.
>                            readahead.o swap.o truncate.o vmscan.o shmem.o \
>                            util.o mmzone.o vmstat.o backing-dev.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
> -                          compaction.o vmacache.o \
> +                          compaction.o vmacache.o fincore.o \
>                            interval_tree.o list_lru.o workingset.o \
>                            debug.o $(mmu-y)
>
> diff -puN /dev/null mm/fincore.c
> --- /dev/null
> +++ a/mm/fincore.c
> @@ -0,0 +1,65 @@
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +SYSCALL_DEFINE4(fincore, int, fd, unsigned char __user *, page_map,
> +               loff_t, offset, size_t, len)
> +{
> +       struct fd f;
> +       struct address_space *mapping;
> +       loff_t cur_off;
> +       loff_t end;
> +       pgoff_t pgoff;
> +       long ret = 0;
> +
> +       if (offset < 0 || (ssize_t)len <= 0)
> +               return -EINVAL;
> +
> +       f = fdget(fd);
> +
> +       if (!f.file)
> +               return -EBADF;
> +
> +       if (is_file_hugepages(f.file)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (!S_ISREG(file_inode(f.file)->i_mode)) {
> +               ret = -EBADF;
> +               goto out;
> +       }
> +
> +       end = min_t(loff_t, offset + len, i_size_read(file_inode(f.file)));
> +       pgoff = offset >> PAGE_CACHE_SHIFT;
> +       mapping = f.file->f_mapping;
> +
> +       /*
> +        * We probably need to do somethnig here to reduce the chance of the
> +        * pages being reclaimed between fincore() and read().  eg,
> +        * SetPageReferenced(page) or mark_page_accessed(page) or
> +        * activate_page(page).
> +        */
> +       for (cur_off = offset; cur_off < end ; ) {
> +               struct page *page;
> +               loff_t end_of_coverage;
> +
> +               page = find_get_page(mapping, pgoff);
> +               if (!page || !PageUptodate(page))
> +                       break;
> +               page_cache_release(page);
> +
> +               pgoff++;
> +               end_of_coverage = min_t(loff_t, pgoff << PAGE_CACHE_SHIFT, end);
> +               ret += end_of_coverage - cur_off;
> +               cur_off = (cur_off + PAGE_CACHE_SIZE) & PAGE_CACHE_MASK;
> +       }
> +
> +out:
> +       fdput(f);
> +       return ret;
> +}
> _
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@adfin.com

^ permalink raw reply	[flat|nested] 94+ messages in thread

* Re: [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only)
@ 2015-04-06  3:53                     ` Milosz Tanski
  0 siblings, 0 replies; 94+ messages in thread
From: Milosz Tanski @ 2015-04-06  3:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jeremy Allison, LKML,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,
	linux-aio-Bw31MaZKKs3YtjvyW6yDsg, Mel Gorman, Volker Lendecke,
	Tejun Heo, Jeff Moyer, Theodore Ts'o, Al Viro, Linux API,
	Michael Kerrisk, linux-arch-u79uwXL29TY76Z2rM5mHXA, Dave Chinner

On Fri, Apr 3, 2015 at 11:42 PM, Andrew Morton
<akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> On Mon, 30 Mar 2015 13:26:25 -0700 Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
>
>> d) fincore() is more expensive
>
> Actually, I kinda take that back.  fincore() will be faster than
> preadv2() in the case of a pagecache miss, and slower in the case of a
> pagecache hit.
>
> The breakpoint appears to be a hit rate of 30% - if fewer than 30% of
> queries find the page in pagecache, fincore() will be faster than
> preadv2().

In my application (motivation for this patch), web-serving
applications (familiar to me), and Samba I'm going to that the
majority of requests are going to be cached. Only some small
percentage will be uncached (say 20%). I'll add to that: a small
percentage but of a large number.

A lot of IO falls into zipfan / sequential pattern. And that makes
send to me a small number of frequently access files and large
streaming data (with read ahead).

>
> This is because for a pagecache miss, fincore() will be about twice as
> fast as preadv2().  For a pagecache hit, fincore()+pread() is 55%
> slower than preadv2().  If there are lots of misses, fincore() is
> faster overall.
>


>
>
>
> Minimal fincore() implementation is below.  It doesn't implement the
> page_map!=NULL mode at all and will be slow for large areas - it needs
> to be taught about radix_tree_for_each_*().  But it's good enough for
> testing.

I'm glad you took the time to do this. It's simple, but your
implementation is much cleaner then the last round of fincore() from 3
years back.

>
> On a slow machine, in nanoseconds:
>
> null syscall:           528
> fincore (miss):         674
> fincore (hit):          729
> single byte pread:      1026
> single byte preadv:     1134

I'm not surprised, fincore() doesn't have to go through all the vfs /
fs machinery that pread or preadv do. By chance if you compare pread /
preadv with a larger read (say 4k) is the difference negligible.

>
> pread() is a bit faster than preadv() and samba uses pread(), so the
> implementations are:
>
>         if (fincore(fd, NULL, offset, len) == len)
>                 pread();
>         else
>                 punt();
>
>         if (preadv2(fd, ..., offset, len) == len)
>                 ...
>         else
>                 punt();
>
> fincore+pread, pagecache-hit:   1755ns
> fincore+pread, pagecache-miss:  674ns
> preadv():                       1134ns (preadv2() will be a little faster for misses)
>
>
>
> Now, a pagecache hit rate of 30% sounds high so one would think that
> fincore+pread is clearly ahead.  But the pagecache hit rate in this
> code will actually be quite high, because of readahead.
>
> For a large linear read of a file which is perfectly laid out on disk
> and is fully *uncached*, the hit rates will be as good as 99.8%,
> because readahead is bringing in data in 2MB blobs.
>
> In practice I expect that fincore()+pread() will be slower for linear
> reads of medium to large files and faster for small files and seeky
> accesses.
>
> How much does all this matter?  Not much.  On a fast machine a
> single-byte pread() takes 240ns.  So if your server thread is handling
> 25000 requests/sec, we're only talking 0.6% overhead.
>
> Note that we can trivially monitor the hit rate with either preadv2()
> or fincore()+pread(): just count how many times all the data is there
> versus how many times it isn't.
>
>
>
> Also, note that we can use *both* fincore() and preadv2() to detect the
> problematic page-just-disappeared race:
>
>         if (fincore(fd, NULL, offset, len) == len) {
>                 if (preadv2(fd, offset, len) != len)
>                         printf("race just happened");
>
> It would be great if someone could apply the below, modify the
> preadv2() callsite as above and determine under what conditions (if
> any) the page-stealing race occurs.
>
>

Let me see what I can do.

>
>  arch/x86/syscalls/syscall_64.tbl |    1
>  include/linux/syscalls.h         |    2
>  mm/Makefile                      |    2
>  mm/fincore.c                     |   65 +++++++++++++++++++++++++++++
>  4 files changed, 69 insertions(+), 1 deletion(-)
>
> diff -puN arch/x86/syscalls/syscall_64.tbl~fincore arch/x86/syscalls/syscall_64.tbl
> --- a/arch/x86/syscalls/syscall_64.tbl~fincore
> +++ a/arch/x86/syscalls/syscall_64.tbl
> @@ -331,6 +331,7 @@
>  322    64      execveat                stub_execveat
>  323    64      preadv2                 sys_preadv2
>  324    64      pwritev2                sys_pwritev2
> +325    common  fincore                 sys_fincore
>
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff -puN include/linux/syscalls.h~fincore include/linux/syscalls.h
> --- a/include/linux/syscalls.h~fincore
> +++ a/include/linux/syscalls.h
> @@ -880,6 +880,8 @@ asmlinkage long sys_process_vm_writev(pi
>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>                          unsigned long idx1, unsigned long idx2);
>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_fincore(int fd, unsigned char __user *page_map,
> +                           loff_t offset, size_t len);
>  asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>                             const char __user *uargs);
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
> diff -puN mm/Makefile~fincore mm/Makefile
> --- a/mm/Makefile~fincore
> +++ a/mm/Makefile
> @@ -19,7 +19,7 @@ obj-y                 := filemap.o mempool.o oom_kill.
>                            readahead.o swap.o truncate.o vmscan.o shmem.o \
>                            util.o mmzone.o vmstat.o backing-dev.o \
>                            mm_init.o mmu_context.o percpu.o slab_common.o \
> -                          compaction.o vmacache.o \
> +                          compaction.o vmacache.o fincore.o \
>                            interval_tree.o list_lru.o workingset.o \
>                            debug.o $(mmu-y)
>
> diff -puN /dev/null mm/fincore.c
> --- /dev/null
> +++ a/mm/fincore.c
> @@ -0,0 +1,65 @@
> +#include <linux/syscalls.h>
> +#include <linux/pagemap.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +SYSCALL_DEFINE4(fincore, int, fd, unsigned char __user *, page_map,
> +               loff_t, offset, size_t, len)
> +{
> +       struct fd f;
> +       struct address_space *mapping;
> +       loff_t cur_off;
> +       loff_t end;
> +       pgoff_t pgoff;
> +       long ret = 0;
> +
> +       if (offset < 0 || (ssize_t)len <= 0)
> +               return -EINVAL;
> +
> +       f = fdget(fd);
> +
> +       if (!f.file)
> +               return -EBADF;
> +
> +       if (is_file_hugepages(f.file)) {
> +               ret = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (!S_ISREG(file_inode(f.file)->i_mode)) {
> +               ret = -EBADF;
> +               goto out;
> +       }
> +
> +       end = min_t(loff_t, offset + len, i_size_read(file_inode(f.file)));
> +       pgoff = offset >> PAGE_CACHE_SHIFT;
> +       mapping = f.file->f_mapping;
> +
> +       /*
> +        * We probably need to do somethnig here to reduce the chance of the
> +        * pages being reclaimed between fincore() and read().  eg,
> +        * SetPageReferenced(page) or mark_page_accessed(page) or
> +        * activate_page(page).
> +        */
> +       for (cur_off = offset; cur_off < end ; ) {
> +               struct page *page;
> +               loff_t end_of_coverage;
> +
> +               page = find_get_page(mapping, pgoff);
> +               if (!page || !PageUptodate(page))
> +                       break;
> +               page_cache_release(page);
> +
> +               pgoff++;
> +               end_of_coverage = min_t(loff_t, pgoff << PAGE_CACHE_SHIFT, end);
> +               ret += end_of_coverage - cur_off;
> +               cur_off = (cur_off + PAGE_CACHE_SIZE) & PAGE_CACHE_MASK;
> +       }
> +
> +out:
> +       fdput(f);
> +       return ret;
> +}
> _
>



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz-B5zB6C1i6pkAvxtiuMwx3w@public.gmane.org

^ permalink raw reply	[flat|nested] 94+ messages in thread

end of thread, other threads:[~2015-04-06  3:53 UTC | newest]

Thread overview: 94+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-16 18:27 [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Milosz Tanski
2015-03-16 18:27 ` [PATCH v7 1/5] vfs: Prepare for adding a new preadv/pwritev with user flags Milosz Tanski
2015-03-16 18:27   ` Milosz Tanski
2015-03-16 21:05   ` Andreas Dilger
2015-03-16 21:05     ` Andreas Dilger
2015-03-16 18:27 ` [PATCH v7 2/5] vfs: Define new syscalls preadv2,pwritev2 Milosz Tanski
2015-03-16 18:27   ` Milosz Tanski
2015-03-16 18:27 ` [PATCH v7 3/5] x86: wire up preadv2 and pwritev2 Milosz Tanski
2015-03-16 18:27   ` Milosz Tanski
2015-03-16 18:27 ` [PATCH v7 4/5] vfs: RWF_NONBLOCK flag for preadv2 Milosz Tanski
2015-03-16 18:27   ` Milosz Tanski
2015-03-16 18:27 ` [PATCH v7 5/5] xfs: add RWF_NONBLOCK support Milosz Tanski
2015-03-16 18:27   ` Milosz Tanski
2015-03-16 22:04   ` Dave Chinner
2015-03-16 18:32 ` [PATCH] Add preadv2/pwritev2 documentation Milosz Tanski
2015-03-27 16:49   ` Andrew Morton
2015-03-30  7:33     ` Christoph Hellwig
2015-03-30  7:33       ` Christoph Hellwig
2015-03-16 18:34 ` [PATCH] fstests: generic test for preadv2 behavior on linux Milosz Tanski
2015-03-16 18:34   ` Milosz Tanski
2015-03-16 21:07   ` Andreas Dilger
2015-03-16 21:07     ` Andreas Dilger
2015-03-16 22:03     ` Milosz Tanski
2015-03-16 22:02   ` Dave Chinner
2015-03-16 22:02     ` Dave Chinner
2015-03-16 22:11     ` Milosz Tanski
2015-03-16 22:56       ` Dave Chinner
2015-03-16 22:56         ` Dave Chinner
2015-03-26 11:55 ` [PATCH v7 0/5] vfs: Non-blockling buffered fs read (page cache only) Christoph Hellwig
2015-03-26 11:55   ` Christoph Hellwig
2015-03-26 19:12   ` Milosz Tanski
2015-03-26 19:12     ` Milosz Tanski
2015-03-27  2:26     ` Milosz Tanski
2015-03-27  2:29     ` Milosz Tanski
2015-03-27  2:29       ` Milosz Tanski
2015-03-27  3:28 ` Andrew Morton
2015-03-27  3:28   ` Andrew Morton
2015-03-27  5:41   ` Volker Lendecke
2015-03-27  5:41     ` Volker Lendecke
2015-03-27  6:08     ` Andrew Morton
2015-03-27  6:08       ` Andrew Morton
2015-03-27  8:02       ` Volker Lendecke
2015-03-27  8:02         ` Volker Lendecke
2015-03-27  8:12         ` Christoph Hellwig
2015-03-27  8:18   ` Christoph Hellwig
2015-03-27  8:18     ` Christoph Hellwig
2015-03-27  8:35     ` Andrew Morton
2015-03-27  8:35       ` Andrew Morton
2015-03-27  8:48       ` Christoph Hellwig
2015-03-27  9:01         ` Andrew Morton
2015-03-27  9:01           ` Andrew Morton
2015-03-27  9:44           ` Volker Lendecke
2015-03-27 15:58           ` Jeremy Allison
2015-03-27 15:58             ` Jeremy Allison
2015-03-27 16:30             ` Andrew Morton
2015-03-27 16:30               ` Andrew Morton
2015-03-27 16:30               ` Andrew Morton
2015-03-27 16:30               ` Andrew Morton
2015-03-27 16:39               ` Jeremy Allison
2015-03-27 16:39                 ` Jeremy Allison
2015-03-27 16:39               ` Andrew Morton
2015-03-27 16:45               ` Milosz Tanski
2015-03-31  1:27               ` Milosz Tanski
2015-03-27 16:38             ` Milosz Tanski
2015-03-27 16:38               ` Milosz Tanski
2015-03-30  7:36             ` Christoph Hellwig
2015-03-30 17:19               ` Jeremy Allison
2015-03-30 17:19                 ` Jeremy Allison
2015-03-30 22:51                 ` Milosz Tanski
2015-03-30 20:26               ` Andrew Morton
2015-03-30 20:26                 ` Andrew Morton
2015-03-30 20:32                 ` Jeremy Allison
2015-03-30 20:37                   ` Andrew Morton
2015-03-30 20:49                     ` Jeremy Allison
2015-03-30 21:33                       ` Andrew Morton
2015-03-30 22:35                     ` Milosz Tanski
2015-03-30 22:49                   ` Milosz Tanski
2015-03-30 22:57                     ` Andrew Morton
2015-03-30 23:06                       ` Milosz Tanski
2015-03-30 23:06                         ` Milosz Tanski
2015-03-30 23:25                 ` Milosz Tanski
2015-04-04  3:42                 ` Andrew Morton
2015-04-06  3:53                   ` Milosz Tanski
2015-04-06  3:53                     ` Milosz Tanski
2015-03-30 23:09               ` Milosz Tanski
2015-03-27 15:21   ` Milosz Tanski
2015-03-27 15:21     ` Milosz Tanski
2015-03-27 17:04     ` Andrew Morton
2015-03-30  7:40       ` Christoph Hellwig
2015-03-30  7:40         ` Christoph Hellwig
2015-03-30 18:54         ` Andrew Morton
2015-03-30 22:40           ` Milosz Tanski
2015-03-30 22:50             ` Andrew Morton
2015-03-30 22:50               ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.