linux-fsdevel.vger.kernel.org archive mirror
* [PATCHSET v4] Support for polled aio
@ 2018-11-30 16:56 Jens Axboe
  2018-11-30 16:56 ` [PATCH 01/27] aio: fix failure to put the file pointer Jens Axboe
                   ` (26 more replies)
  0 siblings, 27 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch

For the grand introduction to this feature, see my original posting
here:

https://lore.kernel.org/linux-block/20181117235317.7366-1-axboe@kernel.dk/

and refer to the previous postings of this patchset for the features
added in those revisions.

Outside of "just" supporting polled IO, it also adds support for user
mapped IOCBs, so we don't have to copy those for every IO. A new
addition in this version is support for pre-mapped buffers as well.
If an application uses fixed IO buffers, it can set IOCTX_FLAG_FIXEDBUFS
along with IOCTX_FLAG_USERIOCB. The iocbs that are mapped in should have
the maximum length already set, and the buffer field pointing to the
right location. That eliminates the need to do get_user_pages() for
every IO.
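
As a rough illustration (not part of the patches themselves), setting up
a context in this mode from userspace could look like the sketch below.
It assumes the io_setup2() signature added later in this series, and that
__NR_io_setup2 and the IOCTX_FLAG_* values are visible through updated
headers; treat it as a sketch, not a definitive example:

#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

#define QD	32
#define BS	4096

static int setup_fixed_ctx(aio_context_t *ctx, struct iocb **iocbs_out)
{
	struct iocb *iocbs;
	char *buf;
	int i;

	/* the iocb array must be page aligned, the kernel maps it directly */
	if (posix_memalign((void **) &iocbs, 4096, QD * sizeof(*iocbs)))
		return -1;
	if (posix_memalign((void **) &buf, 4096, QD * BS))
		return -1;

	memset(iocbs, 0, QD * sizeof(*iocbs));
	for (i = 0; i < QD; i++) {
		/* buffer and max length set up front, mapped once by the kernel */
		iocbs[i].aio_buf = (uint64_t) (uintptr_t) (buf + i * BS);
		iocbs[i].aio_nbytes = BS;
	}

	*iocbs_out = iocbs;
	*ctx = 0;
	return syscall(__NR_io_setup2, QD,
		       IOCTX_FLAG_USERIOCB | IOCTX_FLAG_FIXEDBUFS, iocbs, ctx);
}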

Everything has been solid for me in all the testing I have done: no
problems observed, no crashes, no corruptions, etc.

For the testing below, 'Mainline' refers to current -git from Linus,
'aio-poll' is the aio-poll branch but with none of the new features
enabled, and finally 'aio-poll-all' is the aio-poll branch with
useriocbs turned on, polling turned on, and user mapped buffers
turned on. In other words, mainline and aio-poll are running the
exact same workload, and aio-poll-all is running that workload,
but with the new features turned on.

All testing done with fio. Latencies quoted are 50th percentile.

All testing is done with a single thread, using a maximum of one
core in the system. Testing is run on two devices - one that supports
high peak IOPS, and one that is low latency.


Peak IOPS testing on an NVMe device that supports high IOPS:

Depth      Mainline           aio-poll        aio-poll-all
============================================================
1            77K                80K              132K
2           145K               163K              262K
4           287K               342K              514K
8           560K               582K              824K
16          616K               727K             1013K
32          636K               773K             1155K
64          635K               776K             1230K


Low latency testing on low latency device:

Depth      Mainline	      aio-poll        aio-poll-all
============================================================
1        84K / 8.5 usec    87K / 8.3 usec    168K / 5.0 usec
2       201K / 7.4 usec   208K / 7.1 usec    330K / 5.0 usec
4       389K / 7.7 usec   398K / 7.2 usec    547K / 6.1 usec

It's worth noting that the average IO submission time for
'aio-poll-all' is 660 nsec, with aio-poll at 1.8 - 2.0 usec, and finally
mainline at 1.8 - 2.1 usec.

As before, patches are against my 'mq-perf' branch, and can also
be found in my aio-poll branch.

 Documentation/filesystems/vfs.txt      |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 block/bio.c                            |   36 +-
 fs/aio.c                               | 1055 +++++++++++++++++++++---
 fs/block_dev.c                         |   36 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |   10 +-
 fs/gfs2/file.c                         |    2 +
 fs/iomap.c                             |   56 +-
 fs/xfs/xfs_file.c                      |    1 +
 include/linux/bio.h                    |    1 +
 include/linux/blk_types.h              |    1 +
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |    5 +-
 include/linux/iomap.h                  |    1 +
 include/linux/syscalls.h               |    2 +
 include/linux/uio.h                    |    3 +
 include/uapi/asm-generic/unistd.h      |    4 +-
 include/uapi/linux/aio_abi.h           |    6 +
 kernel/sys_ni.c                        |    1 +
 lib/iov_iter.c                         |   35 +-
 21 files changed, 1109 insertions(+), 167 deletions(-)

-- 
Jens Axboe

* [PATCH 01/27] aio: fix failure to put the file pointer
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 17:07   ` Bart Van Assche
  2018-11-30 16:56 ` [PATCH 02/27] aio: clear IOCB_HIPRI Jens Axboe
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

If the ioprio capability check fails, we return without putting
the file pointer.

Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/aio.c b/fs/aio.c
index b984918be4b7..205390c0c1bb 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
 		ret = ioprio_check_cap(iocb->aio_reqprio);
 		if (ret) {
 			pr_debug("aio ioprio check cap error: %d\n", ret);
+			fput(req->ki_filp);
 			return ret;
 		}
 
-- 
2.17.1

* [PATCH 02/27] aio: clear IOCB_HIPRI
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
  2018-11-30 16:56 ` [PATCH 01/27] aio: fix failure to put the file pointer Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 17:13   ` Christoph Hellwig
  2018-11-30 16:56 ` [PATCH 03/27] fs: add an iopoll method to struct file_operations Jens Axboe
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

No one is going to poll for aio (yet), so we must clear the HIPRI
flag, as we would otherwise send it down the poll queues, where no
one will be polling for completions.

Signed-off-by: Christoph Hellwig <hch@lst.de>

IOCB_HIPRI, not RWF_HIPRI.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 205390c0c1bb..05647d352bf3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1436,8 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
 		ret = ioprio_check_cap(iocb->aio_reqprio);
 		if (ret) {
 			pr_debug("aio ioprio check cap error: %d\n", ret);
-			fput(req->ki_filp);
-			return ret;
+			goto out_fput;
 		}
 
 		req->ki_ioprio = iocb->aio_reqprio;
@@ -1446,7 +1445,13 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
 
 	ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
 	if (unlikely(ret))
-		fput(req->ki_filp);
+		goto out_fput;
+
+	req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */
+	return 0;
+
+out_fput:
+	fput(req->ki_filp);
 	return ret;
 }
 
-- 
2.17.1

* [PATCH 03/27] fs: add an iopoll method to struct file_operations
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
  2018-11-30 16:56 ` [PATCH 01/27] aio: fix failure to put the file pointer Jens Axboe
  2018-11-30 16:56 ` [PATCH 02/27] aio: clear IOCB_HIPRI Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 04/27] block: wire up block device iopoll method Jens Axboe
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

This new method is used to explicitly poll for I/O completion for an
iocb.  It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.

TODO: we can probably union ki_cookie with the existing hint and I/O
priority fields to avoid struct kiocb growth.
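
As a rough sketch of the caller-side contract (not part of this patch),
reaping a single polled iocb might look like the below, assuming the
submitter keeps the kiocb alive until it has been reaped:

static int reap_iocb(struct kiocb *kiocb, bool spin)
{
	struct file *file = kiocb->ki_filp;

	if (!(kiocb->ki_flags & IOCB_HIPRI) || !file->f_op->iopoll)
		return 0;

	/*
	 * Drives completion; the iocb itself still completes through
	 * ->ki_complete() as usual.
	 */
	return file->f_op->iopoll(kiocb, spin);
}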

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 5f71a252e2e0..d9dc5e4d82b9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
 
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a1ab233e6469..6a5f71f8ae06 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1781,6 +1782,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-- 
2.17.1

* [PATCH 04/27] block: wire up block device iopoll method
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (2 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 03/27] fs: add an iopoll method to struct file_operations Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
                   ` (22 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Just call blk_poll on the iocb cookie; we can derive the block device
from the inode trivially.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e1886cc7048f..6de8d35f6e41 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -281,6 +281,14 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -398,6 +406,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio->bi_opf |= REQ_HIPRI;
 
 			qc = submit_bio(bio);
+			WRITE_ONCE(iocb->ki_cookie, qc);
 			break;
 		}
 
@@ -2070,6 +2079,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
-- 
2.17.1

* [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (3 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 04/27] block: wire up block device iopoll method Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 17:12   ` Bart Van Assche
  2018-11-30 16:56 ` [PATCH 06/27] iomap: wire up the iopoll method Jens Axboe
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

We can't wait for polled events to complete, as they may require active
polling from the task that submitted them. If that is the same task that
is submitting new IO, we could deadlock waiting for IO that this task
itself is supposed to be reaping.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6de8d35f6e41..ebc3d5a0f424 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -402,8 +402,16 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
-			if (iocb->ki_flags & IOCB_HIPRI)
+			if (iocb->ki_flags & IOCB_HIPRI) {
 				bio->bi_opf |= REQ_HIPRI;
+				/*
+				 * For async polled IO, we can't wait for
+				 * requests to complete, as they may also be
+				 * polled and require active reaping.
+				 */
+				if (!is_sync)
+					bio->bi_opf |= REQ_NOWAIT;
+			}
 
 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
-- 
2.17.1

* [PATCH 06/27] iomap: wire up the iopoll method
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (4 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 07/27] iomap: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
                   ` (20 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            | 43 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_file.c     |  1 +
 include/linux/iomap.h |  1 +
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 45a17b770d97..358157efc5b7 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
@@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
diff --git a/fs/iomap.c b/fs/iomap.c
index 74c1f37f0fd6..16787b3b09fd 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1436,6 +1436,28 @@ struct iomap_dio {
 	};
 };
 
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
+{
+	struct request_queue *q = READ_ONCE(kiocb->private);
+
+	if (!q)
+		return 0;
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
+}
+EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
+
+static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
+		struct bio *bio)
+{
+	atomic_inc(&dio->ref);
+
+	if (dio->iocb->ki_flags & IOCB_HIPRI)
+		bio->bi_opf |= REQ_HIPRI;
+
+	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+	dio->submit.cookie = submit_bio(bio);
+}
+
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
@@ -1548,7 +1570,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }
 
-static blk_qc_t
+static void
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
@@ -1562,15 +1584,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	if (dio->iocb->ki_flags & IOCB_HIPRI)
-		flags |= REQ_HIPRI;
-
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-
-	atomic_inc(&dio->ref);
-	return submit_bio(bio);
+	iomap_dio_submit_bio(dio, iomap, bio);
 }
 
 static loff_t
@@ -1666,9 +1683,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 				bio_set_pages_dirty(bio);
 		}
 
-		if (dio->iocb->ki_flags & IOCB_HIPRI)
-			bio->bi_opf |= REQ_HIPRI;
-
 		iov_iter_advance(dio->submit.iter, n);
 
 		dio->size += n;
@@ -1676,11 +1690,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-		atomic_inc(&dio->ref);
-
-		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-		dio->submit.cookie = submit_bio(bio);
+		iomap_dio_submit_bio(dio, iomap, bio);
 	} while (nr_pages);
 
 	if (need_zeroout) {
@@ -1883,6 +1893,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
+	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+	WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!dio->wait_for_completion)
 			return -EIOCBQUEUED;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 53c9ab8fb777..603e705781a4 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
 	.write_iter	= xfs_file_write_iter,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= xfs_file_compat_ioctl,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..0fefb5455bda 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
 		unsigned flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
 struct file;
-- 
2.17.1

* [PATCH 07/27] iomap: ensure that async polled IO is marked REQ_NOWAIT
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (5 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 06/27] iomap: wire up the iopoll method Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 08/27] aio: use assigned completion handler Jens Axboe
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

We can't wait for polled events to complete, as they may require active
polling from the task that submitted them. If that is the same task that
is submitting new IO, we could deadlock waiting for IO that this task
itself is supposed to be reaping.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index 16787b3b09fd..96d60b9b2bea 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1451,8 +1451,16 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
 {
 	atomic_inc(&dio->ref);
 
-	if (dio->iocb->ki_flags & IOCB_HIPRI)
+	if (dio->iocb->ki_flags & IOCB_HIPRI) {
 		bio->bi_opf |= REQ_HIPRI;
+		/*
+		 * For async polled IO, we can't wait for requests to
+		 * complete, as they may also be polled and require active
+		 * reaping.
+		 */
+		if (!dio->wait_for_completion)
+			bio->bi_opf |= REQ_NOWAIT;
+	}
 
 	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
 	dio->submit.cookie = submit_bio(bio);
-- 
2.17.1

* [PATCH 08/27] aio: use assigned completion handler
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (6 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 07/27] iomap: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 09/27] aio: separate out ring reservation from req allocation Jens Axboe
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

We know this is a read/write request, but in preparation for
having different kinds of those, ensure that we call the assigned
handler instead of assuming it's aio_complete_rw().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 05647d352bf3..cf0de61743e8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1490,7 +1490,7 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
 		ret = -EINTR;
 		/*FALLTHRU*/
 	default:
-		aio_complete_rw(req, ret, 0);
+		req->ki_complete(req, ret, 0);
 	}
 }
 
-- 
2.17.1

* [PATCH 09/27] aio: separate out ring reservation from req allocation
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (7 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 08/27] aio: use assigned completion handler Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req() Jens Axboe
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

This is in preparation for certain types of IO not needing a ring
reservation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 30 +++++++++++++++++-------------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index cf0de61743e8..eaceb40e6cf5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -901,7 +901,7 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 	local_irq_restore(flags);
 }
 
-static bool get_reqs_available(struct kioctx *ctx)
+static bool __get_reqs_available(struct kioctx *ctx)
 {
 	struct kioctx_cpu *kcpu;
 	bool ret = false;
@@ -993,6 +993,14 @@ static void user_refill_reqs_available(struct kioctx *ctx)
 	spin_unlock_irq(&ctx->completion_lock);
 }
 
+static bool get_reqs_available(struct kioctx *ctx)
+{
+	if (__get_reqs_available(ctx))
+		return true;
+	user_refill_reqs_available(ctx);
+	return __get_reqs_available(ctx);
+}
+
 /* aio_get_req
  *	Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
@@ -1001,24 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 {
 	struct aio_kiocb *req;
 
-	if (!get_reqs_available(ctx)) {
-		user_refill_reqs_available(ctx);
-		if (!get_reqs_available(ctx))
-			return NULL;
-	}
-
 	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
 	if (unlikely(!req))
-		goto out_put;
+		return NULL;
 
 	percpu_ref_get(&ctx->reqs);
 	INIT_LIST_HEAD(&req->ki_list);
 	refcount_set(&req->ki_refcnt, 0);
 	req->ki_ctx = ctx;
 	return req;
-out_put:
-	put_reqs_available(ctx, 1);
-	return NULL;
 }
 
 static struct kioctx *lookup_ioctx(unsigned long ctx_id)
@@ -1805,9 +1804,13 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		return -EINVAL;
 	}
 
+	if (!get_reqs_available(ctx))
+		return -EAGAIN;
+
+	ret = -EAGAIN;
 	req = aio_get_req(ctx);
 	if (unlikely(!req))
-		return -EAGAIN;
+		goto out_put_reqs_available;
 
 	if (iocb.aio_flags & IOCB_FLAG_RESFD) {
 		/*
@@ -1870,11 +1873,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		goto out_put_req;
 	return 0;
 out_put_req:
-	put_reqs_available(ctx, 1);
 	percpu_ref_put(&ctx->reqs);
 	if (req->ki_eventfd)
 		eventfd_ctx_put(req->ki_eventfd);
 	kmem_cache_free(kiocb_cachep, req);
+out_put_reqs_available:
+	put_reqs_available(ctx, 1);
 	return ret;
 }
 
-- 
2.17.1

* [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req()
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (8 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 09/27] aio: separate out ring reservation from req allocation Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-12-04 14:49   ` Christoph Hellwig
  2018-11-30 16:56 ` [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions Jens Axboe
                   ` (16 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

It's 192 bytes, fairly substantial. Most items don't need to be cleared,
especially not upfront. Clear the ones we do need to clear, and leave
the other ones for setup when the iocb is prepared and submitted.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index eaceb40e6cf5..681f2072f81b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1009,14 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 {
 	struct aio_kiocb *req;
 
-	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
-	if (unlikely(!req))
-		return NULL;
+	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
+	if (req) {
+		percpu_ref_get(&ctx->reqs);
+		req->ki_ctx = ctx;
+		INIT_LIST_HEAD(&req->ki_list);
+		refcount_set(&req->ki_refcnt, 0);
+		req->ki_eventfd = NULL;
+	}
 
-	percpu_ref_get(&ctx->reqs);
-	INIT_LIST_HEAD(&req->ki_list);
-	refcount_set(&req->ki_refcnt, 0);
-	req->ki_ctx = ctx;
 	return req;
 }
 
@@ -1730,6 +1731,10 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
 	if (unlikely(!req->file))
 		return -EBADF;
 
+	req->head = NULL;
+	req->woken = false;
+	req->cancelled = false;
+
 	apt.pt._qproc = aio_poll_queue_proc;
 	apt.pt._key = req->events;
 	apt.iocb = aiocb;
-- 
2.17.1

* [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (9 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req() Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-12-04 14:50   ` Christoph Hellwig
  2018-11-30 16:56 ` [PATCH 12/27] aio: use iocb_put() instead of open coding it Jens Axboe
                   ` (15 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Plugging is meant to optimize submission of a string of IOs. If we don't
have more than 2 being submitted, don't bother setting up a plug.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 681f2072f81b..533cb7b1112f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1887,6 +1887,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	return ret;
 }
 
+/*
+ * Plugging is meant to work with larger batches of IOs. If we don't
+ * have more than the below, then don't bother setting up a plug.
+ */
+#define AIO_PLUG_THRESHOLD	2
+
 /* sys_io_submit:
  *	Queue the nr iocbs pointed to by iocbpp for processing.  Returns
  *	the number of iocbs queued.  May return -EINVAL if the aio_context
@@ -1919,7 +1925,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 	if (nr > ctx->nr_events)
 		nr = ctx->nr_events;
 
-	blk_start_plug(&plug);
+	if (nr > AIO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
 	for (i = 0; i < nr; i++) {
 		struct iocb __user *user_iocb;
 
@@ -1932,7 +1939,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	if (nr > AIO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
@@ -1959,7 +1967,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 	if (nr > ctx->nr_events)
 		nr = ctx->nr_events;
 
-	blk_start_plug(&plug);
+	if (nr > AIO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
 	for (i = 0; i < nr; i++) {
 		compat_uptr_t user_iocb;
 
@@ -1972,7 +1981,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 		if (ret)
 			break;
 	}
-	blk_finish_plug(&plug);
+	if (nr > AIO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
-- 
2.17.1

* [PATCH 12/27] aio: use iocb_put() instead of open coding it
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (10 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-12-04 14:50   ` Christoph Hellwig
  2018-11-30 16:56 ` [PATCH 13/27] aio: split out iocb copy from io_submit_one() Jens Axboe
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Replace the percpu_ref_put() + kmem_cache_free() with a call to
iocb_put() instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 533cb7b1112f..e8457f9486e3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1878,10 +1878,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		goto out_put_req;
 	return 0;
 out_put_req:
-	percpu_ref_put(&ctx->reqs);
 	if (req->ki_eventfd)
 		eventfd_ctx_put(req->ki_eventfd);
-	kmem_cache_free(kiocb_cachep, req);
+	iocb_put(req);
 out_put_reqs_available:
 	put_reqs_available(ctx, 1);
 	return ret;
-- 
2.17.1

* [PATCH 13/27] aio: split out iocb copy from io_submit_one()
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (11 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 12/27] aio: use iocb_put() instead of open coding it Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 14/27] aio: abstract out io_event filler helper Jens Axboe
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

In preparation for handing in iocbs in a different fashion as well. Also
make it clear that the iocb being passed in isn't modified, by marking
it const throughout.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 68 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 38 insertions(+), 30 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e8457f9486e3..ba5758c854e8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1414,7 +1414,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 	aio_complete(iocb, res, res2);
 }
 
-static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
+static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 {
 	int ret;
 
@@ -1455,7 +1455,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
 	return ret;
 }
 
-static int aio_setup_rw(int rw, struct iocb *iocb, struct iovec **iovec,
+static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec,
 		bool vectored, bool compat, struct iov_iter *iter)
 {
 	void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
@@ -1494,8 +1494,8 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
 	}
 }
 
-static ssize_t aio_read(struct kiocb *req, struct iocb *iocb, bool vectored,
-		bool compat)
+static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
+			bool vectored, bool compat)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct iov_iter iter;
@@ -1527,8 +1527,8 @@ static ssize_t aio_read(struct kiocb *req, struct iocb *iocb, bool vectored,
 	return ret;
 }
 
-static ssize_t aio_write(struct kiocb *req, struct iocb *iocb, bool vectored,
-		bool compat)
+static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
+			 bool vectored, bool compat)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct iov_iter iter;
@@ -1583,7 +1583,8 @@ static void aio_fsync_work(struct work_struct *work)
 	aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0);
 }
 
-static int aio_fsync(struct fsync_iocb *req, struct iocb *iocb, bool datasync)
+static int aio_fsync(struct fsync_iocb *req, const struct iocb *iocb,
+		     bool datasync)
 {
 	if (unlikely(iocb->aio_buf || iocb->aio_offset || iocb->aio_nbytes ||
 			iocb->aio_rw_flags))
@@ -1711,7 +1712,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head,
 	add_wait_queue(head, &pt->iocb->poll.wait);
 }
 
-static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
+static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 {
 	struct kioctx *ctx = aiocb->ki_ctx;
 	struct poll_iocb *req = &aiocb->poll;
@@ -1783,27 +1784,23 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
 	return 0;
 }
 
-static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
-			 bool compat)
+static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
+			   struct iocb __user *user_iocb, bool compat)
 {
 	struct aio_kiocb *req;
-	struct iocb iocb;
 	ssize_t ret;
 
-	if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
-		return -EFAULT;
-
 	/* enforce forwards compatibility on users */
-	if (unlikely(iocb.aio_reserved2)) {
+	if (unlikely(iocb->aio_reserved2)) {
 		pr_debug("EINVAL: reserve field set\n");
 		return -EINVAL;
 	}
 
 	/* prevent overflows */
 	if (unlikely(
-	    (iocb.aio_buf != (unsigned long)iocb.aio_buf) ||
-	    (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) ||
-	    ((ssize_t)iocb.aio_nbytes < 0)
+	    (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
+	    (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
+	    ((ssize_t)iocb->aio_nbytes < 0)
 	   )) {
 		pr_debug("EINVAL: overflow check\n");
 		return -EINVAL;
@@ -1817,14 +1814,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	if (unlikely(!req))
 		goto out_put_reqs_available;
 
-	if (iocb.aio_flags & IOCB_FLAG_RESFD) {
+	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
 		 * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an
 		 * instance of the file* now. The file descriptor must be
 		 * an eventfd() fd, and will be signaled for each completed
 		 * event using the eventfd_signal() function.
 		 */
-		req->ki_eventfd = eventfd_ctx_fdget((int) iocb.aio_resfd);
+		req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
 		if (IS_ERR(req->ki_eventfd)) {
 			ret = PTR_ERR(req->ki_eventfd);
 			req->ki_eventfd = NULL;
@@ -1839,32 +1836,32 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	}
 
 	req->ki_user_iocb = user_iocb;
-	req->ki_user_data = iocb.aio_data;
+	req->ki_user_data = iocb->aio_data;
 
-	switch (iocb.aio_lio_opcode) {
+	switch (iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
-		ret = aio_read(&req->rw, &iocb, false, compat);
+		ret = aio_read(&req->rw, iocb, false, compat);
 		break;
 	case IOCB_CMD_PWRITE:
-		ret = aio_write(&req->rw, &iocb, false, compat);
+		ret = aio_write(&req->rw, iocb, false, compat);
 		break;
 	case IOCB_CMD_PREADV:
-		ret = aio_read(&req->rw, &iocb, true, compat);
+		ret = aio_read(&req->rw, iocb, true, compat);
 		break;
 	case IOCB_CMD_PWRITEV:
-		ret = aio_write(&req->rw, &iocb, true, compat);
+		ret = aio_write(&req->rw, iocb, true, compat);
 		break;
 	case IOCB_CMD_FSYNC:
-		ret = aio_fsync(&req->fsync, &iocb, false);
+		ret = aio_fsync(&req->fsync, iocb, false);
 		break;
 	case IOCB_CMD_FDSYNC:
-		ret = aio_fsync(&req->fsync, &iocb, true);
+		ret = aio_fsync(&req->fsync, iocb, true);
 		break;
 	case IOCB_CMD_POLL:
-		ret = aio_poll(req, &iocb);
+		ret = aio_poll(req, iocb);
 		break;
 	default:
-		pr_debug("invalid aio operation %d\n", iocb.aio_lio_opcode);
+		pr_debug("invalid aio operation %d\n", iocb->aio_lio_opcode);
 		ret = -EINVAL;
 		break;
 	}
@@ -1886,6 +1883,17 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	return ret;
 }
 
+static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
+			 bool compat)
+{
+	struct iocb iocb;
+
+	if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
+		return -EFAULT;
+
+	return __io_submit_one(ctx, &iocb, user_iocb, compat);
+}
+
 /*
  * Plugging is meant to work with larger batches of IOs. If we don't
  * have more than the below, then don't bother setting up a plug.
-- 
2.17.1

* [PATCH 14/27] aio: abstract out io_event filler helper
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (12 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 13/27] aio: split out iocb copy from io_submit_one() Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 15/27] aio: add io_setup2() system call Jens Axboe
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ba5758c854e8..12859ea1cb64 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1057,6 +1057,15 @@ static inline void iocb_put(struct aio_kiocb *iocb)
 	}
 }
 
+static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
+			   long res, long res2)
+{
+	ev->obj = (u64)(unsigned long)iocb->ki_user_iocb;
+	ev->data = iocb->ki_user_data;
+	ev->res = res;
+	ev->res2 = res2;
+}
+
 /* aio_complete
  *	Called when the io request on the given iocb is complete.
  */
@@ -1084,10 +1093,7 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2)
 	ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
 	event = ev_page + pos % AIO_EVENTS_PER_PAGE;
 
-	event->obj = (u64)(unsigned long)iocb->ki_user_iocb;
-	event->data = iocb->ki_user_data;
-	event->res = res;
-	event->res2 = res2;
+	aio_fill_event(event, iocb, res, res2);
 
 	kunmap_atomic(ev_page);
 	flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
-- 
2.17.1

* [PATCH 15/27] aio: add io_setup2() system call
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (13 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 14/27] aio: abstract out io_event filler helper Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 16/27] aio: add support for having user mapped iocbs Jens Axboe
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

This is just like io_setup(), except it adds a flags argument to let the
caller control/define some of the io_context behavior. Outside of that,
we also pass in an iocb array for future use.
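
For illustration, a minimal invocation might look like the sketch below.
There is no libc wrapper yet, so the raw syscall() use and the
__NR_io_setup2 definition are assumptions:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

static int setup_ctx(unsigned int nr_events, aio_context_t *ctx)
{
	*ctx = 0;	/* the kernel requires the passed-in context to be zero */

	/* no flags and no iocb array yet, both are used by later patches */
	return syscall(__NR_io_setup2, nr_events, 0, NULL, ctx);
}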

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 fs/aio.c                               | 70 ++++++++++++++++----------
 include/linux/syscalls.h               |  2 +
 include/uapi/asm-generic/unistd.h      |  4 +-
 kernel/sys_ni.c                        |  1 +
 5 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..67c357225fb0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,7 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+335	common	io_setup2		__x64_sys_io_setup2
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/aio.c b/fs/aio.c
index 12859ea1cb64..74831ce2185e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -94,6 +94,8 @@ struct kioctx {
 
 	unsigned long		user_id;
 
+	unsigned int		flags;
+
 	struct __percpu kioctx_cpu *cpu;
 
 	/*
@@ -680,21 +682,24 @@ static void aio_nr_sub(unsigned nr)
 	spin_unlock(&aio_nr_lock);
 }
 
-/* ioctx_alloc
- *	Allocates and initializes an ioctx.  Returns an ERR_PTR if it failed.
- */
-static struct kioctx *ioctx_alloc(unsigned nr_events)
+static struct kioctx *io_setup_flags(unsigned long ctxid,
+				     unsigned int nr_events, unsigned int flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct kioctx *ctx;
 	int err = -ENOMEM;
-
 	/*
 	 * Store the original nr_events -- what userspace passed to io_setup(),
 	 * for counting against the global limit -- before it changes.
 	 */
 	unsigned int max_reqs = nr_events;
 
+	if (unlikely(ctxid || nr_events == 0)) {
+		pr_debug("EINVAL: ctx %lu nr_events %u\n",
+		         ctxid, nr_events);
+		return ERR_PTR(-EINVAL);
+	}
+
 	/*
 	 * We keep track of the number of available ringbuffer slots, to prevent
 	 * overflow (reqs_available), and we also use percpu counters for this.
@@ -720,6 +725,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
+	ctx->flags = flags;
 	ctx->max_reqs = max_reqs;
 
 	spin_lock_init(&ctx->ctx_lock);
@@ -1275,6 +1281,33 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 	return ret;
 }
 
+SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
+		iocbs, aio_context_t __user *, ctxp)
+{
+	struct kioctx *ioctx;
+	unsigned long ctx;
+	long ret;
+
+	if (flags)
+		return -EINVAL;
+
+	ret = get_user(ctx, ctxp);
+	if (unlikely(ret))
+		goto out;
+
+	ioctx = io_setup_flags(ctx, nr_events, flags);
+	ret = PTR_ERR(ioctx);
+	if (IS_ERR(ioctx))
+		goto out;
+
+	ret = put_user(ioctx->user_id, ctxp);
+	if (ret)
+		kill_ioctx(current->mm, ioctx, NULL);
+	percpu_ref_put(&ioctx->users);
+out:
+	return ret;
+}
+
 /* sys_io_setup:
  *	Create an aio_context capable of receiving at least nr_events.
  *	ctxp must not point to an aio_context that already exists, and
@@ -1290,7 +1323,7 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
  */
 SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 {
-	struct kioctx *ioctx = NULL;
+	struct kioctx *ioctx;
 	unsigned long ctx;
 	long ret;
 
@@ -1298,14 +1331,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 	if (unlikely(ret))
 		goto out;
 
-	ret = -EINVAL;
-	if (unlikely(ctx || nr_events == 0)) {
-		pr_debug("EINVAL: ctx %lu nr_events %u\n",
-		         ctx, nr_events);
-		goto out;
-	}
-
-	ioctx = ioctx_alloc(nr_events);
+	ioctx = io_setup_flags(ctx, nr_events, 0);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
 		ret = put_user(ioctx->user_id, ctxp);
@@ -1313,7 +1339,6 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
 	}
-
 out:
 	return ret;
 }
@@ -1321,7 +1346,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp)
 #ifdef CONFIG_COMPAT
 COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p)
 {
-	struct kioctx *ioctx = NULL;
+	struct kioctx *ioctx;
 	unsigned long ctx;
 	long ret;
 
@@ -1329,23 +1354,14 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p)
 	if (unlikely(ret))
 		goto out;
 
-	ret = -EINVAL;
-	if (unlikely(ctx || nr_events == 0)) {
-		pr_debug("EINVAL: ctx %lu nr_events %u\n",
-		         ctx, nr_events);
-		goto out;
-	}
-
-	ioctx = ioctx_alloc(nr_events);
+	ioctx = io_setup_flags(ctx, nr_events, 0);
 	ret = PTR_ERR(ioctx);
 	if (!IS_ERR(ioctx)) {
-		/* truncating is ok because it's a user address */
-		ret = put_user((u32)ioctx->user_id, ctx32p);
+		ret = put_user(ioctx->user_id, ctx32p);
 		if (ret)
 			kill_ioctx(current->mm, ioctx, NULL);
 		percpu_ref_put(&ioctx->users);
 	}
-
 out:
 	return ret;
 }
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2ac3d13a915b..b661e78717e6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -287,6 +287,8 @@ static inline void addr_limit_user_check(void)
  */
 #ifndef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
 asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t __user *ctx);
+asmlinkage long sys_io_setup2(unsigned, unsigned, struct iocb __user *,
+				aio_context_t __user *);
 asmlinkage long sys_io_destroy(aio_context_t ctx);
 asmlinkage long sys_io_submit(aio_context_t, long,
 			struct iocb __user * __user *);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 538546edbfbd..b4527ed373b0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -738,9 +738,11 @@ __SYSCALL(__NR_statx,     sys_statx)
 __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
 #define __NR_rseq 293
 __SYSCALL(__NR_rseq, sys_rseq)
+#define __NR_io_setup2 294
+__SYSCALL(__NR_io_setup2, sys_io_setup2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 294
+#define __NR_syscalls 295
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index df556175be50..17c8b4393669 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -37,6 +37,7 @@ asmlinkage long sys_ni_syscall(void)
  */
 
 COND_SYSCALL(io_setup);
+COND_SYSCALL(io_setup2);
 COND_SYSCALL_COMPAT(io_setup);
 COND_SYSCALL(io_destroy);
 COND_SYSCALL(io_submit);
-- 
2.17.1

* [PATCH 16/27] aio: add support for having user mapped iocbs
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (14 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 15/27] aio: add io_setup2() system call Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 17/27] aio: support for IO polling Jens Axboe
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

For io_submit(), we have to first copy each pointer to an iocb, then
copy the iocb. The latter is 64 bytes in size, and that's a lot of
copying for a single IO.

Add support for setting IOCTX_FLAG_USERIOCB through the new io_setup2()
system call, which allows the iocbs to reside in userspace. If this flag
is used, then io_submit() doesn't take pointers to iocbs anymore, it
takes an index value into the array of iocbs instead. Similarly, for
io_getevents(), the io_event ->obj will be the index, not the pointer to
the iocb.

See the change made to fio to support this feature; it's pretty trivial
to adapt to. For applications, like fio, that previously embedded the
iocb inside an application-private structure, some sort of lookup
table/structure is needed to find the private IO structure from the
index at io_getevents() time.

http://git.kernel.dk/cgit/fio/commit/?id=3c3168e91329c83880c91e5abc28b9d6b940fd95
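
As an illustration (a sketch, not taken from the fio change above),
submitting by index could look roughly like this:

/* with IOCTX_FLAG_USERIOCB, the io_submit() "pointer" carries an index */
static int submit_index(aio_context_t ctx, unsigned long index)
{
	struct iocb *iocbpp[1] = { (struct iocb *) index };

	return syscall(__NR_io_submit, ctx, 1, iocbpp);
}

On the completion side, io_event ->obj then carries the same index back,
which the application can use to look up its private per-IO state.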

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c                     | 111 +++++++++++++++++++++++++++++++----
 include/uapi/linux/aio_abi.h |   2 +
 2 files changed, 101 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 74831ce2185e..380e6fe8c429 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -121,6 +121,9 @@ struct kioctx {
 	struct page		**ring_pages;
 	long			nr_pages;
 
+	struct page		**iocb_pages;
+	long			iocb_nr_pages;
+
 	struct rcu_work		free_rwork;	/* see free_ioctx() */
 
 	/*
@@ -216,6 +219,11 @@ static struct vfsmount *aio_mnt;
 static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
+static const unsigned int iocb_page_shift =
+				ilog2(PAGE_SIZE / sizeof(struct iocb));
+
+static void aio_useriocb_free(struct kioctx *);
+
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
 	struct file *file;
@@ -572,6 +580,7 @@ static void free_ioctx(struct work_struct *work)
 					  free_rwork);
 	pr_debug("freeing %p\n", ctx);
 
+	aio_useriocb_free(ctx);
 	aio_free_ring(ctx);
 	free_percpu(ctx->cpu);
 	percpu_ref_exit(&ctx->reqs);
@@ -1281,6 +1290,61 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 	return ret;
 }
 
+static struct iocb *aio_iocb_from_index(struct kioctx *ctx, int index)
+{
+	unsigned int page_index;
+	struct iocb *iocb;
+
+	page_index = index >> iocb_page_shift;
+	index &= ((1 << iocb_page_shift) - 1);
+	iocb = page_address(ctx->iocb_pages[page_index]);
+
+	return iocb + index;
+}
+
+static void aio_useriocb_free(struct kioctx *ctx)
+{
+	int i;
+
+	if (!ctx->iocb_nr_pages)
+		return;
+
+	for (i = 0; i < ctx->iocb_nr_pages; i++)
+		put_page(ctx->iocb_pages[i]);
+
+	kfree(ctx->iocb_pages);
+	ctx->iocb_pages = NULL;
+	ctx->iocb_nr_pages = 0;
+}
+
+static int aio_useriocb_map(struct kioctx *ctx, struct iocb __user *iocbs)
+{
+	int nr_pages, ret;
+
+	if ((unsigned long) iocbs & ~PAGE_MASK)
+		return -EINVAL;
+
+	nr_pages = sizeof(struct iocb) * ctx->max_reqs;
+	nr_pages = (nr_pages + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	ctx->iocb_pages = kzalloc(nr_pages * sizeof(struct page *), GFP_KERNEL);
+	if (!ctx->iocb_pages)
+		return -ENOMEM;
+
+	down_write(&current->mm->mmap_sem);
+	ret = get_user_pages((unsigned long) iocbs, nr_pages, 0,
+				ctx->iocb_pages, NULL);
+	up_write(&current->mm->mmap_sem);
+
+	if (ret < nr_pages) {
+		kfree(ctx->iocb_pages);
+		return -ENOMEM;
+	}
+
+	ctx->iocb_nr_pages = nr_pages;
+	return 0;
+}
+
 SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
 		iocbs, aio_context_t __user *, ctxp)
 {
@@ -1288,7 +1352,7 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
 	unsigned long ctx;
 	long ret;
 
-	if (flags)
+	if (flags & ~IOCTX_FLAG_USERIOCB)
 		return -EINVAL;
 
 	ret = get_user(ctx, ctxp);
@@ -1300,9 +1364,17 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
 	if (IS_ERR(ioctx))
 		goto out;
 
+	if (flags & IOCTX_FLAG_USERIOCB) {
+		ret = aio_useriocb_map(ioctx, iocbs);
+		if (ret)
+			goto err;
+	}
+
 	ret = put_user(ioctx->user_id, ctxp);
-	if (ret)
+	if (ret) {
+err:
 		kill_ioctx(current->mm, ioctx, NULL);
+	}
 	percpu_ref_put(&ioctx->users);
 out:
 	return ret;
@@ -1851,10 +1923,13 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 		}
 	}
 
-	ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
-	if (unlikely(ret)) {
-		pr_debug("EFAULT: aio_key\n");
-		goto out_put_req;
+	/* Don't support cancel on user mapped iocbs */
+	if (!(ctx->flags & IOCTX_FLAG_USERIOCB)) {
+		ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
+		if (unlikely(ret)) {
+			pr_debug("EFAULT: aio_key\n");
+			goto out_put_req;
+		}
 	}
 
 	req->ki_user_iocb = user_iocb;
@@ -1908,12 +1983,22 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 bool compat)
 {
-	struct iocb iocb;
+	struct iocb iocb, *iocbp;
 
-	if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
-		return -EFAULT;
+	if (ctx->flags & IOCTX_FLAG_USERIOCB) {
+		unsigned long iocb_index = (unsigned long) user_iocb;
+
+		if (iocb_index >= ctx->max_reqs)
+			return -EINVAL;
+
+		iocbp = aio_iocb_from_index(ctx, iocb_index);
+	} else {
+		if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
+			return -EFAULT;
+		iocbp = &iocb;
+	}
 
-	return __io_submit_one(ctx, &iocb, user_iocb, compat);
+	return __io_submit_one(ctx, iocbp, user_iocb, compat);
 }
 
 /*
@@ -2063,6 +2148,9 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 	if (unlikely(!ctx))
 		return -EINVAL;
 
+	if (ctx->flags & IOCTX_FLAG_USERIOCB)
+		goto err;
+
 	spin_lock_irq(&ctx->ctx_lock);
 	kiocb = lookup_kiocb(ctx, iocb);
 	if (kiocb) {
@@ -2079,9 +2167,8 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 		 */
 		ret = -EINPROGRESS;
 	}
-
+err:
 	percpu_ref_put(&ctx->users);
-
 	return ret;
 }
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 8387e0af0f76..814e6606c413 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -106,6 +106,8 @@ struct iocb {
 	__u32	aio_resfd;
 }; /* 64 bytes */
 
+#define IOCTX_FLAG_USERIOCB	(1 << 0)	/* iocbs are user mapped */
+
 #undef IFBIG
 #undef IFLITTLE
 
-- 
2.17.1

* [PATCH 17/27] aio: support for IO polling
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (15 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 16/27] aio: add support for having user mapped iocbs Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 18/27] aio: add submission side request cache Jens Axboe
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Add polled variants of PREAD/PREADV and PWRITE/PWRITEV. These act
like their non-polled counterparts, except we expect to poll for
completion of them. The polling happens at io_getevents() time, and
works just like non-polled IO.

To setup an io_context for polled IO, the application must call
io_setup2() with IOCTX_FLAG_IOPOLL as one of the flags. It is illegal
to mix and match polled and non-polled IO on an io_context.

Polled IO doesn't support the user mapped completion ring. Events
must be reaped through the io_getevents() system call. For non-irq
driven poll devices, there's no way to support completion reaping
from userspace by just looking at the ring. The application itself
is the one that pulls completion entries.
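
A rough sketch of a userspace reap loop for a polled context follows
(illustrative only; it assumes a context set up with IOCTX_FLAG_IOPOLL
via io_setup2() and plain io_getevents() for reaping):

static int poll_reap(aio_context_t ctx, int nr_inflight)
{
	struct io_event events[32];
	int done = 0;

	while (done < nr_inflight) {
		/* with IOCTX_FLAG_IOPOLL, this actively polls the device(s) */
		int ret = syscall(__NR_io_getevents, ctx, 1, 32, events, NULL);

		if (ret < 0)
			return ret;
		done += ret;
	}

	return done;
}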

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c                     | 381 +++++++++++++++++++++++++++++++----
 include/uapi/linux/aio_abi.h |   3 +
 2 files changed, 348 insertions(+), 36 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 380e6fe8c429..f7a49abc7694 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -143,6 +143,18 @@ struct kioctx {
 		atomic_t	reqs_available;
 	} ____cacheline_aligned_in_smp;
 
+	/* iopoll submission state */
+	struct {
+		spinlock_t poll_lock;
+		struct list_head poll_submitted;
+	} ____cacheline_aligned_in_smp;
+
+	/* iopoll completion state */
+	struct {
+		struct list_head poll_completing;
+		struct mutex getevents_lock;
+	} ____cacheline_aligned_in_smp;
+
 	struct {
 		spinlock_t	ctx_lock;
 		struct list_head active_reqs;	/* used for cancellation */
@@ -195,14 +207,27 @@ struct aio_kiocb {
 	__u64			ki_user_data;	/* user's data for completion */
 
 	struct list_head	ki_list;	/* the aio core uses this
-						 * for cancellation */
+						 * for cancellation, or for
+						 * polled IO */
+
+	unsigned long		ki_flags;
+#define IOCB_POLL_COMPLETED	0
+#define IOCB_POLL_BUSY		1
+
 	refcount_t		ki_refcnt;
 
-	/*
-	 * If the aio_resfd field of the userspace iocb is not zero,
-	 * this is the underlying eventfd context to deliver events to.
-	 */
-	struct eventfd_ctx	*ki_eventfd;
+	union {
+		/*
+		 * If the aio_resfd field of the userspace iocb is not zero,
+		 * this is the underlying eventfd context to deliver events to.
+		 */
+		struct eventfd_ctx	*ki_eventfd;
+
+		/*
+		 * For polled IO, stash completion info here
+		 */
+		struct io_event		ki_ev;
+	};
 };
 
 /*------ sysctl variables----*/
@@ -223,6 +248,7 @@ static const unsigned int iocb_page_shift =
 				ilog2(PAGE_SIZE / sizeof(struct iocb));
 
 static void aio_useriocb_free(struct kioctx *);
+static void aio_iopoll_reap_events(struct kioctx *);
 
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
 {
@@ -461,11 +487,15 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	int i;
 	struct file *file;
 
-	/* Compensate for the ring buffer's head/tail overlap entry */
-	nr_events += 2;	/* 1 is required, 2 for good luck */
-
+	/*
+	 * Compensate for the ring buffer's head/tail overlap entry.
+	 * IO polling doesn't require any io event entries
+	 */
 	size = sizeof(struct aio_ring);
-	size += sizeof(struct io_event) * nr_events;
+	if (!(ctx->flags & IOCTX_FLAG_IOPOLL)) {
+		nr_events += 2;	/* 1 is required, 2 for good luck */
+		size += sizeof(struct io_event) * nr_events;
+	}
 
 	nr_pages = PFN_UP(size);
 	if (nr_pages < 0)
@@ -747,6 +777,11 @@ static struct kioctx *io_setup_flags(unsigned long ctxid,
 
 	INIT_LIST_HEAD(&ctx->active_reqs);
 
+	spin_lock_init(&ctx->poll_lock);
+	INIT_LIST_HEAD(&ctx->poll_submitted);
+	INIT_LIST_HEAD(&ctx->poll_completing);
+	mutex_init(&ctx->getevents_lock);
+
 	if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL))
 		goto err;
 
@@ -818,11 +853,15 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
 {
 	struct kioctx_table *table;
 
+	mutex_lock(&ctx->getevents_lock);
 	spin_lock(&mm->ioctx_lock);
 	if (atomic_xchg(&ctx->dead, 1)) {
 		spin_unlock(&mm->ioctx_lock);
+		mutex_unlock(&ctx->getevents_lock);
 		return -EINVAL;
 	}
+	aio_iopoll_reap_events(ctx);
+	mutex_unlock(&ctx->getevents_lock);
 
 	table = rcu_dereference_raw(mm->ioctx_table);
 	WARN_ON(ctx != rcu_access_pointer(table->table[ctx->id]));
@@ -1029,6 +1068,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 		percpu_ref_get(&ctx->reqs);
 		req->ki_ctx = ctx;
 		INIT_LIST_HEAD(&req->ki_list);
+		req->ki_flags = 0;
 		refcount_set(&req->ki_refcnt, 0);
 		req->ki_eventfd = NULL;
 	}
@@ -1072,6 +1112,15 @@ static inline void iocb_put(struct aio_kiocb *iocb)
 	}
 }
 
+static void iocb_put_many(struct kioctx *ctx, void **iocbs, int *nr)
+{
+	if (*nr) {
+		percpu_ref_put_many(&ctx->reqs, *nr);
+		kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs);
+		*nr = 0;
+	}
+}
+
 static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb,
 			   long res, long res2)
 {
@@ -1261,6 +1310,170 @@ static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
 	return ret < 0 || *i >= min_nr;
 }
 
+#define AIO_IOPOLL_BATCH	8
+
+/*
+ * Process completed iocb iopoll entries, copying the result to userspace.
+ */
+static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs,
+			    unsigned int *nr_events, long max)
+{
+	void *iocbs[AIO_IOPOLL_BATCH];
+	struct aio_kiocb *iocb, *n;
+	int to_free = 0, ret = 0;
+
+	/* Shouldn't happen... */
+	if (*nr_events >= max)
+		return 0;
+
+	list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) {
+		if (*nr_events == max)
+			break;
+		if (!test_bit(IOCB_POLL_COMPLETED, &iocb->ki_flags))
+			continue;
+		if (to_free == AIO_IOPOLL_BATCH)
+			iocb_put_many(ctx, iocbs, &to_free);
+
+		list_del(&iocb->ki_list);
+		iocbs[to_free++] = iocb;
+
+		fput(iocb->rw.ki_filp);
+
+		if (evs && copy_to_user(evs + *nr_events, &iocb->ki_ev,
+		    sizeof(iocb->ki_ev))) {
+			ret = -EFAULT;
+			break;
+		}
+		(*nr_events)++;
+	}
+
+	if (to_free)
+		iocb_put_many(ctx, iocbs, &to_free);
+
+	return ret;
+}
+
+static int __aio_iopoll_check(struct kioctx *ctx, struct io_event __user *event,
+			      unsigned int *nr_events, long min, long max)
+{
+	struct aio_kiocb *iocb;
+	int to_poll, polled, ret;
+
+	/*
+	 * Check if we already have done events that satisfy what we need
+	 */
+	if (!list_empty(&ctx->poll_completing)) {
+		ret = aio_iopoll_reap(ctx, event, nr_events, max);
+		if (ret < 0)
+			return ret;
+		if (*nr_events >= min)
+			return 0;
+	}
+
+	/*
+	 * Take in a new working set from the submitted list, if possible.
+	 */
+	if (!list_empty_careful(&ctx->poll_submitted)) {
+		spin_lock(&ctx->poll_lock);
+		list_splice_init(&ctx->poll_submitted, &ctx->poll_completing);
+		spin_unlock(&ctx->poll_lock);
+	}
+
+	if (list_empty(&ctx->poll_completing))
+		return 0;
+
+	/*
+	 * Check again now that we have a new batch.
+	 */
+	ret = aio_iopoll_reap(ctx, event, nr_events, max);
+	if (ret < 0)
+		return ret;
+	if (*nr_events >= min)
+		return 0;
+
+	/*
+	 * Find up to 'max' worth of events to poll for, including the
+	 * events we already successfully polled
+	 */
+	polled = to_poll = 0;
+	list_for_each_entry(iocb, &ctx->poll_completing, ki_list) {
+		/*
+		 * Poll for needed events with spin == true, anything after
+		 * that we just check if we have more, up to max.
+		 */
+		bool spin = polled + *nr_events >= min;
+		struct kiocb *kiocb = &iocb->rw;
+
+		if (test_bit(IOCB_POLL_COMPLETED, &iocb->ki_flags))
+			break;
+		if (++to_poll + *nr_events > max)
+			break;
+
+		ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin);
+		if (ret < 0)
+			return ret;
+
+		polled += ret;
+		if (polled + *nr_events >= max)
+			break;
+	}
+
+	ret = aio_iopoll_reap(ctx, event, nr_events, max);
+	if (ret < 0)
+		return ret;
+	if (*nr_events >= min)
+		return 0;
+	return to_poll;
+}
+
+/*
+ * We can't just wait for polled events to come to us, we have to actively
+ * find and complete them.
+ */
+static void aio_iopoll_reap_events(struct kioctx *ctx)
+{
+	if (!(ctx->flags & IOCTX_FLAG_IOPOLL))
+		return;
+
+	while (!list_empty_careful(&ctx->poll_submitted) ||
+	       !list_empty(&ctx->poll_completing)) {
+		unsigned int nr_events = 0;
+
+		__aio_iopoll_check(ctx, NULL, &nr_events, 1, UINT_MAX);
+	}
+}
+
+static int aio_iopoll_check(struct kioctx *ctx, long min_nr, long nr,
+			    struct io_event __user *event)
+{
+	unsigned int nr_events = 0;
+	int ret = 0;
+
+	/* Only allow one thread polling at a time */
+	if (!mutex_trylock(&ctx->getevents_lock))
+		return -EBUSY;
+	if (unlikely(atomic_read(&ctx->dead))) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	while (!nr_events || !need_resched()) {
+		int tmin = 0;
+
+		if (nr_events < min_nr)
+			tmin = min_nr - nr_events;
+
+		ret = __aio_iopoll_check(ctx, event, &nr_events, tmin, nr);
+		if (ret <= 0)
+			break;
+		ret = 0;
+	}
+
+err:
+	mutex_unlock(&ctx->getevents_lock);
+	return nr_events ? nr_events : ret;
+}
+
 static long read_events(struct kioctx *ctx, long min_nr, long nr,
 			struct io_event __user *event,
 			ktime_t until)
@@ -1352,7 +1565,7 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
 	unsigned long ctx;
 	long ret;
 
-	if (flags & ~IOCTX_FLAG_USERIOCB)
+	if (flags & ~(IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL))
 		return -EINVAL;
 
 	ret = get_user(ctx, ctxp);
@@ -1485,13 +1698,8 @@ static void aio_remove_iocb(struct aio_kiocb *iocb)
 	spin_unlock_irqrestore(&ctx->ctx_lock, flags);
 }
 
-static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
+static void kiocb_end_write(struct kiocb *kiocb)
 {
-	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, rw);
-
-	if (!list_empty_careful(&iocb->ki_list))
-		aio_remove_iocb(iocb);
-
 	if (kiocb->ki_flags & IOCB_WRITE) {
 		struct inode *inode = file_inode(kiocb->ki_filp);
 
@@ -1503,19 +1711,48 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
 			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
 		file_end_write(kiocb->ki_filp);
 	}
+}
+
+static void aio_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, rw);
+
+	if (!list_empty_careful(&iocb->ki_list))
+		aio_remove_iocb(iocb);
+
+	kiocb_end_write(kiocb);
 
 	fput(kiocb->ki_filp);
 	aio_complete(iocb, res, res2);
 }
 
-static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
+static void aio_complete_rw_poll(struct kiocb *kiocb, long res, long res2)
 {
+	struct aio_kiocb *iocb = container_of(kiocb, struct aio_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	/*
+	 * Handle EAGAIN from resource limits with polled IO inline, don't
+	 * pass the event back to userspace.
+	 */
+	if (unlikely(res == -EAGAIN))
+		set_bit(IOCB_POLL_BUSY, &iocb->ki_flags);
+	else {
+		aio_fill_event(&iocb->ki_ev, iocb, res, res2);
+		set_bit(IOCB_POLL_COMPLETED, &iocb->ki_flags);
+	}
+}
+
+static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb)
+{
+	struct kioctx *ctx = kiocb->ki_ctx;
+	struct kiocb *req = &kiocb->rw;
 	int ret;
 
 	req->ki_filp = fget(iocb->aio_fildes);
 	if (unlikely(!req->ki_filp))
 		return -EBADF;
-	req->ki_complete = aio_complete_rw;
 	req->ki_pos = iocb->aio_offset;
 	req->ki_flags = iocb_flags(req->ki_filp);
 	if (iocb->aio_flags & IOCB_FLAG_RESFD)
@@ -1541,9 +1778,35 @@ static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb)
 	if (unlikely(ret))
 		goto out_fput;
 
-	req->ki_flags &= ~IOCB_HIPRI; /* no one is going to poll for this I/O */
-	return 0;
+	if (iocb->aio_flags & IOCB_FLAG_HIPRI) {
+		/* shares space in the union, and is rather pointless.. */
+		ret = -EINVAL;
+		if (iocb->aio_flags & IOCB_FLAG_RESFD)
+			goto out_fput;
+
+		/* can't submit polled IO to a non-polled ctx */
+		if (!(ctx->flags & IOCTX_FLAG_IOPOLL))
+			goto out_fput;
+
+		ret = -EOPNOTSUPP;
+		if (!(req->ki_flags & IOCB_DIRECT) ||
+		    !req->ki_filp->f_op->iopoll)
+			goto out_fput;
+
+		req->ki_flags |= IOCB_HIPRI;
+		req->ki_complete = aio_complete_rw_poll;
+	} else {
+		/* can't submit non-polled IO to a polled ctx */
+		ret = -EINVAL;
+		if (ctx->flags & IOCTX_FLAG_IOPOLL)
+			goto out_fput;
+
+		/* no one is going to poll for this I/O */
+		req->ki_flags &= ~IOCB_HIPRI;
+		req->ki_complete = aio_complete_rw;
+	}
 
+	return 0;
 out_fput:
 	fput(req->ki_filp);
 	return ret;
@@ -1588,15 +1851,40 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
 	}
 }
 
-static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
+/*
+ * After the iocb has been issued, it's safe to be found on the poll list.
+ * Adding the kiocb to the list AFTER submission ensures that we don't
+ * find it from a io_getevents() thread before the issuer is done accessing
+ * the kiocb cookie.
+ */
+static void aio_iopoll_iocb_issued(struct aio_kiocb *kiocb)
+{
+	/*
+	 * For fast devices, IO may have already completed. If it has, add
+	 * it to the front so we find it first. We can't add to the poll_done
+	 * list as that's unlocked from the completion side.
+	 */
+	const int front_add = test_bit(IOCB_POLL_COMPLETED, &kiocb->ki_flags);
+	struct kioctx *ctx = kiocb->ki_ctx;
+
+	spin_lock(&ctx->poll_lock);
+	if (front_add)
+		list_add(&kiocb->ki_list, &ctx->poll_submitted);
+	else
+		list_add_tail(&kiocb->ki_list, &ctx->poll_submitted);
+	spin_unlock(&ctx->poll_lock);
+}
+
+static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
 			bool vectored, bool compat)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *req = &kiocb->rw;
 	struct iov_iter iter;
 	struct file *file;
 	ssize_t ret;
 
-	ret = aio_prep_rw(req, iocb);
+	ret = aio_prep_rw(kiocb, iocb);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -1621,15 +1909,16 @@ static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb,
 	return ret;
 }
 
-static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb,
+static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
 			 bool vectored, bool compat)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *req = &kiocb->rw;
 	struct iov_iter iter;
 	struct file *file;
 	ssize_t ret;
 
-	ret = aio_prep_rw(req, iocb);
+	ret = aio_prep_rw(kiocb, iocb);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -1900,7 +2189,8 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 		return -EINVAL;
 	}
 
-	if (!get_reqs_available(ctx))
+	/* Poll IO doesn't need ring reservations */
+	if (!(ctx->flags & IOCTX_FLAG_IOPOLL) && !get_reqs_available(ctx))
 		return -EAGAIN;
 
 	ret = -EAGAIN;
@@ -1923,8 +2213,8 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 		}
 	}
 
-	/* Don't support cancel on user mapped iocbs */
-	if (!(ctx->flags & IOCTX_FLAG_USERIOCB)) {
+	/* Don't support cancel on user mapped iocbs or polled context */
+	if (!(ctx->flags & (IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL))) {
 		ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
 		if (unlikely(ret)) {
 			pr_debug("EFAULT: aio_key\n");
@@ -1935,26 +2225,33 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
+	ret = -EINVAL;
 	switch (iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
-		ret = aio_read(&req->rw, iocb, false, compat);
+		ret = aio_read(req, iocb, false, compat);
 		break;
 	case IOCB_CMD_PWRITE:
-		ret = aio_write(&req->rw, iocb, false, compat);
+		ret = aio_write(req, iocb, false, compat);
 		break;
 	case IOCB_CMD_PREADV:
-		ret = aio_read(&req->rw, iocb, true, compat);
+		ret = aio_read(req, iocb, true, compat);
 		break;
 	case IOCB_CMD_PWRITEV:
-		ret = aio_write(&req->rw, iocb, true, compat);
+		ret = aio_write(req, iocb, true, compat);
 		break;
 	case IOCB_CMD_FSYNC:
+		if (ctx->flags & IOCTX_FLAG_IOPOLL)
+			break;
 		ret = aio_fsync(&req->fsync, iocb, false);
 		break;
 	case IOCB_CMD_FDSYNC:
+		if (ctx->flags & IOCTX_FLAG_IOPOLL)
+			break;
 		ret = aio_fsync(&req->fsync, iocb, true);
 		break;
 	case IOCB_CMD_POLL:
+		if (ctx->flags & IOCTX_FLAG_IOPOLL)
+			break;
 		ret = aio_poll(req, iocb);
 		break;
 	default:
@@ -1970,13 +2267,21 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 	 */
 	if (ret)
 		goto out_put_req;
+	if (ctx->flags & IOCTX_FLAG_IOPOLL) {
+		if (test_bit(IOCB_POLL_BUSY, &req->ki_flags)) {
+			ret = -EAGAIN;
+			goto out_put_req;
+		}
+		aio_iopoll_iocb_issued(req);
+	}
 	return 0;
 out_put_req:
 	if (req->ki_eventfd)
 		eventfd_ctx_put(req->ki_eventfd);
 	iocb_put(req);
 out_put_reqs_available:
-	put_reqs_available(ctx, 1);
+	if (!(ctx->flags & IOCTX_FLAG_IOPOLL))
+		put_reqs_available(ctx, 1);
 	return ret;
 }
 
@@ -2148,7 +2453,7 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
 	if (unlikely(!ctx))
 		return -EINVAL;
 
-	if (ctx->flags & IOCTX_FLAG_USERIOCB)
+	if (ctx->flags & (IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL))
 		goto err;
 
 	spin_lock_irq(&ctx->ctx_lock);
@@ -2183,8 +2488,12 @@ static long do_io_getevents(aio_context_t ctx_id,
 	long ret = -EINVAL;
 
 	if (likely(ioctx)) {
-		if (likely(min_nr <= nr && min_nr >= 0))
-			ret = read_events(ioctx, min_nr, nr, events, until);
+		if (likely(min_nr <= nr && min_nr >= 0)) {
+			if (ioctx->flags & IOCTX_FLAG_IOPOLL)
+				ret = aio_iopoll_check(ioctx, min_nr, nr, events);
+			else
+				ret = read_events(ioctx, min_nr, nr, events, until);
+		}
 		percpu_ref_put(&ioctx->users);
 	}
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 814e6606c413..ea0b9a19f4df 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -52,9 +52,11 @@ enum {
  *                   is valid.
  * IOCB_FLAG_IOPRIO - Set if the "aio_reqprio" member of the "struct iocb"
  *                    is valid.
+ * IOCB_FLAG_HIPRI - Use IO completion polling
  */
 #define IOCB_FLAG_RESFD		(1 << 0)
 #define IOCB_FLAG_IOPRIO	(1 << 1)
+#define IOCB_FLAG_HIPRI		(1 << 2)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
@@ -107,6 +109,7 @@ struct iocb {
 }; /* 64 bytes */
 
 #define IOCTX_FLAG_USERIOCB	(1 << 0)	/* iocbs are user mapped */
+#define IOCTX_FLAG_IOPOLL	(1 << 1)	/* io_context is polled */
 
 #undef IFBIG
 #undef IFLITTLE
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 18/27] aio: add submission side request cache
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (16 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 17/27] aio: support for IO polling Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 19/27] fs: add fget_many() and fput_many() Jens Axboe
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

We have to add each submitted polled request to the io_context
poll_submitted list, which means grabbing the poll_lock for every
request. We already use the block plug to batch IO submissions; extend
that to cover the poll requests internally as well, so the lock is
only taken once per batch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 136 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 113 insertions(+), 23 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f7a49abc7694..182e2fc6ec82 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -230,6 +230,21 @@ struct aio_kiocb {
 	};
 };
 
+struct aio_submit_state {
+	struct kioctx *ctx;
+
+	struct blk_plug plug;
+#ifdef CONFIG_BLOCK
+	struct blk_plug_cb plug_cb;
+#endif
+
+	/*
+	 * Polled iocbs that have been submitted, but not added to the ctx yet
+	 */
+	struct list_head req_list;
+	unsigned int req_count;
+};
+
 /*------ sysctl variables----*/
 static DEFINE_SPINLOCK(aio_nr_lock);
 unsigned long aio_nr;		/* current system wide number of aio requests */
@@ -247,6 +262,15 @@ static const struct address_space_operations aio_ctx_aops;
 static const unsigned int iocb_page_shift =
 				ilog2(PAGE_SIZE / sizeof(struct iocb));
 
+/*
+ * We rely on block level unplugs to flush pending requests, if we schedule
+ */
+#ifdef CONFIG_BLOCK
+static const bool aio_use_state_req_list = true;
+#else
+static const bool aio_use_state_req_list = false;
+#endif
+
 static void aio_useriocb_free(struct kioctx *);
 static void aio_iopoll_reap_events(struct kioctx *);
 
@@ -1851,13 +1875,28 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
 	}
 }
 
+/*
+ * Called either at the end of IO submission, or through a plug callback
+ * because we're going to schedule. Moves out local batch of requests to
+ * the ctx poll list, so they can be found for polling + reaping.
+ */
+static void aio_flush_state_reqs(struct kioctx *ctx,
+				 struct aio_submit_state *state)
+{
+	spin_lock(&ctx->poll_lock);
+	list_splice_tail_init(&state->req_list, &ctx->poll_submitted);
+	spin_unlock(&ctx->poll_lock);
+	state->req_count = 0;
+}
+
 /*
  * After the iocb has been issued, it's safe to be found on the poll list.
  * Adding the kiocb to the list AFTER submission ensures that we don't
  * find it from a io_getevents() thread before the issuer is done accessing
  * the kiocb cookie.
  */
-static void aio_iopoll_iocb_issued(struct aio_kiocb *kiocb)
+static void aio_iopoll_iocb_issued(struct aio_submit_state *state,
+				   struct aio_kiocb *kiocb)
 {
 	/*
 	 * For fast devices, IO may have already completed. If it has, add
@@ -1867,12 +1906,21 @@ static void aio_iopoll_iocb_issued(struct aio_kiocb *kiocb)
 	const int front_add = test_bit(IOCB_POLL_COMPLETED, &kiocb->ki_flags);
 	struct kioctx *ctx = kiocb->ki_ctx;
 
-	spin_lock(&ctx->poll_lock);
-	if (front_add)
-		list_add(&kiocb->ki_list, &ctx->poll_submitted);
-	else
-		list_add_tail(&kiocb->ki_list, &ctx->poll_submitted);
-	spin_unlock(&ctx->poll_lock);
+	if (!state || !aio_use_state_req_list) {
+		spin_lock(&ctx->poll_lock);
+		if (front_add)
+			list_add(&kiocb->ki_list, &ctx->poll_submitted);
+		else
+			list_add_tail(&kiocb->ki_list, &ctx->poll_submitted);
+		spin_unlock(&ctx->poll_lock);
+	} else {
+		if (front_add)
+			list_add(&kiocb->ki_list, &state->req_list);
+		else
+			list_add_tail(&kiocb->ki_list, &state->req_list);
+		if (++state->req_count >= AIO_IOPOLL_BATCH)
+			aio_flush_state_reqs(ctx, state);
+	}
 }
 
 static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
@@ -2168,7 +2216,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 }
 
 static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
-			   struct iocb __user *user_iocb, bool compat)
+			   struct iocb __user *user_iocb,
+			   struct aio_submit_state *state, bool compat)
 {
 	struct aio_kiocb *req;
 	ssize_t ret;
@@ -2272,7 +2321,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 			ret = -EAGAIN;
 			goto out_put_req;
 		}
-		aio_iopoll_iocb_issued(req);
+		aio_iopoll_iocb_issued(state, req);
 	}
 	return 0;
 out_put_req:
@@ -2286,7 +2335,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 }
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
-			 bool compat)
+			 struct aio_submit_state *state, bool compat)
 {
 	struct iocb iocb, *iocbp;
 
@@ -2303,7 +2352,44 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		iocbp = &iocb;
 	}
 
-	return __io_submit_one(ctx, iocbp, user_iocb, compat);
+	return __io_submit_one(ctx, iocbp, user_iocb, state, compat);
+}
+
+#ifdef CONFIG_BLOCK
+static void aio_state_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+	struct aio_submit_state *state;
+
+	state = container_of(cb, struct aio_submit_state, plug_cb);
+	if (!list_empty(&state->req_list))
+		aio_flush_state_reqs(state->ctx, state);
+}
+#endif
+
+/*
+ * Batched submission is done, ensure local IO is flushed out.
+ */
+static void aio_submit_state_end(struct aio_submit_state *state)
+{
+	blk_finish_plug(&state->plug);
+	if (!list_empty(&state->req_list))
+		aio_flush_state_reqs(state->ctx, state);
+}
+
+/*
+ * Start submission side cache.
+ */
+static void aio_submit_state_start(struct aio_submit_state *state,
+				   struct kioctx *ctx)
+{
+	state->ctx = ctx;
+	INIT_LIST_HEAD(&state->req_list);
+	state->req_count = 0;
+#ifdef CONFIG_BLOCK
+	state->plug_cb.callback = aio_state_unplug;
+	blk_start_plug(&state->plug);
+	list_add(&state->plug_cb.list, &state->plug.cb_list);
+#endif
 }
 
 /*
@@ -2327,10 +2413,10 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 		struct iocb __user * __user *, iocbpp)
 {
+	struct aio_submit_state state, *statep = NULL;
 	struct kioctx *ctx;
 	long ret = 0;
 	int i = 0;
-	struct blk_plug plug;
 
 	if (unlikely(nr < 0))
 		return -EINVAL;
@@ -2344,8 +2430,10 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 	if (nr > ctx->nr_events)
 		nr = ctx->nr_events;
 
-	if (nr > AIO_PLUG_THRESHOLD)
-		blk_start_plug(&plug);
+	if (nr > AIO_PLUG_THRESHOLD) {
+		aio_submit_state_start(&state, ctx);
+		statep = &state;
+	}
 	for (i = 0; i < nr; i++) {
 		struct iocb __user *user_iocb;
 
@@ -2354,12 +2442,12 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 			break;
 		}
 
-		ret = io_submit_one(ctx, user_iocb, false);
+		ret = io_submit_one(ctx, user_iocb, statep, false);
 		if (ret)
 			break;
 	}
-	if (nr > AIO_PLUG_THRESHOLD)
-		blk_finish_plug(&plug);
+	if (statep)
+		aio_submit_state_end(statep);
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
@@ -2369,10 +2457,10 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 		       int, nr, compat_uptr_t __user *, iocbpp)
 {
+	struct aio_submit_state state, *statep = NULL;
 	struct kioctx *ctx;
 	long ret = 0;
 	int i = 0;
-	struct blk_plug plug;
 
 	if (unlikely(nr < 0))
 		return -EINVAL;
@@ -2386,8 +2474,10 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 	if (nr > ctx->nr_events)
 		nr = ctx->nr_events;
 
-	if (nr > AIO_PLUG_THRESHOLD)
-		blk_start_plug(&plug);
+	if (nr > AIO_PLUG_THRESHOLD) {
+		aio_submit_state_start(&state, ctx);
+		statep = &state;
+	}
 	for (i = 0; i < nr; i++) {
 		compat_uptr_t user_iocb;
 
@@ -2396,12 +2486,12 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 			break;
 		}
 
-		ret = io_submit_one(ctx, compat_ptr(user_iocb), true);
+		ret = io_submit_one(ctx, compat_ptr(user_iocb), statep, true);
 		if (ret)
 			break;
 	}
-	if (nr > AIO_PLUG_THRESHOLD)
-		blk_finish_plug(&plug);
+	if (statep)
+		aio_submit_state_end(statep);
 
 	percpu_ref_put(&ctx->users);
 	return i ? i : ret;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 19/27] fs: add fget_many() and fput_many()
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (17 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 18/27] aio: add submission side request cache Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 20/27] aio: use fget/fput_many() for file references Jens Axboe
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Some use cases repeatedly get and put references to the same file, but
the only exposed interface does this one at a time. As each of
these entails an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.
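
A minimal, hypothetical kernel-side sketch of the intended usage --
take one reference per queued IO up front, then hand back whatever
went unused in a single call:

	/*
	 * Illustration only, not part of this patch.  Each successfully
	 * submitted IO consumes one of the references and drops it itself
	 * at completion time via fput().
	 */
	static int submit_batch(unsigned int fd, unsigned int nr)
	{
		struct file *file = fget_many(fd, nr);
		unsigned int used = 0;

		if (!file)
			return -EBADF;

		/* ... submit IOs against 'file', bumping 'used' ... */

		if (used < nr)
			fput_many(file, nr - used);
		return 0;
	}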

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/file.c            | 15 ++++++++++-----
 fs/file_table.c      | 10 ++++++++--
 include/linux/file.h |  2 ++
 include/linux/fs.h   |  3 ++-
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 7ffd6e9d103d..ad9870edfd51 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -676,7 +676,7 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
@@ -691,7 +691,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 		 */
 		if (file->f_mode & mask)
 			file = NULL;
-		else if (!get_file_rcu(file))
+		else if (!get_file_rcu_many(file, refs))
 			goto loop;
 	}
 	rcu_read_unlock();
@@ -699,15 +699,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 	return file;
 }
 
+struct file *fget_many(unsigned int fd, unsigned int refs)
+{
+	return __fget(fd, FMODE_PATH, refs);
+}
+
 struct file *fget(unsigned int fd)
 {
-	return __fget(fd, FMODE_PATH);
+	return fget_many(fd, 1);
 }
 EXPORT_SYMBOL(fget);
 
 struct file *fget_raw(unsigned int fd)
 {
-	return __fget(fd, 0);
+	return __fget(fd, 0, 1);
 }
 EXPORT_SYMBOL(fget_raw);
 
@@ -738,7 +743,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget(fd, mask, 1);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/fs/file_table.c b/fs/file_table.c
index e49af4caf15d..6a3964df33e4 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -326,9 +326,9 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
-void fput(struct file *file)
+void fput_many(struct file *file, unsigned int refs)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (atomic_long_sub_and_test(refs, &file->f_count)) {
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@@ -347,6 +347,12 @@ void fput(struct file *file)
 	}
 }
 
+void fput(struct file *file)
+{
+	fput_many(file, 1);
+}
+
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
diff --git a/include/linux/file.h b/include/linux/file.h
index 6b2fb032416c..3fcddff56bc4 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -13,6 +13,7 @@
 struct file;
 
 extern void fput(struct file *);
+extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
 struct vfsmount;
@@ -44,6 +45,7 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
+extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6a5f71f8ae06..dc54a65c401a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -952,7 +952,8 @@ static inline struct file *get_file(struct file *f)
 	atomic_long_inc(&f->f_count);
 	return f;
 }
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+#define get_file_rcu_many(x, cnt) atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+#define get_file_rcu(x) get_file_rcu_many((x), 1)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
 #define file_count(x)	atomic_long_read(&(x)->f_count)
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 20/27] aio: use fget/fput_many() for file references
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (18 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 19/27] fs: add fget_many() and fput_many() Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 21/27] aio: split iocb init from allocation Jens Axboe
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

On the submission side, add file reference batching to the
aio_submit_state. We get as many references as the number of iocbs we
are submitting, and drop unused ones if we end up switching files. The
assumption here is that we're usually only dealing with one fd, and if
there are multiple, hopefully they are at least somewhat ordered. This
could trivially be extended to cover multiple fds, if needed.

On the completion side we do the same thing, except this is trivially
done just locally in aio_iopoll_reap().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 106 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 91 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 182e2fc6ec82..291bbc62b2a8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -243,6 +243,15 @@ struct aio_submit_state {
 	 */
 	struct list_head req_list;
 	unsigned int req_count;
+
+	/*
+	 * File reference cache
+	 */
+	struct file *file;
+	unsigned int fd;
+	unsigned int has_refs;
+	unsigned int used_refs;
+	unsigned int ios_left;
 };
 
 /*------ sysctl variables----*/
@@ -1344,7 +1353,8 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs,
 {
 	void *iocbs[AIO_IOPOLL_BATCH];
 	struct aio_kiocb *iocb, *n;
-	int to_free = 0, ret = 0;
+	int file_count, to_free = 0, ret = 0;
+	struct file *file = NULL;
 
 	/* Shouldn't happen... */
 	if (*nr_events >= max)
@@ -1361,7 +1371,20 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs,
 		list_del(&iocb->ki_list);
 		iocbs[to_free++] = iocb;
 
-		fput(iocb->rw.ki_filp);
+		/*
+		 * Batched puts of the same file, to avoid dirtying the
+		 * file usage count multiple times, if avoidable.
+		 */
+		if (!file) {
+			file = iocb->rw.ki_filp;
+			file_count = 1;
+		} else if (file == iocb->rw.ki_filp) {
+			file_count++;
+		} else {
+			fput_many(file, file_count);
+			file = iocb->rw.ki_filp;
+			file_count = 1;
+		}
 
 		if (evs && copy_to_user(evs + *nr_events, &iocb->ki_ev,
 		    sizeof(iocb->ki_ev))) {
@@ -1371,6 +1394,9 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs,
 		(*nr_events)++;
 	}
 
+	if (file)
+		fput_many(file, file_count);
+
 	if (to_free)
 		iocb_put_many(ctx, iocbs, &to_free);
 
@@ -1768,13 +1794,58 @@ static void aio_complete_rw_poll(struct kiocb *kiocb, long res, long res2)
 	}
 }
 
-static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb)
+static void aio_file_put(struct aio_submit_state *state)
+{
+	if (state->file) {
+		int diff = state->has_refs - state->used_refs;
+
+		if (diff)
+			fput_many(state->file, diff);
+		state->file = NULL;
+	}
+}
+
+/*
+ * Get as many references to a file as we have IOs left in this submission,
+ * assuming most submissions are for one file, or at least that each file
+ * has more than one submission.
+ */
+static struct file *aio_file_get(struct aio_submit_state *state, int fd)
+{
+	if (!state)
+		return fget(fd);
+
+	if (!state->file) {
+get_file:
+		state->file = fget_many(fd, state->ios_left);
+		if (!state->file)
+			return NULL;
+
+		state->fd = fd;
+		state->has_refs = state->ios_left;
+		state->used_refs = 1;
+		state->ios_left--;
+		return state->file;
+	}
+
+	if (state->fd == fd) {
+		state->used_refs++;
+		state->ios_left--;
+		return state->file;
+	}
+
+	aio_file_put(state);
+	goto get_file;
+}
+
+static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb,
+		       struct aio_submit_state *state)
 {
 	struct kioctx *ctx = kiocb->ki_ctx;
 	struct kiocb *req = &kiocb->rw;
 	int ret;
 
-	req->ki_filp = fget(iocb->aio_fildes);
+	req->ki_filp = aio_file_get(state, iocb->aio_fildes);
 	if (unlikely(!req->ki_filp))
 		return -EBADF;
 	req->ki_pos = iocb->aio_offset;
@@ -1924,7 +1995,8 @@ static void aio_iopoll_iocb_issued(struct aio_submit_state *state,
 }
 
 static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
-			bool vectored, bool compat)
+			struct aio_submit_state *state, bool vectored,
+			bool compat)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *req = &kiocb->rw;
@@ -1932,7 +2004,7 @@ static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
 	struct file *file;
 	ssize_t ret;
 
-	ret = aio_prep_rw(kiocb, iocb);
+	ret = aio_prep_rw(kiocb, iocb, state);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -1958,7 +2030,8 @@ static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
 }
 
 static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
-			 bool vectored, bool compat)
+			 struct aio_submit_state *state, bool vectored,
+			 bool compat)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *req = &kiocb->rw;
@@ -1966,7 +2039,7 @@ static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
 	struct file *file;
 	ssize_t ret;
 
-	ret = aio_prep_rw(kiocb, iocb);
+	ret = aio_prep_rw(kiocb, iocb, state);
 	if (ret)
 		return ret;
 	file = req->ki_filp;
@@ -2277,16 +2350,16 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 	ret = -EINVAL;
 	switch (iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
-		ret = aio_read(req, iocb, false, compat);
+		ret = aio_read(req, iocb, state, false, compat);
 		break;
 	case IOCB_CMD_PWRITE:
-		ret = aio_write(req, iocb, false, compat);
+		ret = aio_write(req, iocb, state, false, compat);
 		break;
 	case IOCB_CMD_PREADV:
-		ret = aio_read(req, iocb, true, compat);
+		ret = aio_read(req, iocb, state, true, compat);
 		break;
 	case IOCB_CMD_PWRITEV:
-		ret = aio_write(req, iocb, true, compat);
+		ret = aio_write(req, iocb, state, true, compat);
 		break;
 	case IOCB_CMD_FSYNC:
 		if (ctx->flags & IOCTX_FLAG_IOPOLL)
@@ -2374,17 +2447,20 @@ static void aio_submit_state_end(struct aio_submit_state *state)
 	blk_finish_plug(&state->plug);
 	if (!list_empty(&state->req_list))
 		aio_flush_state_reqs(state->ctx, state);
+	aio_file_put(state);
 }
 
 /*
  * Start submission side cache.
  */
 static void aio_submit_state_start(struct aio_submit_state *state,
-				   struct kioctx *ctx)
+				   struct kioctx *ctx, int max_ios)
 {
 	state->ctx = ctx;
 	INIT_LIST_HEAD(&state->req_list);
 	state->req_count = 0;
+	state->file = NULL;
+	state->ios_left = max_ios;
 #ifdef CONFIG_BLOCK
 	state->plug_cb.callback = aio_state_unplug;
 	blk_start_plug(&state->plug);
@@ -2431,7 +2507,7 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
 		nr = ctx->nr_events;
 
 	if (nr > AIO_PLUG_THRESHOLD) {
-		aio_submit_state_start(&state, ctx);
+		aio_submit_state_start(&state, ctx, nr);
 		statep = &state;
 	}
 	for (i = 0; i < nr; i++) {
@@ -2475,7 +2551,7 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
 		nr = ctx->nr_events;
 
 	if (nr > AIO_PLUG_THRESHOLD) {
-		aio_submit_state_start(&state, ctx);
+		aio_submit_state_start(&state, ctx, nr);
 		statep = &state;
 	}
 	for (i = 0; i < nr; i++) {
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 21/27] aio: split iocb init from allocation
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (19 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 20/27] aio: use fget/fput_many() for file references Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 22/27] aio: batch aio_kiocb allocation Jens Axboe
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 291bbc62b2a8..341eb1b19319 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1088,6 +1088,16 @@ static bool get_reqs_available(struct kioctx *ctx)
 	return __get_reqs_available(ctx);
 }
 
+static void aio_iocb_init(struct kioctx *ctx, struct aio_kiocb *req)
+{
+	percpu_ref_get(&ctx->reqs);
+	req->ki_ctx = ctx;
+	INIT_LIST_HEAD(&req->ki_list);
+	req->ki_flags = 0;
+	refcount_set(&req->ki_refcnt, 0);
+	req->ki_eventfd = NULL;
+}
+
 /* aio_get_req
  *	Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
@@ -1097,14 +1107,8 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 	struct aio_kiocb *req;
 
 	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
-	if (req) {
-		percpu_ref_get(&ctx->reqs);
-		req->ki_ctx = ctx;
-		INIT_LIST_HEAD(&req->ki_list);
-		req->ki_flags = 0;
-		refcount_set(&req->ki_refcnt, 0);
-		req->ki_eventfd = NULL;
-	}
+	if (req)
+		aio_iocb_init(ctx, req);
 
 	return req;
 }
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 22/27] aio: batch aio_kiocb allocation
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (20 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 21/27] aio: split iocb init from allocation Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 23/27] block: add BIO_HOLD_PAGES flag Jens Axboe
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

Similar to how we use state->ios_left to know how many references
to get to a file, we can use it to allocate the aio_kiocbs we need in
bulk.
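
For reference, kmem_cache_alloc_bulk() returns the number of objects it
actually managed to allocate, which may be fewer than requested (or 0),
so the caller has to work off the returned count.  A tiny hypothetical
sketch of the pattern:

	/*
	 * Illustration only: grab up to 'want' iocbs in one slab call.
	 * Leftovers can later be handed back with kmem_cache_free_bulk().
	 */
	static int alloc_iocb_batch(void **objs, unsigned int want)
	{
		int nr = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL,
						want, objs);

		return nr ? nr : -ENOMEM;
	}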

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 341eb1b19319..426939f1dae9 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -230,6 +230,8 @@ struct aio_kiocb {
 	};
 };
 
+#define AIO_IOPOLL_BATCH	8
+
 struct aio_submit_state {
 	struct kioctx *ctx;
 
@@ -244,6 +246,13 @@ struct aio_submit_state {
 	struct list_head req_list;
 	unsigned int req_count;
 
+	/*
+	 * aio_kiocb alloc cache
+	 */
+	void *iocbs[AIO_IOPOLL_BATCH];
+	unsigned int free_iocbs;
+	unsigned int cur_iocb;
+
 	/*
 	 * File reference cache
 	 */
@@ -1102,11 +1111,32 @@ static void aio_iocb_init(struct kioctx *ctx, struct aio_kiocb *req)
  *	Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
  */
-static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
+static struct aio_kiocb *aio_get_req(struct kioctx *ctx,
+				     struct aio_submit_state *state)
 {
 	struct aio_kiocb *req;
 
-	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
+	if (!state)
+		req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
+	else if (!state->free_iocbs) {
+		size_t size;
+
+		size = min_t(size_t, state->ios_left, ARRAY_SIZE(state->iocbs));
+		size = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, size,
+						state->iocbs);
+		if (size < 0)
+			return ERR_PTR(size);
+		else if (!size)
+			return ERR_PTR(-ENOMEM);
+		state->free_iocbs = size - 1;
+		state->cur_iocb = 1;
+		req = state->iocbs[0];
+	} else {
+		req = state->iocbs[state->cur_iocb];
+		state->free_iocbs--;
+		state->cur_iocb++;
+	}
+
 	if (req)
 		aio_iocb_init(ctx, req);
 
@@ -1347,8 +1377,6 @@ static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
 	return ret < 0 || *i >= min_nr;
 }
 
-#define AIO_IOPOLL_BATCH	8
-
 /*
  * Process completed iocb iopoll entries, copying the result to userspace.
  */
@@ -2320,7 +2348,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 		return -EAGAIN;
 
 	ret = -EAGAIN;
-	req = aio_get_req(ctx);
+	req = aio_get_req(ctx, state);
 	if (unlikely(!req))
 		goto out_put_reqs_available;
 
@@ -2452,6 +2480,9 @@ static void aio_submit_state_end(struct aio_submit_state *state)
 	if (!list_empty(&state->req_list))
 		aio_flush_state_reqs(state->ctx, state);
 	aio_file_put(state);
+	if (state->free_iocbs)
+		kmem_cache_free_bulk(kiocb_cachep, state->free_iocbs,
+					&state->iocbs[state->cur_iocb]);
 }
 
 /*
@@ -2463,6 +2494,7 @@ static void aio_submit_state_start(struct aio_submit_state *state,
 	state->ctx = ctx;
 	INIT_LIST_HEAD(&state->req_list);
 	state->req_count = 0;
+	state->free_iocbs = 0;
 	state->file = NULL;
 	state->ios_left = max_ios;
 #ifdef CONFIG_BLOCK
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 23/27] block: add BIO_HOLD_PAGES flag
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (21 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 22/27] aio: batch aio_kiocb allocation Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio Jens Axboe
                   ` (3 subsequent siblings)
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

For user mapped IO, we do get_user_pages() upfront, and then do a
put_page() on each page at end_io time to release the page reference. In
preparation for having permanently mapped pages, add a BIO_HOLD_PAGES
flag that tells us not to release the pages; the caller will do that.
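
A hypothetical caller-side sketch (not part of this patch, error
handling omitted): a submitter that pinned the pages itself, and keeps
them pinned beyond this one bio, tags the bio so completion doesn't
drop the page references:

	/*
	 * pages[] were pinned by the caller (e.g. via get_user_pages())
	 * and stay pinned after this bio completes, so the caller remains
	 * responsible for the final put_page() calls.
	 */
	static void submit_held_pages(struct bio *bio, struct page **pages,
				      unsigned int nr_pages)
	{
		unsigned int i;

		for (i = 0; i < nr_pages; i++)
			bio_add_page(bio, pages[i], PAGE_SIZE, 0);

		bio_set_flag(bio, BIO_HOLD_PAGES);
		submit_bio(bio);
	}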

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               | 6 ++++--
 include/linux/blk_types.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 03895cc0d74a..ab174bce5436 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1635,7 +1635,8 @@ static void bio_dirty_fn(struct work_struct *work)
 		next = bio->bi_private;
 
 		bio_set_pages_dirty(bio);
-		bio_release_pages(bio);
+		if (!bio_flagged(bio, BIO_HOLD_PAGES))
+			bio_release_pages(bio);
 		bio_put(bio);
 	}
 }
@@ -1651,7 +1652,8 @@ void bio_check_pages_dirty(struct bio *bio)
 			goto defer;
 	}
 
-	bio_release_pages(bio);
+	if (!bio_flagged(bio, BIO_HOLD_PAGES))
+		bio_release_pages(bio);
 	bio_put(bio);
 	return;
 defer:
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index c0ba1a038ff3..78aaf7442688 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -227,6 +227,7 @@ struct bio {
 #define BIO_TRACE_COMPLETION 10	/* bio_endio() should trace the final completion
 				 * of this bio. */
 #define BIO_QUEUE_ENTERED 11	/* can use blk_queue_enter_live() */
+#define BIO_HOLD_PAGES	12	/* don't put O_DIRECT pages */
 
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (22 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 23/27] block: add BIO_HOLD_PAGES flag Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 19:21   ` Al Viro
  2018-11-30 16:56 ` [PATCH 25/27] fs: add support for mapping an ITER_KVEC for O_DIRECT Jens Axboe
                   ` (2 subsequent siblings)
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

For an ITER_KVEC, we can just iterate the iov and add the pages
to the bio directly.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c         | 30 ++++++++++++++++++++++++++++++
 include/linux/bio.h |  1 +
 2 files changed, 31 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index ab174bce5436..7e59ef547ed4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -903,6 +903,36 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
+/**
+ * bio_iov_kvec_add_pages - add pages from an ITER_KVEC to a bio
+ * @bio: bio to add pages to
+ * @iter: iov iterator describing the region to be added
+ *
+ * Iterate pages in the @iter and add them to the bio. We flag the
+ * @bio with BIO_HOLD_PAGES, telling IO completion not to free them.
+ */
+int bio_iov_kvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short orig_vcnt = bio->bi_vcnt;
+	const struct kvec *kv;
+
+	do {
+		struct page *page;
+		size_t size;
+
+		kv = iter->kvec + iter->iov_offset;
+		page = virt_to_page(kv->iov_base);
+		size = bio_add_page(bio, page, kv->iov_len,
+					offset_in_page(kv->iov_base));
+		if (size != kv->iov_len)
+			break;
+		iov_iter_advance(iter, size);
+	} while (iov_iter_count(iter) && !bio_full(bio));
+
+	bio_set_flag(bio, BIO_HOLD_PAGES);
+	return bio->bi_vcnt > orig_vcnt ? 0 : -EINVAL;
+}
+
 static void submit_bio_wait_endio(struct bio *bio)
 {
 	complete(bio->bi_private);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 056fb627edb3..23ae8fb66b1e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -434,6 +434,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 void __bio_add_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
+int bio_iov_kvec_add_pages(struct bio *bio, struct iov_iter *iter);
 struct rq_map_data;
 extern struct bio *bio_map_user_iov(struct request_queue *,
 				    struct iov_iter *, gfp_t);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 25/27] fs: add support for mapping an ITER_KVEC for O_DIRECT
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (23 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 16:56 ` [PATCH 26/27] iov_iter: add import_kvec() Jens Axboe
  2018-11-30 16:56 ` [PATCH 27/27] aio: add support for pre-mapped user IO buffers Jens Axboe
  26 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

This adds support for using a kvec type iter with sync/async O_DIRECT,
both for bdev access and for iomap.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 16 ++++++++++++----
 fs/iomap.c     |  5 ++++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index ebc3d5a0f424..b926f03de55e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -219,7 +219,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 	bio.bi_end_io = blkdev_bio_end_io_simple;
 	bio.bi_ioprio = iocb->ki_ioprio;
 
-	ret = bio_iov_iter_get_pages(&bio, iter);
+	if (iov_iter_is_kvec(iter))
+		ret = bio_iov_kvec_add_pages(&bio, iter);
+	else
+		ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
 		goto out;
 	ret = bio.bi_iter.bi_size;
@@ -326,8 +329,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_HOLD_PAGES))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
@@ -381,7 +385,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
 
-		ret = bio_iov_iter_get_pages(bio, iter);
+		if (iov_iter_is_kvec(iter))
+			ret = bio_iov_kvec_add_pages(bio, iter);
+		else
+			ret = bio_iov_iter_get_pages(bio, iter);
+
 		if (unlikely(ret)) {
 			bio->bi_status = BLK_STS_IOERR;
 			bio_endio(bio);
diff --git a/fs/iomap.c b/fs/iomap.c
index 96d60b9b2bea..72f58d604fab 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1671,7 +1671,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
-		ret = bio_iov_iter_get_pages(bio, &iter);
+		if (iov_iter_is_kvec(&iter))
+			ret = bio_iov_kvec_add_pages(bio, &iter);
+		else
+			ret = bio_iov_iter_get_pages(bio, &iter);
 		if (unlikely(ret)) {
 			bio_put(bio);
 			return copied ? copied : ret;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 26/27] iov_iter: add import_kvec()
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (24 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 25/27] fs: add support for mapping an ITER_KVEC for O_DIRECT Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 19:17   ` Al Viro
  2018-11-30 16:56 ` [PATCH 27/27] aio: add support for pre-mapped user IO buffers Jens Axboe
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

This adds import_kvec(), which explicitly sets up an ITER_KVEC from a
set of kvecs describing already mapped kernel ranges.
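
A hypothetical caller (names invented for illustration) would wrap an
already kernel-mapped buffer in a kvec and import it, after which the
resulting iov_iter is used like any other:

	/*
	 * Illustration only: build an ITER_KVEC over a single kernel
	 * buffer.  'dir' follows the usual iov_iter READ/WRITE convention.
	 */
	static void kbuf_to_iter(void *kbuf, size_t len, int dir,
				 struct kvec *kv, struct iov_iter *iter)
	{
		kv->iov_base = kbuf;
		kv->iov_len = len;
		import_kvec(dir, kv, 1, len, iter);
	}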

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/uio.h |  3 +++
 lib/iov_iter.c      | 35 ++++++++++++++++++++++++++---------
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 55ce99ddb912..bbefdb421f6d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -284,6 +284,9 @@ int compat_import_iovec(int type, const struct compat_iovec __user * uvector,
 int import_single_range(int type, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i);
 
+int import_kvec(int type, const struct kvec *kvecs, unsigned nr_segs,
+		size_t bytes, struct iov_iter *iter);
+
 int iov_iter_for_each_range(struct iov_iter *i, size_t bytes,
 			    int (*f)(struct kvec *vec, void *context),
 			    void *context);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 7ebccb5c1637..a5b6dd691f37 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -431,25 +431,33 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
-void iov_iter_init(struct iov_iter *i, unsigned int direction,
-			const struct iovec *iov, unsigned long nr_segs,
-			size_t count)
+static void iov_iter_init_type(struct iov_iter *i, int type,
+			unsigned int direction, const struct iovec *iov,
+			unsigned long nr_segs, size_t count)
 {
 	WARN_ON(direction & ~(READ | WRITE));
 	direction &= READ | WRITE;
 
-	/* It will get better.  Eventually... */
-	if (uaccess_kernel()) {
-		i->type = ITER_KVEC | direction;
+	i->type = type | direction;
+	if (i->type == ITER_KVEC)
 		i->kvec = (struct kvec *)iov;
-	} else {
-		i->type = ITER_IOVEC | direction;
+	else
 		i->iov = iov;
-	}
+
 	i->nr_segs = nr_segs;
 	i->iov_offset = 0;
 	i->count = count;
 }
+
+void iov_iter_init(struct iov_iter *i, unsigned int direction,
+			const struct iovec *iov, unsigned long nr_segs,
+			size_t count)
+{
+	/* It will get better.  Eventually... */
+	int type = uaccess_kernel() ? ITER_KVEC : ITER_IOVEC;
+
+	iov_iter_init_type(i, type, direction, iov, nr_segs, count);
+}
 EXPORT_SYMBOL(iov_iter_init);
 
 static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t len)
@@ -1582,6 +1590,15 @@ int import_iovec(int type, const struct iovec __user * uvector,
 }
 EXPORT_SYMBOL(import_iovec);
 
+int import_kvec(int type, const struct kvec *kvecs, unsigned nr_segs,
+		size_t bytes, struct iov_iter *iter)
+{
+	const struct iovec *p = (const struct iovec *) kvecs;
+
+	iov_iter_init_type(iter, ITER_KVEC, type, p, nr_segs, bytes);
+	return 0;
+}
+
 #ifdef CONFIG_COMPAT
 #include <linux/compat.h>
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 27/27] aio: add support for pre-mapped user IO buffers
  2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
                   ` (25 preceding siblings ...)
  2018-11-30 16:56 ` [PATCH 26/27] iov_iter: add import_kvec() Jens Axboe
@ 2018-11-30 16:56 ` Jens Axboe
  2018-11-30 21:44   ` Jeff Moyer
  26 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 16:56 UTC (permalink / raw)
  To: linux-block, linux-fsdevel, linux-aio; +Cc: hch, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must set both
IOCTX_FLAG_USERIOCB, to provide iocbs in userspace, and
IOCTX_FLAG_FIXEDBUFS. The latter tells aio that the mapped iocbs
already contain valid destinations and sizes. These buffers can then
be mapped into the kernel for the lifetime of the io_context, as
opposed to just the duration of each single IO.

Only works with non-vectored read/write commands for now, not with
PREADV/PWRITEV.

A limit of 4M is imposed as the largest buffer we currently support.
There's nothing preventing us from going larger, but we need some cap,
and 4M seemed like it would definitely be big enough.

See the fio change for how to utilize this feature:

http://git.kernel.dk/cgit/fio/commit/?id=2041bd343da1c1e955253f62374588718c64f0f3
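
Roughly, and with the usual caveats (headers and raw syscall wrappers
assumed, error handling and cleanup omitted, 'fd' already opened),
setting this up from an application would look like:

	struct iocb *iocbs, *submit_idx;
	aio_context_t ctx = 0;
	struct io_event ev;
	unsigned int i, nr_reqs = 32;
	void *buf;

	/* the mapped iocb array doubles as the fixed buffer description */
	iocbs = calloc(nr_reqs, sizeof(*iocbs));
	for (i = 0; i < nr_reqs; i++) {
		posix_memalign(&buf, 4096, 4096);
		iocbs[i].aio_buf = (unsigned long) buf;	/* fixed buffer */
		iocbs[i].aio_nbytes = 4096;		/* max IO size */
	}

	io_setup2(nr_reqs, IOCTX_FLAG_USERIOCB | IOCTX_FLAG_FIXEDBUFS,
		  iocbs, &ctx);

	/* with USERIOCB, the "pointers" passed to io_submit() are indexes */
	iocbs[0].aio_fildes = fd;
	iocbs[0].aio_lio_opcode = IOCB_CMD_PREAD;
	submit_idx = (struct iocb *) 0UL;	/* iocb index 0 */
	io_submit(ctx, 1, &submit_idx);

	io_getevents(ctx, 1, 1, &ev, NULL);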

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/aio.c                     | 185 +++++++++++++++++++++++++++++++----
 include/uapi/linux/aio_abi.h |   1 +
 2 files changed, 169 insertions(+), 17 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 426939f1dae9..f735967488a5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -42,6 +42,7 @@
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/sizes.h>
 
 #include <asm/kmap_types.h>
 #include <linux/uaccess.h>
@@ -86,6 +87,11 @@ struct ctx_rq_wait {
 	atomic_t count;
 };
 
+struct aio_mapped_ubuf {
+	struct kvec *kvec;
+	unsigned int nr_kvecs;
+};
+
 struct kioctx {
 	struct percpu_ref	users;
 	atomic_t		dead;
@@ -124,6 +130,8 @@ struct kioctx {
 	struct page		**iocb_pages;
 	long			iocb_nr_pages;
 
+	struct aio_mapped_ubuf	*user_bufs;
+
 	struct rcu_work		free_rwork;	/* see free_ioctx() */
 
 	/*
@@ -290,6 +298,7 @@ static const bool aio_use_state_req_list = false;
 #endif
 
 static void aio_useriocb_free(struct kioctx *);
+static void aio_iocb_buffer_unmap(struct kioctx *);
 static void aio_iopoll_reap_events(struct kioctx *);
 
 static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
@@ -652,6 +661,7 @@ static void free_ioctx(struct work_struct *work)
 					  free_rwork);
 	pr_debug("freeing %p\n", ctx);
 
+	aio_iocb_buffer_unmap(ctx);
 	aio_useriocb_free(ctx);
 	aio_free_ring(ctx);
 	free_percpu(ctx->cpu);
@@ -1597,6 +1607,115 @@ static struct iocb *aio_iocb_from_index(struct kioctx *ctx, int index)
 	return iocb + index;
 }
 
+static void aio_iocb_buffer_unmap(struct kioctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return;
+
+	for (i = 0; i < ctx->max_reqs; i++) {
+		struct aio_mapped_ubuf *amu = &ctx->user_bufs[i];
+
+		for (j = 0; j < amu->nr_kvecs; j++) {
+			struct page *page;
+
+			page = virt_to_page(amu->kvec[j].iov_base);
+			put_page(page);
+		}
+		kfree(amu->kvec);
+		amu->nr_kvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+}
+
+static int aio_iocb_buffer_map(struct kioctx *ctx)
+{
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	struct iocb *iocb;
+	int ret = -EINVAL;
+
+	ctx->user_bufs = kzalloc(ctx->max_reqs * sizeof(struct aio_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	for (i = 0; i < ctx->max_reqs; i++) {
+		struct aio_mapped_ubuf *amu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		size_t size;
+
+		iocb = aio_iocb_from_index(ctx, i);
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iocb->aio_buf)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iocb->aio_nbytes > SZ_4M)
+			goto err;
+
+		ubuf = iocb->aio_buf;
+		end = (ubuf + iocb->aio_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(pages);
+			pages = kmalloc(nr_pages * sizeof(struct page *),
+					GFP_KERNEL);
+			if (!pages) {
+				ret = -ENOMEM;
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		amu->kvec = kmalloc(nr_pages * sizeof(struct kvec), GFP_KERNEL);
+		if (!amu->kvec)
+			goto err;
+
+		down_write(&current->mm->mmap_sem);
+		pret = get_user_pages((unsigned long) iocb->aio_buf, nr_pages,
+					1, pages, NULL);
+		up_write(&current->mm->mmap_sem);
+
+		if (pret < nr_pages) {
+			if (pret < 0)
+				ret = pret;
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iocb->aio_nbytes;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			amu->kvec[j].iov_base = page_address(pages[j]) + off;
+			amu->kvec[j].iov_len = vec_len;
+			off = 0;
+			size -= vec_len;
+		}
+		amu->nr_kvecs = nr_pages;
+	}
+	kfree(pages);
+	return 0;
+err:
+	kfree(pages);
+	aio_iocb_buffer_unmap(ctx);
+	return ret;
+}
+
 static void aio_useriocb_free(struct kioctx *ctx)
 {
 	int i;
@@ -1647,7 +1766,8 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
 	unsigned long ctx;
 	long ret;
 
-	if (flags & ~(IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL))
+	if (flags & ~(IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL |
+		      IOCTX_FLAG_FIXEDBUFS))
 		return -EINVAL;
 
 	ret = get_user(ctx, ctxp);
@@ -1663,6 +1783,15 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
 		ret = aio_useriocb_map(ioctx, iocbs);
 		if (ret)
 			goto err;
+		if (flags & IOCTX_FLAG_FIXEDBUFS) {
+			ret = aio_iocb_buffer_map(ioctx);
+			if (ret)
+				goto err;
+		}
+	} else if (flags & IOCTX_FLAG_FIXEDBUFS) {
+		/* can only support fixed bufs with user mapped iocbs */
+		ret = -EINVAL;
+		goto err;
 	}
 
 	ret = put_user(ioctx->user_id, ctxp);
@@ -1939,23 +2068,38 @@ static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb,
 	return ret;
 }
 
-static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec,
-		bool vectored, bool compat, struct iov_iter *iter)
+static int aio_setup_rw(int rw, struct aio_kiocb *kiocb,
+		const struct iocb *iocb, struct iovec **iovec, bool vectored,
+		bool compat, bool kvecs, struct iov_iter *iter)
 {
-	void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
+	void __user *ubuf = (void __user *)(uintptr_t)iocb->aio_buf;
 	size_t len = iocb->aio_nbytes;
 
 	if (!vectored) {
-		ssize_t ret = import_single_range(rw, buf, len, *iovec, iter);
+		ssize_t ret;
+
+		if (!kvecs) {
+			ret = import_single_range(rw, ubuf, len, *iovec, iter);
+		} else {
+			long index = (long) kiocb->ki_user_iocb;
+			struct aio_mapped_ubuf *amu;
+
+			/* __io_submit_one() already validated the index */
+			amu = &kiocb->ki_ctx->user_bufs[index];
+			ret = import_kvec(rw, amu->kvec, amu->nr_kvecs,
+						len, iter);
+		}
 		*iovec = NULL;
 		return ret;
 	}
+	if (kvecs)
+		return -EINVAL;
 #ifdef CONFIG_COMPAT
 	if (compat)
-		return compat_import_iovec(rw, buf, len, UIO_FASTIOV, iovec,
+		return compat_import_iovec(rw, ubuf, len, UIO_FASTIOV, iovec,
 				iter);
 #endif
-	return import_iovec(rw, buf, len, UIO_FASTIOV, iovec, iter);
+	return import_iovec(rw, ubuf, len, UIO_FASTIOV, iovec, iter);
 }
 
 static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
@@ -2028,7 +2172,7 @@ static void aio_iopoll_iocb_issued(struct aio_submit_state *state,
 
 static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
 			struct aio_submit_state *state, bool vectored,
-			bool compat)
+			bool compat, bool kvecs)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *req = &kiocb->rw;
@@ -2048,9 +2192,11 @@ static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
 	if (unlikely(!file->f_op->read_iter))
 		goto out_fput;
 
-	ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter);
+	ret = aio_setup_rw(READ, kiocb, iocb, &iovec, vectored, compat, kvecs,
+				&iter);
 	if (ret)
 		goto out_fput;
+
 	ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter));
 	if (!ret)
 		aio_rw_done(req, call_read_iter(file, req, &iter));
@@ -2063,7 +2209,7 @@ static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
 
 static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
 			 struct aio_submit_state *state, bool vectored,
-			 bool compat)
+			 bool compat, bool kvecs)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *req = &kiocb->rw;
@@ -2083,7 +2229,8 @@ static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
 	if (unlikely(!file->f_op->write_iter))
 		goto out_fput;
 
-	ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter);
+	ret = aio_setup_rw(WRITE, kiocb, iocb, &iovec, vectored, compat, kvecs,
+				&iter);
 	if (ret)
 		goto out_fput;
 	ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
@@ -2322,7 +2469,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
 
 static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 			   struct iocb __user *user_iocb,
-			   struct aio_submit_state *state, bool compat)
+			   struct aio_submit_state *state, bool compat,
+			   bool kvecs)
 {
 	struct aio_kiocb *req;
 	ssize_t ret;
@@ -2382,16 +2530,16 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
 	ret = -EINVAL;
 	switch (iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
-		ret = aio_read(req, iocb, state, false, compat);
+		ret = aio_read(req, iocb, state, false, compat, kvecs);
 		break;
 	case IOCB_CMD_PWRITE:
-		ret = aio_write(req, iocb, state, false, compat);
+		ret = aio_write(req, iocb, state, false, compat, kvecs);
 		break;
 	case IOCB_CMD_PREADV:
-		ret = aio_read(req, iocb, state, true, compat);
+		ret = aio_read(req, iocb, state, true, compat, kvecs);
 		break;
 	case IOCB_CMD_PWRITEV:
-		ret = aio_write(req, iocb, state, true, compat);
+		ret = aio_write(req, iocb, state, true, compat, kvecs);
 		break;
 	case IOCB_CMD_FSYNC:
 		if (ctx->flags & IOCTX_FLAG_IOPOLL)
@@ -2443,6 +2591,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 			 struct aio_submit_state *state, bool compat)
 {
 	struct iocb iocb, *iocbp;
+	bool kvecs;
 
 	if (ctx->flags & IOCTX_FLAG_USERIOCB) {
 		unsigned long iocb_index = (unsigned long) user_iocb;
@@ -2450,14 +2599,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 		if (iocb_index >= ctx->max_reqs)
 			return -EINVAL;
 
+		kvecs = (ctx->flags & IOCTX_FLAG_FIXEDBUFS) != 0;
 		iocbp = aio_iocb_from_index(ctx, iocb_index);
 	} else {
 		if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
 			return -EFAULT;
+		kvecs = false;
 		iocbp = &iocb;
 	}
 
-	return __io_submit_one(ctx, iocbp, user_iocb, state, compat);
+	return __io_submit_one(ctx, iocbp, user_iocb, state, compat, kvecs);
 }
 
 #ifdef CONFIG_BLOCK
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index ea0b9a19f4df..05d72cf86bd3 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -110,6 +110,7 @@ struct iocb {
 
 #define IOCTX_FLAG_USERIOCB	(1 << 0)	/* iocbs are user mapped */
 #define IOCTX_FLAG_IOPOLL	(1 << 1)	/* io_context is polled */
+#define IOCTX_FLAG_FIXEDBUFS	(1 << 2)	/* IO buffers are fixed */
 
 #undef IFBIG
 #undef IFLITTLE
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/27] aio: fix failure to put the file pointer
  2018-11-30 16:56 ` [PATCH 01/27] aio: fix failure to put the file pointer Jens Axboe
@ 2018-11-30 17:07   ` Bart Van Assche
  2018-11-30 17:08     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Bart Van Assche @ 2018-11-30 17:07 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel, linux-aio; +Cc: hch

On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> If the ioprio capability check fails, we return without putting
> the file pointer.
> 
> Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/aio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index b984918be4b7..205390c0c1bb 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
>  		ret = ioprio_check_cap(iocb->aio_reqprio);
>  		if (ret) {
>  			pr_debug("aio ioprio check cap error: %d\n", ret);
> +			fput(req->ki_filp);
>  			return ret;
>  		}

Since this patch fixes a bug that was introduced in kernel v4.18, does this
patch need a "Cc: stable" tag?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/27] aio: fix failure to put the file pointer
  2018-11-30 17:07   ` Bart Van Assche
@ 2018-11-30 17:08     ` Jens Axboe
  2018-11-30 17:24       ` Bart Van Assche
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 17:08 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-fsdevel, linux-aio; +Cc: hch

On 11/30/18 10:07 AM, Bart Van Assche wrote:
> On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
>> If the ioprio capability check fails, we return without putting
>> the file pointer.
>>
>> Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
>> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/aio.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/aio.c b/fs/aio.c
>> index b984918be4b7..205390c0c1bb 100644
>> --- a/fs/aio.c
>> +++ b/fs/aio.c
>> @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
>>  		ret = ioprio_check_cap(iocb->aio_reqprio);
>>  		if (ret) {
>>  			pr_debug("aio ioprio check cap error: %d\n", ret);
>> +			fput(req->ki_filp);
>>  			return ret;
>>  		}
> 
> Since this patch fixes a bug that was introduced in kernel v4.18, does this
> patch need a "Cc: stable" tag?

The Fixes tag should take care of that by itself, I hope.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
  2018-11-30 16:56 ` [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
@ 2018-11-30 17:12   ` Bart Van Assche
  2018-11-30 17:17     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Bart Van Assche @ 2018-11-30 17:12 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel, linux-aio; +Cc: hch

On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> We can't wait for polled events to complete, as they may require active
> polling from whoever submitted it. If that is the same task that is
> submitting new IO, we could deadlock waiting for IO to complete that
> this task is supposed to be completing itself.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/block_dev.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 6de8d35f6e41..ebc3d5a0f424 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -402,8 +402,16 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>  
>  		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>  		if (!nr_pages) {
> -			if (iocb->ki_flags & IOCB_HIPRI)
> +			if (iocb->ki_flags & IOCB_HIPRI) {
>  				bio->bi_opf |= REQ_HIPRI;
> +				/*
> +				 * For async polled IO, we can't wait for
> +				 * requests to complete, as they may also be
> +				 * polled and require active reaping.
> +				 */
> +				if (!is_sync)
> +					bio->bi_opf |= REQ_NOWAIT;
> +			}
>  
>  			qc = submit_bio(bio);
>  			WRITE_ONCE(iocb->ki_cookie, qc);

Setting REQ_NOWAIT from inside the block layer will make the code that
submits requests harder to review. Have you considered making this code
fail I/O if REQ_NOWAIT has not been set, and requiring that the context
that submits I/O sets REQ_NOWAIT?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/27] aio: clear IOCB_HIPRI
  2018-11-30 16:56 ` [PATCH 02/27] aio: clear IOCB_HIPRI Jens Axboe
@ 2018-11-30 17:13   ` Christoph Hellwig
  2018-11-30 17:14     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2018-11-30 17:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

I think we'll need to queue this up for 4.21 ASAP independent of the
rest, given that with separate poll queues userspace could otherwise
submit I/O that will never get polled for anywhere.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/27] aio: clear IOCB_HIPRI
  2018-11-30 17:13   ` Christoph Hellwig
@ 2018-11-30 17:14     ` Jens Axboe
  2018-12-04 14:46       ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 17:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-fsdevel, linux-aio

On 11/30/18 10:13 AM, Christoph Hellwig wrote:
> I think we'll need to queue this up for 4.21 ASAP independent of the
> rest, given that with separate poll queues userspace could otherwise
> submit I/O that will never get polled for anywhere.

Probably a good idea, I can just move it to my 4.21 branch, it's not
strictly dependent on the series.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
  2018-11-30 17:12   ` Bart Van Assche
@ 2018-11-30 17:17     ` Jens Axboe
  2018-12-04 14:48       ` Christoph Hellwig
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 17:17 UTC (permalink / raw)
  To: Bart Van Assche, linux-block, linux-fsdevel, linux-aio; +Cc: hch

On 11/30/18 10:12 AM, Bart Van Assche wrote:
> On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
>> We can't wait for polled events to complete, as they may require active
>> polling from whoever submitted it. If that is the same task that is
>> submitting new IO, we could deadlock waiting for IO to complete that
>> this task is supposed to be completing itself.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/block_dev.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 6de8d35f6e41..ebc3d5a0f424 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -402,8 +402,16 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
>>  
>>  		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
>>  		if (!nr_pages) {
>> -			if (iocb->ki_flags & IOCB_HIPRI)
>> +			if (iocb->ki_flags & IOCB_HIPRI) {
>>  				bio->bi_opf |= REQ_HIPRI;
>> +				/*
>> +				 * For async polled IO, we can't wait for
>> +				 * requests to complete, as they may also be
>> +				 * polled and require active reaping.
>> +				 */
>> +				if (!is_sync)
>> +					bio->bi_opf |= REQ_NOWAIT;
>> +			}
>>  
>>  			qc = submit_bio(bio);
>>  			WRITE_ONCE(iocb->ki_cookie, qc);
> 
> Setting REQ_NOWAIT from inside the block layer will make the code that
> submits requests harder to review. Have you considered to make this code
> fail I/O if REQ_NOWAIT has not been set and to require that the context
> that submits I/O sets REQ_NOWAIT?

It's technically still feasible to do for sync polled IO, it's only
the async case that makes it a potential deadlock.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 01/27] aio: fix failure to put the file pointer
  2018-11-30 17:08     ` Jens Axboe
@ 2018-11-30 17:24       ` Bart Van Assche
  0 siblings, 0 replies; 59+ messages in thread
From: Bart Van Assche @ 2018-11-30 17:24 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel, linux-aio; +Cc: hch

On Fri, 2018-11-30 at 10:08 -0700, Jens Axboe wrote:
> On 11/30/18 10:07 AM, Bart Van Assche wrote:
> > On Fri, 2018-11-30 at 09:56 -0700, Jens Axboe wrote:
> > > If the ioprio capability check fails, we return without putting
> > > the file pointer.
> > > 
> > > Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
> > > Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > ---
> > >  fs/aio.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/fs/aio.c b/fs/aio.c
> > > index b984918be4b7..205390c0c1bb 100644
> > > --- a/fs/aio.c
> > > +++ b/fs/aio.c
> > > @@ -1436,6 +1436,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
> > >  		ret = ioprio_check_cap(iocb->aio_reqprio);
> > >  		if (ret) {
> > >  			pr_debug("aio ioprio check cap error: %d\n", ret);
> > > +			fput(req->ki_filp);
> > >  			return ret;
> > >  		}
> > 
> > Since this patch fixes a bug that was introduced in kernel v4.18, does this
> > patch need a "Cc: stable" tag?
> 
> The fixes should take care of that by itself, I hope.

Hi Jens,

My understanding is that patches that have a "Cc: stable" tag are guaranteed to
be integrated in a stable kernel sooner or later. Without that tag it depends on
the stable kernel maintainer whether or not these patches get picked up. I think
the "AUTOSEL" tag Sasha Levin uses indicates that a patch was picked up for one
of his stable kernels and that it did not have a "Cc: stable" tag. I'm not sure
Greg KH picks up patches that only have a "Fixes:" tag but no "Cc: stable" tag.
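
I.e., to be safe, something like this in the tag block (the version
range is just an example):

    Fixes: d9a08a9e616b ("fs: Add aio iopriority support")
    Cc: stable@vger.kernel.org # v4.18+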

Bart.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 26/27] iov_iter: add import_kvec()
  2018-11-30 16:56 ` [PATCH 26/27] iov_iter: add import_kvec() Jens Axboe
@ 2018-11-30 19:17   ` Al Viro
  2018-11-30 20:15     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2018-11-30 19:17 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 09:56:45AM -0700, Jens Axboe wrote:
> This explicitly sets up an ITER_KVEC from an iovec with kernel ranges
> mapped.

> +int import_kvec(int type, const struct kvec *kvecs, unsigned nr_segs,
> +		size_t bytes, struct iov_iter *iter)
> +{
> +	const struct iovec *p = (const struct iovec *) kvecs;
> +
> +	iov_iter_init_type(iter, ITER_KVEC, type, p, nr_segs, bytes);
> +	return 0;
> +}

What the hell is wrong with existing iov_iter_kvec()?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 16:56 ` [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio Jens Axboe
@ 2018-11-30 19:21   ` Al Viro
  2018-11-30 20:15     ` Jens Axboe
  2018-12-04 14:55     ` Christoph Hellwig
  0 siblings, 2 replies; 59+ messages in thread
From: Al Viro @ 2018-11-30 19:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
> For an ITER_KVEC, we can just iterate the iov and add the pages
> to the bio directly.

> +		page = virt_to_page(kv->iov_base);
> +		size = bio_add_page(bio, page, kv->iov_len,
> +					offset_in_page(kv->iov_base));

Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
addresses are fine for ITER_KVEC, etc.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 26/27] iov_iter: add import_kvec()
  2018-11-30 19:17   ` Al Viro
@ 2018-11-30 20:15     ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 20:15 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 12:17 PM, Al Viro wrote:
> On Fri, Nov 30, 2018 at 09:56:45AM -0700, Jens Axboe wrote:
>> This explicitly sets up an ITER_KVEC from an iovec with kernel ranges
>> mapped.
> 
>> +int import_kvec(int type, const struct kvec *kvecs, unsigned nr_segs,
>> +		size_t bytes, struct iov_iter *iter)
>> +{
>> +	const struct iovec *p = (const struct iovec *) kvecs;
>> +
>> +	iov_iter_init_type(iter, ITER_KVEC, type, p, nr_segs, bytes);
>> +	return 0;
>> +}
> 
> What the hell is wrong with existing iov_iter_kvec()?

Hah, looks like I overlooked that. Not sure how anyone could look at
lib/iov_iter.c and not get lost in the beauty of it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 19:21   ` Al Viro
@ 2018-11-30 20:15     ` Jens Axboe
  2018-11-30 20:32       ` Jens Axboe
  2018-12-04 14:55     ` Christoph Hellwig
  1 sibling, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 20:15 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 12:21 PM, Al Viro wrote:
> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
>> For an ITER_KVEC, we can just iterate the iov and add the pages
>> to the bio directly.
> 
>> +		page = virt_to_page(kv->iov_base);
>> +		size = bio_add_page(bio, page, kv->iov_len,
>> +					offset_in_page(kv->iov_base));
> 
> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
> addresses are fine for ITER_KVEC, etc.

Then how do you set up a kvec based iter with memory you can safely
DMA to/from?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 20:15     ` Jens Axboe
@ 2018-11-30 20:32       ` Jens Axboe
  2018-11-30 21:11         ` Al Viro
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 20:32 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 1:15 PM, Jens Axboe wrote:
> On 11/30/18 12:21 PM, Al Viro wrote:
>> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
>>> For an ITER_KVEC, we can just iterate the iov and add the pages
>>> to the bio directly.
>>
>>> +		page = virt_to_page(kv->iov_base);
>>> +		size = bio_add_page(bio, page, kv->iov_len,
>>> +					offset_in_page(kv->iov_base));
>>
>> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
>> addresses are fine for ITER_KVEC, etc.
> 
> Then how do you set up a kvec based iter with memory you can safely
> DMA to/from?

Would this make you happy:

if (!is_vmalloc_addr(kv->iov_base))
        page = virt_to_page(kv->iov_base);
else
        page = vmalloc_to_page(kv->iov_base);

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 20:32       ` Jens Axboe
@ 2018-11-30 21:11         ` Al Viro
  2018-11-30 21:16           ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2018-11-30 21:11 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 01:32:21PM -0700, Jens Axboe wrote:
> On 11/30/18 1:15 PM, Jens Axboe wrote:
> > On 11/30/18 12:21 PM, Al Viro wrote:
> >> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
> >>> For an ITER_KVEC, we can just iterate the iov and add the pages
> >>> to the bio directly.
> >>
> >>> +		page = virt_to_page(kv->iov_base);
> >>> +		size = bio_add_page(bio, page, kv->iov_len,
> >>> +					offset_in_page(kv->iov_base));
> >>
> >> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
> >> addresses are fine for ITER_KVEC, etc.
> > 
> > Then how do you set up a kvec based iter with memory you can safely
> > DMA to/from?
> 
> Would this make you happy:
> 
> if (!is_vmalloc_addr(kv->iov_base))
>         page = virt_to_page(kv->iov_base);
> else
>         page = vmalloc_to_page(kv->iov_base);

Free advice: don't ever let Linus see anything along those lines.  Results
tend to be colourful...

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 21:11         ` Al Viro
@ 2018-11-30 21:16           ` Jens Axboe
  2018-11-30 21:25             ` Al Viro
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 21:16 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 2:11 PM, Al Viro wrote:
> On Fri, Nov 30, 2018 at 01:32:21PM -0700, Jens Axboe wrote:
>> On 11/30/18 1:15 PM, Jens Axboe wrote:
>>> On 11/30/18 12:21 PM, Al Viro wrote:
>>>> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
>>>>> For an ITER_KVEC, we can just iterate the iov and add the pages
>>>>> to the bio directly.
>>>>
>>>>> +		page = virt_to_page(kv->iov_base);
>>>>> +		size = bio_add_page(bio, page, kv->iov_len,
>>>>> +					offset_in_page(kv->iov_base));
>>>>
>>>> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
>>>> addresses are fine for ITER_KVEC, etc.
>>>
>>> Then how do you set up a kvec based iter with memory you can safely
>>> DMA to/from?
>>
>> Would this make you happy:
>>
>> if (!is_vmalloc_addr(kv->iov_base))
>>         page = virt_to_page(kv->iov_base);
>> else
>>         page = vmalloc_to_page(kv->iov_base);
> 
> Free advice: don't ever let Linus see anything along those lines.  Results
> tend to be colourful...

We already have those lines in the kernel, XFS for instance. Al, could you
please try to be helpful instead of being deliberately obtuse?

Examples of being helpful:

1) Informing the sender of why something is a bad idea, instead of just
   saying it's a bad idea.

2) Making helpful suggestions to improve the current situation.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 21:16           ` Jens Axboe
@ 2018-11-30 21:25             ` Al Viro
  2018-11-30 21:34               ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Al Viro @ 2018-11-30 21:25 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 02:16:38PM -0700, Jens Axboe wrote:
> >> Would this make you happy:
> >>
> >> if (!is_vmalloc_addr(kv->iov_base))
> >>         page = virt_to_page(kv->iov_base);
> >> else
> >>         page = vmalloc_to_page(kv->iov_base);
> > 
> > Free advice: don't ever let Linus see anything along those lines.  Results
> > tend to be colourful...
> 
> We already have those lines in the kernel, XFS for instance. Al, could you
> please try to be helpful instead of being deliberately obtuse?

Again, the last time something like that had been suggested, Linus had replied
with a very impressive rant.  I *did* propose pretty much that, and reaction
was basically "hell no, not in general-purpose primitives".  Precisely about
iov_iter stuff.  A part of that was due to touching page refcounts, but quite
a bit wasn't.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 21:25             ` Al Viro
@ 2018-11-30 21:34               ` Jens Axboe
  2018-11-30 22:06                 ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 21:34 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 2:25 PM, Al Viro wrote:
> On Fri, Nov 30, 2018 at 02:16:38PM -0700, Jens Axboe wrote:
>>>> Would this make you happy:
>>>>
>>>> if (!is_vmalloc_addr(kv->iov_base))
>>>>         page = virt_to_page(kv->iov_base);
>>>> else
>>>>         page = vmalloc_to_page(kv->iov_base);
>>>
>>> Free advice: don't ever let Linus see anything along those lines.  Results
>>> tend to be colourful...
>>
>> We already have those lines in the kernel, XFS for instance. Al, could you
>> please try to be helpful instead of being deliberately obtuse?
> 
> Again, the last time something like that had been suggested, Linus had replied
> with a very impressive rant.  I *did* propose pretty much that, and reaction
> was basically "hell no, not in general-purpose primitives".  Precisely about
> iov_iter stuff.  A part of that was due to touching page refcounts, but quite
> a bit wasn't.

Nobody is touching the page count here, and for the aio user mapped IO,
nobody is touching them at the end either.

As far as I can tell, the above is fine. It's either a vmalloc'ed
address and should be treated specially, or we can do virt_to_page() on
it.

Do you have a link to said rant?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 27/27] aio: add support for pre-mapped user IO buffers
  2018-11-30 16:56 ` [PATCH 27/27] aio: add support for pre-mapped user IO buffers Jens Axboe
@ 2018-11-30 21:44   ` Jeff Moyer
  2018-11-30 21:57     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Moyer @ 2018-11-30 21:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

Hi, Jens,

Jens Axboe <axboe@kernel.dk> writes:

> If we have fixed user buffers, we can map them into the kernel when we
> setup the io_context. That avoids the need to do get_user_pages() for
> each and every IO.
>
> To utilize this feature, the application must set both
> IOCTX_FLAG_USERIOCB, to provide iocb's in userspace, and then
> IOCTX_FLAG_FIXEDBUFS. The latter tells aio that the iocbs that are
> mapped already contain valid destination and sizes. These buffers can
> then be mapped into the kernel for the life time of the io_context, as
> opposed to just the duration of the each single IO.
>
> Only works with non-vectored read/write commands for now, not with
> PREADV/PWRITEV.
>
> A limit of 4M is imposed as the largest buffer we currently support.
> There's nothing preventing us from going larger, but we need some cap,
> and 4M seemed like it would definitely be big enough.

Doesn't this mean that a user can pin a bunch of memory?  Something like
4MB * aio_max_nr?

$ sysctl fs.aio-max-nr
fs.aio-max-nr = 1048576

If so, it may be a good idea to account the memory under RLIMIT_MEMLOCK.
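
Roughly this kind of thing at buffer map time (only a sketch of the
idea - it assumes a locked_vm counter on the user_struct, like perf
uses, and leaves out the unaccounting on teardown):

	static int aio_account_mem(struct user_struct *user, unsigned long nr_pages)
	{
		unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		unsigned long cur, new;

		do {
			cur = atomic_long_read(&user->locked_vm);
			new = cur + nr_pages;
			if (new > limit)
				return -ENOMEM;
		} while (atomic_long_cmpxchg(&user->locked_vm, cur, new) != cur);

		return 0;
	}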

I'm not sure how close you are to proposing this patch set for realz.
If it's soon (now?), then CC-ing linux-api and writing man pages would
be a good idea.  I can help out with the libaio bits if you'd like.  I
haven't yet had time to take this stuff for a spin, sorry.  I'll try to
get to that soonish.

The speedups are pretty impressive!

Cheers,
Jeff


> See the fio change for how to utilize this feature:
>
> http://git.kernel.dk/cgit/fio/commit/?id=2041bd343da1c1e955253f62374588718c64f0f3
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/aio.c                     | 185 +++++++++++++++++++++++++++++++----
>  include/uapi/linux/aio_abi.h |   1 +
>  2 files changed, 169 insertions(+), 17 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 426939f1dae9..f735967488a5 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -42,6 +42,7 @@
>  #include <linux/ramfs.h>
>  #include <linux/percpu-refcount.h>
>  #include <linux/mount.h>
> +#include <linux/sizes.h>
>  
>  #include <asm/kmap_types.h>
>  #include <linux/uaccess.h>
> @@ -86,6 +87,11 @@ struct ctx_rq_wait {
>  	atomic_t count;
>  };
>  
> +struct aio_mapped_ubuf {
> +	struct kvec *kvec;
> +	unsigned int nr_kvecs;
> +};
> +
>  struct kioctx {
>  	struct percpu_ref	users;
>  	atomic_t		dead;
> @@ -124,6 +130,8 @@ struct kioctx {
>  	struct page		**iocb_pages;
>  	long			iocb_nr_pages;
>  
> +	struct aio_mapped_ubuf	*user_bufs;
> +
>  	struct rcu_work		free_rwork;	/* see free_ioctx() */
>  
>  	/*
> @@ -290,6 +298,7 @@ static const bool aio_use_state_req_list = false;
>  #endif
>  
>  static void aio_useriocb_free(struct kioctx *);
> +static void aio_iocb_buffer_unmap(struct kioctx *);
>  static void aio_iopoll_reap_events(struct kioctx *);
>  
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
> @@ -652,6 +661,7 @@ static void free_ioctx(struct work_struct *work)
>  					  free_rwork);
>  	pr_debug("freeing %p\n", ctx);
>  
> +	aio_iocb_buffer_unmap(ctx);
>  	aio_useriocb_free(ctx);
>  	aio_free_ring(ctx);
>  	free_percpu(ctx->cpu);
> @@ -1597,6 +1607,115 @@ static struct iocb *aio_iocb_from_index(struct kioctx *ctx, int index)
>  	return iocb + index;
>  }
>  
> +static void aio_iocb_buffer_unmap(struct kioctx *ctx)
> +{
> +	int i, j;
> +
> +	if (!ctx->user_bufs)
> +		return;
> +
> +	for (i = 0; i < ctx->max_reqs; i++) {
> +		struct aio_mapped_ubuf *amu = &ctx->user_bufs[i];
> +
> +		for (j = 0; j < amu->nr_kvecs; j++) {
> +			struct page *page;
> +
> +			page = virt_to_page(amu->kvec[j].iov_base);
> +			put_page(page);
> +		}
> +		kfree(amu->kvec);
> +		amu->nr_kvecs = 0;
> +	}
> +
> +	kfree(ctx->user_bufs);
> +	ctx->user_bufs = NULL;
> +}
> +
> +static int aio_iocb_buffer_map(struct kioctx *ctx)
> +{
> +	struct page **pages = NULL;
> +	int i, j, got_pages = 0;
> +	struct iocb *iocb;
> +	int ret = -EINVAL;
> +
> +	ctx->user_bufs = kzalloc(ctx->max_reqs * sizeof(struct aio_mapped_ubuf),
> +					GFP_KERNEL);
> +	if (!ctx->user_bufs)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < ctx->max_reqs; i++) {
> +		struct aio_mapped_ubuf *amu = &ctx->user_bufs[i];
> +		unsigned long off, start, end, ubuf;
> +		int pret, nr_pages;
> +		size_t size;
> +
> +		iocb = aio_iocb_from_index(ctx, i);
> +
> +		/*
> +		 * Don't impose further limits on the size and buffer
> +		 * constraints here, we'll -EINVAL later when IO is
> +		 * submitted if they are wrong.
> +		 */
> +		ret = -EFAULT;
> +		if (!iocb->aio_buf)
> +			goto err;
> +
> +		/* arbitrary limit, but we need something */
> +		if (iocb->aio_nbytes > SZ_4M)
> +			goto err;
> +
> +		ubuf = iocb->aio_buf;
> +		end = (ubuf + iocb->aio_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +		start = ubuf >> PAGE_SHIFT;
> +		nr_pages = end - start;
> +
> +		if (!pages || nr_pages > got_pages) {
> +			kfree(pages);
> +			pages = kmalloc(nr_pages * sizeof(struct page *),
> +					GFP_KERNEL);
> +			if (!pages) {
> +				ret = -ENOMEM;
> +				goto err;
> +			}
> +			got_pages = nr_pages;
> +		}
> +
> +		amu->kvec = kmalloc(nr_pages * sizeof(struct kvec), GFP_KERNEL);
> +		if (!amu->kvec)
> +			goto err;
> +
> +		down_write(&current->mm->mmap_sem);
> +		pret = get_user_pages((unsigned long) iocb->aio_buf, nr_pages,
> +					1, pages, NULL);
> +		up_write(&current->mm->mmap_sem);
> +
> +		if (pret < nr_pages) {
> +			if (pret < 0)
> +				ret = pret;
> +			goto err;
> +		}
> +
> +		off = ubuf & ~PAGE_MASK;
> +		size = iocb->aio_nbytes;
> +		for (j = 0; j < nr_pages; j++) {
> +			size_t vec_len;
> +
> +			vec_len = min_t(size_t, size, PAGE_SIZE - off);
> +			amu->kvec[j].iov_base = page_address(pages[j]) + off;
> +			amu->kvec[j].iov_len = vec_len;
> +			off = 0;
> +			size -= vec_len;
> +		}
> +		amu->nr_kvecs = nr_pages;
> +	}
> +	kfree(pages);
> +	return 0;
> +err:
> +	kfree(pages);
> +	aio_iocb_buffer_unmap(ctx);
> +	return ret;
> +}
> +
>  static void aio_useriocb_free(struct kioctx *ctx)
>  {
>  	int i;
> @@ -1647,7 +1766,8 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
>  	unsigned long ctx;
>  	long ret;
>  
> -	if (flags & ~(IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL))
> +	if (flags & ~(IOCTX_FLAG_USERIOCB | IOCTX_FLAG_IOPOLL |
> +		      IOCTX_FLAG_FIXEDBUFS))
>  		return -EINVAL;
>  
>  	ret = get_user(ctx, ctxp);
> @@ -1663,6 +1783,15 @@ SYSCALL_DEFINE4(io_setup2, u32, nr_events, u32, flags, struct iocb * __user,
>  		ret = aio_useriocb_map(ioctx, iocbs);
>  		if (ret)
>  			goto err;
> +		if (flags & IOCTX_FLAG_FIXEDBUFS) {
> +			ret = aio_iocb_buffer_map(ioctx);
> +			if (ret)
> +				goto err;
> +		}
> +	} else if (flags & IOCTX_FLAG_FIXEDBUFS) {
> +		/* can only support fixed bufs with user mapped iocbs */
> +		ret = -EINVAL;
> +		goto err;
>  	}
>  
>  	ret = put_user(ioctx->user_id, ctxp);
> @@ -1939,23 +2068,38 @@ static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb,
>  	return ret;
>  }
>  
> -static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec,
> -		bool vectored, bool compat, struct iov_iter *iter)
> +static int aio_setup_rw(int rw, struct aio_kiocb *kiocb,
> +		const struct iocb *iocb, struct iovec **iovec, bool vectored,
> +		bool compat, bool kvecs, struct iov_iter *iter)
>  {
> -	void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf;
> +	void __user *ubuf = (void __user *)(uintptr_t)iocb->aio_buf;
>  	size_t len = iocb->aio_nbytes;
>  
>  	if (!vectored) {
> -		ssize_t ret = import_single_range(rw, buf, len, *iovec, iter);
> +		ssize_t ret;
> +
> +		if (!kvecs) {
> +			ret = import_single_range(rw, ubuf, len, *iovec, iter);
> +		} else {
> +			long index = (long) kiocb->ki_user_iocb;
> +			struct aio_mapped_ubuf *amu;
> +
> +			/* __io_submit_one() already validated the index */
> +			amu = &kiocb->ki_ctx->user_bufs[index];
> +			ret = import_kvec(rw, amu->kvec, amu->nr_kvecs,
> +						len, iter);
> +		}
>  		*iovec = NULL;
>  		return ret;
>  	}
> +	if (kvecs)
> +		return -EINVAL;
>  #ifdef CONFIG_COMPAT
>  	if (compat)
> -		return compat_import_iovec(rw, buf, len, UIO_FASTIOV, iovec,
> +		return compat_import_iovec(rw, ubuf, len, UIO_FASTIOV, iovec,
>  				iter);
>  #endif
> -	return import_iovec(rw, buf, len, UIO_FASTIOV, iovec, iter);
> +	return import_iovec(rw, ubuf, len, UIO_FASTIOV, iovec, iter);
>  }
>  
>  static inline void aio_rw_done(struct kiocb *req, ssize_t ret)
> @@ -2028,7 +2172,7 @@ static void aio_iopoll_iocb_issued(struct aio_submit_state *state,
>  
>  static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
>  			struct aio_submit_state *state, bool vectored,
> -			bool compat)
> +			bool compat, bool kvecs)
>  {
>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>  	struct kiocb *req = &kiocb->rw;
> @@ -2048,9 +2192,11 @@ static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
>  	if (unlikely(!file->f_op->read_iter))
>  		goto out_fput;
>  
> -	ret = aio_setup_rw(READ, iocb, &iovec, vectored, compat, &iter);
> +	ret = aio_setup_rw(READ, kiocb, iocb, &iovec, vectored, compat, kvecs,
> +				&iter);
>  	if (ret)
>  		goto out_fput;
> +
>  	ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter));
>  	if (!ret)
>  		aio_rw_done(req, call_read_iter(file, req, &iter));
> @@ -2063,7 +2209,7 @@ static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb,
>  
>  static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
>  			 struct aio_submit_state *state, bool vectored,
> -			 bool compat)
> +			 bool compat, bool kvecs)
>  {
>  	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>  	struct kiocb *req = &kiocb->rw;
> @@ -2083,7 +2229,8 @@ static ssize_t aio_write(struct aio_kiocb *kiocb, const struct iocb *iocb,
>  	if (unlikely(!file->f_op->write_iter))
>  		goto out_fput;
>  
> -	ret = aio_setup_rw(WRITE, iocb, &iovec, vectored, compat, &iter);
> +	ret = aio_setup_rw(WRITE, kiocb, iocb, &iovec, vectored, compat, kvecs,
> +				&iter);
>  	if (ret)
>  		goto out_fput;
>  	ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
> @@ -2322,7 +2469,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb)
>  
>  static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
>  			   struct iocb __user *user_iocb,
> -			   struct aio_submit_state *state, bool compat)
> +			   struct aio_submit_state *state, bool compat,
> +			   bool kvecs)
>  {
>  	struct aio_kiocb *req;
>  	ssize_t ret;
> @@ -2382,16 +2530,16 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb,
>  	ret = -EINVAL;
>  	switch (iocb->aio_lio_opcode) {
>  	case IOCB_CMD_PREAD:
> -		ret = aio_read(req, iocb, state, false, compat);
> +		ret = aio_read(req, iocb, state, false, compat, kvecs);
>  		break;
>  	case IOCB_CMD_PWRITE:
> -		ret = aio_write(req, iocb, state, false, compat);
> +		ret = aio_write(req, iocb, state, false, compat, kvecs);
>  		break;
>  	case IOCB_CMD_PREADV:
> -		ret = aio_read(req, iocb, state, true, compat);
> +		ret = aio_read(req, iocb, state, true, compat, kvecs);
>  		break;
>  	case IOCB_CMD_PWRITEV:
> -		ret = aio_write(req, iocb, state, true, compat);
> +		ret = aio_write(req, iocb, state, true, compat, kvecs);
>  		break;
>  	case IOCB_CMD_FSYNC:
>  		if (ctx->flags & IOCTX_FLAG_IOPOLL)
> @@ -2443,6 +2591,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  			 struct aio_submit_state *state, bool compat)
>  {
>  	struct iocb iocb, *iocbp;
> +	bool kvecs;
>  
>  	if (ctx->flags & IOCTX_FLAG_USERIOCB) {
>  		unsigned long iocb_index = (unsigned long) user_iocb;
> @@ -2450,14 +2599,16 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
>  		if (iocb_index >= ctx->max_reqs)
>  			return -EINVAL;
>  
> +		kvecs = (ctx->flags & IOCTX_FLAG_FIXEDBUFS) != 0;
>  		iocbp = aio_iocb_from_index(ctx, iocb_index);
>  	} else {
>  		if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb))))
>  			return -EFAULT;
> +		kvecs = false;
>  		iocbp = &iocb;
>  	}
>  
> -	return __io_submit_one(ctx, iocbp, user_iocb, state, compat);
> +	return __io_submit_one(ctx, iocbp, user_iocb, state, compat, kvecs);
>  }
>  
>  #ifdef CONFIG_BLOCK
> diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
> index ea0b9a19f4df..05d72cf86bd3 100644
> --- a/include/uapi/linux/aio_abi.h
> +++ b/include/uapi/linux/aio_abi.h
> @@ -110,6 +110,7 @@ struct iocb {
>  
>  #define IOCTX_FLAG_USERIOCB	(1 << 0)	/* iocbs are user mapped */
>  #define IOCTX_FLAG_IOPOLL	(1 << 1)	/* io_context is polled */
> +#define IOCTX_FLAG_FIXEDBUFS	(1 << 2)	/* IO buffers are fixed */
>  
>  #undef IFBIG
>  #undef IFLITTLE

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 27/27] aio: add support for pre-mapped user IO buffers
  2018-11-30 21:44   ` Jeff Moyer
@ 2018-11-30 21:57     ` Jens Axboe
  2018-11-30 22:04       ` Jeff Moyer
  0 siblings, 1 reply; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 21:57 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 2:44 PM, Jeff Moyer wrote:
> Hi, Jens,
> 
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> If we have fixed user buffers, we can map them into the kernel when we
>> setup the io_context. That avoids the need to do get_user_pages() for
>> each and every IO.
>>
>> To utilize this feature, the application must set both
>> IOCTX_FLAG_USERIOCB, to provide iocb's in userspace, and then
>> IOCTX_FLAG_FIXEDBUFS. The latter tells aio that the iocbs that are
>> mapped already contain valid destination and sizes. These buffers can
>> then be mapped into the kernel for the life time of the io_context, as
>> opposed to just the duration of the each single IO.
>>
>> Only works with non-vectored read/write commands for now, not with
>> PREADV/PWRITEV.
>>
>> A limit of 4M is imposed as the largest buffer we currently support.
>> There's nothing preventing us from going larger, but we need some cap,
>> and 4M seemed like it would definitely be big enough.
> 
> Doesn't this mean that a user can pin a bunch of memory?  Something like
> 4MB * aio_max_nr?
> 
> $ sysctl fs.aio-max-nr
> fs.aio-max-nr = 1048576
> 
> If so, it may be a good idea to account the memory under RLIMIT_MEMLOCK.

Yes, it'll need some kind of limiting, right now the limit would indeed
be aio-max-nr * 4MB. 4G isn't terrible, but...

RLIMIT_MEMLOCK isn't a bad idea.

> I'm not sure how close you are to proposing this patch set for realz.
> If it's soon (now?), then CC-ing linux-api and writing man pages would
> be a good idea.  I can help out with the libaio bits if you'd like.  I
> haven't yet had time to take this stuff for a spin, sorry.  I'll try to
> get to that soonish.

I am proposing it for real, not sure how long it'll take to get it
reviewed and moved forward. Unless I get lucky. 4.22 seems like a more
viable version than 4.21.

I'll take any help I can get on the API/man page parts. And/or testing!

> The speedups are pretty impressive!

That's why I put them in there, maybe that'd get peoples attention :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 27/27] aio: add support for pre-mapped user IO buffers
  2018-11-30 21:57     ` Jens Axboe
@ 2018-11-30 22:04       ` Jeff Moyer
  2018-11-30 22:11         ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Jeff Moyer @ 2018-11-30 22:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

Jens Axboe <axboe@kernel.dk> writes:

>>> A limit of 4M is imposed as the largest buffer we currently support.
>>> There's nothing preventing us from going larger, but we need some cap,
>>> and 4M seemed like it would definitely be big enough.
>> 
>> Doesn't this mean that a user can pin a bunch of memory?  Something like
>> 4MB * aio_max_nr?
>> 
>> $ sysctl fs.aio-max-nr
>> fs.aio-max-nr = 1048576
>> 
>> If so, it may be a good idea to account the memory under RLIMIT_MEMLOCK.
>
> Yes, it'll need some kind of limiting, right now the limit would indeed
> be aio-max-nr * 4MB. 4G isn't terrible, but...

Unless my math's wrong, that's 4TiB on my system.  ;-)

> RLIMIT_MEMLOCK isn't a bad idea.
>
>> I'm not sure how close you are to proposing this patch set for realz.
>> If it's soon (now?), then CC-ing linux-api and writing man pages would
>> be a good idea.  I can help out with the libaio bits if you'd like.  I
>> haven't yet had time to take this stuff for a spin, sorry.  I'll try to
>> get to that soonish.
>
> I am proposing it for real, not sure how long it'll take to get it
> reviewed and moved forward. Unless I get lucky. 4.22 seems like a more
> viable version than 4.21.
>
> I'll take any help I can get on the API/man page parts. And/or testing!

OK, I'll add libaio support (including unit tests), write the man page,
and I'll definitely do some testing.  I'll start on all that probably in
the latter half of next week.

>> The speedups are pretty impressive!
>
> That's why I put them in there, maybe that'd get peoples attention :-)

Indeed.  :)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 21:34               ` Jens Axboe
@ 2018-11-30 22:06                 ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 22:06 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 2:34 PM, Jens Axboe wrote:
> On 11/30/18 2:25 PM, Al Viro wrote:
>> On Fri, Nov 30, 2018 at 02:16:38PM -0700, Jens Axboe wrote:
>>>>> Would this make you happy:
>>>>>
>>>>> if (!is_vmalloc_addr(kv->iov_base))
>>>>>         page = virt_to_page(kv->iov_base);
>>>>> else
>>>>>         page = vmalloc_to_page(kv->iov_base);
>>>>
>>>> Free advice: don't ever let Linus see anything along those lines.  Results
>>>> tend to be colourful...
>>>
>>> We already have those lines in the kernel, XFS for instance. Al, could you
>>> please try to be helpful instead of being deliberately obtuse?
>>
>> Again, the last time something like that had been suggested, Linus had replied
>> with a very impressive rant.  I *did* propose pretty much that, and reaction
>> was basically "hell no, not in general-purpose primitives".  Precisely about
>> iov_iter stuff.  A part of that was due to touching page refcounts, but quite
>> a bit wasn't.
> 
> Nobody is touching the page count here, and for the aio user mapped IO,
> nobody is touching them at the end either.
> 
> As far as I can tell, the above is fine. It's either a vmalloc'ed
> address and should be treated specially, or we can do virt_to_page() on
> it.
> 
> Do you have a link to said rant?

I found the rant. I think the solution here is to switch it to using
ITER_BVEC instead. With ITER_KVEC, we don't necessarily know if we can
map it; with a bvec we already have the pages. And from the aio point of
view, we know the pages are sane.
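
Roughly, instead of building kvecs at registration time, something like
this (sketch only, modulo the exact iov_iter_bvec() signature):

	struct aio_mapped_ubuf {
		struct bio_vec	*bvec;
		unsigned int	nr_bvecs;
	};

	/* at buffer map time, keep the pinned pages as a bvec table */
	for (j = 0; j < nr_pages; j++) {
		size_t vec_len = min_t(size_t, size, PAGE_SIZE - off);

		amu->bvec[j].bv_page = pages[j];
		amu->bvec[j].bv_len = vec_len;
		amu->bvec[j].bv_offset = off;
		off = 0;
		size -= vec_len;
	}
	amu->nr_bvecs = nr_pages;

	/* and at submit time */
	iov_iter_bvec(iter, rw, amu->bvec, amu->nr_bvecs, len);

That also gets rid of the virt_to_page() games entirely.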

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 27/27] aio: add support for pre-mapped user IO buffers
  2018-11-30 22:04       ` Jeff Moyer
@ 2018-11-30 22:11         ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-11-30 22:11 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On 11/30/18 3:04 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>>>> A limit of 4M is imposed as the largest buffer we currently support.
>>>> There's nothing preventing us from going larger, but we need some cap,
>>>> and 4M seemed like it would definitely be big enough.
>>>
>>> Doesn't this mean that a user can pin a bunch of memory?  Something like
>>> 4MB * aio_max_nr?
>>>
>>> $ sysctl fs.aio-max-nr
>>> fs.aio-max-nr = 1048576
>>>
>>> If so, it may be a good idea to account the memory under RLIMIT_MEMLOCK.
>>
>> Yes, it'll need some kind of limiting, right now the limit would indeed
>> be aio-max-nr * 4MB. 4G isn't terrible, but...
> 
> Unless my math's wrong, that's 4TiB on my system.  ;-)

I guess that's a little more terrible ;-)

>> RLIMIT_MEMLOCK isn't a bad idea.
>>
>>> I'm not sure how close you are to proposing this patch set for realz.
>>> If it's soon (now?), then CC-ing linux-api and writing man pages would
>>> be a good idea.  I can help out with the libaio bits if you'd like.  I
>>> haven't yet had time to take this stuff for a spin, sorry.  I'll try to
>>> get to that soonish.
>>
>> I am proposing it for real, not sure how long it'll take to get it
>> reviewed and moved forward. Unless I get lucky. 4.22 seems like a more
>> viable version than 4.21.
>>
>> I'll take any help I can get on the API/man page parts. And/or testing!
> 
> OK, I'll add libaio support (including unit tests), write the man page,
> and I'll definitely do some testing.  I'll start on all that probably in
> the latter half of next week.

Awesome, that's much appreciated!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 02/27] aio: clear IOCB_HIPRI
  2018-11-30 17:14     ` Jens Axboe
@ 2018-12-04 14:46       ` Christoph Hellwig
  2018-12-04 16:40         ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2018-12-04 14:46 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-block, linux-fsdevel, linux-aio

On Fri, Nov 30, 2018 at 10:14:31AM -0700, Jens Axboe wrote:
> On 11/30/18 10:13 AM, Christoph Hellwig wrote:
> > I think we'll need to queue this up for 4.21 ASAP independent of the
> > rest, given that with separate poll queues userspace could otherwise
> > submit I/O that will never get polled for anywhere.
> 
> Probably a good idea, I can just move it to my 4.21 branch, it's not
> strictly dependent on the series.

So, can you add it to the 4.21 branch?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
  2018-11-30 17:17     ` Jens Axboe
@ 2018-12-04 14:48       ` Christoph Hellwig
  2018-12-04 18:13         ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2018-12-04 14:48 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Bart Van Assche, linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 10:17:49AM -0700, Jens Axboe wrote:
> > Setting REQ_NOWAIT from inside the block layer will make the code that
> > submits requests harder to review. Have you considered to make this code
> > fail I/O if REQ_NOWAIT has not been set and to require that the context
> > that submits I/O sets REQ_NOWAIT?
> 
> It's technically still feasible to do for sync polled IO, it's only
> the async case that makes it a potential deadlock.

I wonder if we want a REQ_ASYNC_POLL compound flag #define that sets
REQ_POLL and REQ_NOWAIT to make this blindingly obvious.
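
Something along these lines, with REQ_HIPRI being what the series uses
for the polled flag today (sketch only):

	#define REQ_ASYNC_POLL	(REQ_HIPRI | REQ_NOWAIT)

	...
	if (iocb->ki_flags & IOCB_HIPRI)
		bio->bi_opf |= is_sync ? REQ_HIPRI : REQ_ASYNC_POLL;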

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req()
  2018-11-30 16:56 ` [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req() Jens Axboe
@ 2018-12-04 14:49   ` Christoph Hellwig
  2018-12-04 15:27     ` Jens Axboe
  0 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2018-12-04 14:49 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

> -	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
> -	if (unlikely(!req))
> -		return NULL;
> +	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
> +	if (req) {
> +		percpu_ref_get(&ctx->reqs);
> +		req->ki_ctx = ctx;
> +		INIT_LIST_HEAD(&req->ki_list);
> +		refcount_set(&req->ki_refcnt, 0);
> +		req->ki_eventfd = NULL;
> +	}
>  
> -	percpu_ref_get(&ctx->reqs);
> -	INIT_LIST_HEAD(&req->ki_list);
> -	refcount_set(&req->ki_refcnt, 0);
> -	req->ki_ctx = ctx;
>  	return req;

Why the reformatting?  Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions
  2018-11-30 16:56 ` [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions Jens Axboe
@ 2018-12-04 14:50   ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2018-12-04 14:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 09:56:30AM -0700, Jens Axboe wrote:
> Plugging is meant to optimize submission of a string of IOs, if we don't
> have more than 2 being submitted, don't bother setting up a plug.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 12/27] aio: use iocb_put() instead of open coding it
  2018-11-30 16:56 ` [PATCH 12/27] aio: use iocb_put() instead of open coding it Jens Axboe
@ 2018-12-04 14:50   ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2018-12-04 14:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 09:56:31AM -0700, Jens Axboe wrote:
> Replace the percpu_ref_put() + kmem_cache_free() with a call to
> iocb_put() instead.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-11-30 19:21   ` Al Viro
  2018-11-30 20:15     ` Jens Axboe
@ 2018-12-04 14:55     ` Christoph Hellwig
  2018-12-04 15:25       ` Jens Axboe
  1 sibling, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2018-12-04 14:55 UTC (permalink / raw)
  To: Al Viro; +Cc: Jens Axboe, linux-block, linux-fsdevel, linux-aio, hch

On Fri, Nov 30, 2018 at 07:21:02PM +0000, Al Viro wrote:
> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
> > For an ITER_KVEC, we can just iterate the iov and add the pages
> > to the bio directly.
> 
> > +		page = virt_to_page(kv->iov_base);
> > +		size = bio_add_page(bio, page, kv->iov_len,
> > +					offset_in_page(kv->iov_base));
> 
> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
> addresses are fine for ITER_KVEC, etc.

In this particular case it seems the caller knows what kind of pages
we have, but we need to properly document this at the very least.

Note that in the completely generic case iov_base could also point to
memory without a struct page at all (e.g. ioremap()ed memory).
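
(To make that constraint concrete, a defensive variant of the quoted loop
could look like the sketch below; the iteration over iter->kvec/nr_segs is
simplified, and the real patch relies on callers passing kmalloc-backed
memory rather than checking for it.)

	const struct kvec *kv = iter->kvec;
	unsigned long i;

	for (i = 0; i < iter->nr_segs; i++, kv++) {
		void *addr = kv->iov_base;

		/*
		 * vmalloc()ed and ioremap()ed addresses have no usable
		 * direct-map struct page, so virt_to_page() would return
		 * garbage for them.
		 */
		if (!virt_addr_valid(addr))
			return -EINVAL;

		if (bio_add_page(bio, virt_to_page(addr), kv->iov_len,
				 offset_in_page(addr)) != kv->iov_len)
			return -EINVAL;
	}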


* Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
  2018-12-04 14:55     ` Christoph Hellwig
@ 2018-12-04 15:25       ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-12-04 15:25 UTC (permalink / raw)
  To: Christoph Hellwig, Al Viro; +Cc: linux-block, linux-fsdevel, linux-aio

On 12/4/18 7:55 AM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 07:21:02PM +0000, Al Viro wrote:
>> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
>>> For an ITER_KVEC, we can just iterate the iov and add the pages
>>> to the bio directly.
>>
>>> +		page = virt_to_page(kv->iov_base);
>>> +		size = bio_add_page(bio, page, kv->iov_len,
>>> +					offset_in_page(kv->iov_base));
>>
>> Who said that you *can* do virt_to_page() on those?  E.g. vmalloc()'ed
>> addresses are fine for ITER_KVEC, etc.
> 
> In this particular case it seems the caller knows what kind of pages
> we have, but we need to properly document this at the very least.
> 
> Note that in the completely generic case iov_base could also point to
> memory without a struct page at all (e.g. ioremap()ed memory).

That's why I went to ITER_BVEC instead; that's much saner for this
use case.
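
(For illustration, the bvec-based path boils down to something like the
sketch below; bvecs and nr_segs are placeholder names here, not the patch's
actual variables.)

	/*
	 * With ITER_BVEC the caller describes struct pages it already owns,
	 * so the bio is built from real pages instead of kernel virtual
	 * addresses run through virt_to_page().
	 */
	for (i = 0; i < nr_segs; i++) {
		const struct bio_vec *bv = &bvecs[i];

		if (bio_add_page(bio, bv->bv_page, bv->bv_len,
				 bv->bv_offset) != bv->bv_len)
			return -EINVAL;
	}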

-- 
Jens Axboe


* Re: [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req()
  2018-12-04 14:49   ` Christoph Hellwig
@ 2018-12-04 15:27     ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-12-04 15:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-fsdevel, linux-aio

On 12/4/18 7:49 AM, Christoph Hellwig wrote:
>> -	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
>> -	if (unlikely(!req))
>> -		return NULL;
>> +	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
>> +	if (req) {
>> +		percpu_ref_get(&ctx->reqs);
>> +		req->ki_ctx = ctx;
>> +		INIT_LIST_HEAD(&req->ki_list);
>> +		refcount_set(&req->ki_refcnt, 0);
>> +		req->ki_eventfd = NULL;
>> +	}
>>  
>> -	percpu_ref_get(&ctx->reqs);
>> -	INIT_LIST_HEAD(&req->ki_list);
>> -	refcount_set(&req->ki_refcnt, 0);
>> -	req->ki_ctx = ctx;
>>  	return req;
> 
> Why the reformatting?  Otherwise this looks fine to me:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Probably just the (over) abuse of likely/unlikely in aio.c. I can get
rid of it.

-- 
Jens Axboe


* Re: [PATCH 02/27] aio: clear IOCB_HIPRI
  2018-12-04 14:46       ` Christoph Hellwig
@ 2018-12-04 16:40         ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-12-04 16:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-fsdevel, linux-aio

On 12/4/18 7:46 AM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 10:14:31AM -0700, Jens Axboe wrote:
>> On 11/30/18 10:13 AM, Christoph Hellwig wrote:
>>> I think we'll need to queue this up for 4.21 ASAP independent of the
>>> rest, given that with separate poll queues userspace could otherwise
>>> submit I/O that will never get polled for anywhere.
>>
>> Probably a good idea, I can just move it to my 4.21 branch, it's not
>> strictly dependent on the series.
> 
> So, can you add it to the 4.21 branch?

Done

-- 
Jens Axboe


* Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
  2018-12-04 14:48       ` Christoph Hellwig
@ 2018-12-04 18:13         ` Jens Axboe
  0 siblings, 0 replies; 59+ messages in thread
From: Jens Axboe @ 2018-12-04 18:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Bart Van Assche, linux-block, linux-fsdevel, linux-aio

On 12/4/18 7:48 AM, Christoph Hellwig wrote:
> On Fri, Nov 30, 2018 at 10:17:49AM -0700, Jens Axboe wrote:
>>> Setting REQ_NOWAIT from inside the block layer will make the code that
>>> submits requests harder to review. Have you considered to make this code
>>> fail I/O if REQ_NOWAIT has not been set and to require that the context
>>> that submits I/O sets REQ_NOWAIT?
>>
>> It's technically still feasible to do for sync polled IO, it's only
>> the async case that makes it a potential deadlock.
> 
> I wonder if we want a REQ_ASYNC_POLL compound flag #define that sets
> REQ_POLL and REQ_NOWAIT to make this blindingly obvious.

Yeah, that might make sense; all the async cases should certainly use it,
and sync can keep using REQ_POLL. I'll add that and fold it in where I can.

-- 
Jens Axboe


end of thread, other threads:[~2018-12-04 18:13 UTC | newest]

Thread overview: 59+ messages
2018-11-30 16:56 [PATCHSET v4] Support for polled aio Jens Axboe
2018-11-30 16:56 ` [PATCH 01/27] aio: fix failure to put the file pointer Jens Axboe
2018-11-30 17:07   ` Bart Van Assche
2018-11-30 17:08     ` Jens Axboe
2018-11-30 17:24       ` Bart Van Assche
2018-11-30 16:56 ` [PATCH 02/27] aio: clear IOCB_HIPRI Jens Axboe
2018-11-30 17:13   ` Christoph Hellwig
2018-11-30 17:14     ` Jens Axboe
2018-12-04 14:46       ` Christoph Hellwig
2018-12-04 16:40         ` Jens Axboe
2018-11-30 16:56 ` [PATCH 03/27] fs: add an iopoll method to struct file_operations Jens Axboe
2018-11-30 16:56 ` [PATCH 04/27] block: wire up block device iopoll method Jens Axboe
2018-11-30 16:56 ` [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
2018-11-30 17:12   ` Bart Van Assche
2018-11-30 17:17     ` Jens Axboe
2018-12-04 14:48       ` Christoph Hellwig
2018-12-04 18:13         ` Jens Axboe
2018-11-30 16:56 ` [PATCH 06/27] iomap: wire up the iopoll method Jens Axboe
2018-11-30 16:56 ` [PATCH 07/27] iomap: ensure that async polled IO is marked REQ_NOWAIT Jens Axboe
2018-11-30 16:56 ` [PATCH 08/27] aio: use assigned completion handler Jens Axboe
2018-11-30 16:56 ` [PATCH 09/27] aio: separate out ring reservation from req allocation Jens Axboe
2018-11-30 16:56 ` [PATCH 10/27] aio: don't zero entire aio_kiocb aio_get_req() Jens Axboe
2018-12-04 14:49   ` Christoph Hellwig
2018-12-04 15:27     ` Jens Axboe
2018-11-30 16:56 ` [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions Jens Axboe
2018-12-04 14:50   ` Christoph Hellwig
2018-11-30 16:56 ` [PATCH 12/27] aio: use iocb_put() instead of open coding it Jens Axboe
2018-12-04 14:50   ` Christoph Hellwig
2018-11-30 16:56 ` [PATCH 13/27] aio: split out iocb copy from io_submit_one() Jens Axboe
2018-11-30 16:56 ` [PATCH 14/27] aio: abstract out io_event filler helper Jens Axboe
2018-11-30 16:56 ` [PATCH 15/27] aio: add io_setup2() system call Jens Axboe
2018-11-30 16:56 ` [PATCH 16/27] aio: add support for having user mapped iocbs Jens Axboe
2018-11-30 16:56 ` [PATCH 17/27] aio: support for IO polling Jens Axboe
2018-11-30 16:56 ` [PATCH 18/27] aio: add submission side request cache Jens Axboe
2018-11-30 16:56 ` [PATCH 19/27] fs: add fget_many() and fput_many() Jens Axboe
2018-11-30 16:56 ` [PATCH 20/27] aio: use fget/fput_many() for file references Jens Axboe
2018-11-30 16:56 ` [PATCH 21/27] aio: split iocb init from allocation Jens Axboe
2018-11-30 16:56 ` [PATCH 22/27] aio: batch aio_kiocb allocation Jens Axboe
2018-11-30 16:56 ` [PATCH 23/27] block: add BIO_HOLD_PAGES flag Jens Axboe
2018-11-30 16:56 ` [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio Jens Axboe
2018-11-30 19:21   ` Al Viro
2018-11-30 20:15     ` Jens Axboe
2018-11-30 20:32       ` Jens Axboe
2018-11-30 21:11         ` Al Viro
2018-11-30 21:16           ` Jens Axboe
2018-11-30 21:25             ` Al Viro
2018-11-30 21:34               ` Jens Axboe
2018-11-30 22:06                 ` Jens Axboe
2018-12-04 14:55     ` Christoph Hellwig
2018-12-04 15:25       ` Jens Axboe
2018-11-30 16:56 ` [PATCH 25/27] fs: add support for mapping an ITER_KVEC for O_DIRECT Jens Axboe
2018-11-30 16:56 ` [PATCH 26/27] iov_iter: add import_kvec() Jens Axboe
2018-11-30 19:17   ` Al Viro
2018-11-30 20:15     ` Jens Axboe
2018-11-30 16:56 ` [PATCH 27/27] aio: add support for pre-mapped user IO buffers Jens Axboe
2018-11-30 21:44   ` Jeff Moyer
2018-11-30 21:57     ` Jens Axboe
2018-11-30 22:04       ` Jeff Moyer
2018-11-30 22:11         ` Jens Axboe
