* [PATCHSET v9] io_uring IO interface
@ 2019-01-29 19:26 Jens Axboe
  2019-01-29 19:26 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
                   ` (17 more replies)
  0 siblings, 18 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh

Following up on all the great review from yesterday (and today),
here's a v9 that addresses all known review concerns so far.
A particularly big thanks to Jann Horn for looking into the grittier
details, which resulted in a slew of fixes. Also thanks to Christoph
for working through the patches. I feel like we're making good
progress here.

A note on ctx->compat still being there - we could store this in
struct sqe_submit, but that doesn't work for the io_sq_thread()
polled submission. Additionally, it makes more sense to keep this in
the ctx than to store it once per IO.
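
Condensed from the io_uring_create() and io_import_iovec() hunks in
patch 05 below, the pattern is simply:

	/* io_uring_setup(2) path, still in the creating task's context */
	ctx->compat = in_compat_syscall();

	/* submission path, which may later run from io_sq_thread() */
	if (ctx->compat)
		return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
						iovec, iter);
	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);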

No new changes in the liburing user side library, but as a reference,
you can clone that here:

git://git.kernel.dk/liburing

We're still missing a man page for io_uring_enter(2), but the two other
system calls are documented.

Patches are against 5.0-rc4, and can also be found in my io_uring branch
here:

git://git.kernel.dk/linux-block io_uring

Changes since v8:
- Check for p->sq_thread_cpu being possible
- Check for valid flags in io_uring_enter(2)
- Cap 'to_submit' at SQ ring size in io_uring_enter(2)
- Fix files/mm references
- Don't bother with ctx referencing in io_uring_register(2)
- Use READ/WRITE_ONCE for ring updates/reads (sketched below)
- Use percpu_ref_tryget() for io_get_req()
- Protect sqe reads (that matter) with READ_ONCE()
- Store compat syscall info in the ctx. Still derived from
  in_compat_syscall(), but we need access to it from the io_sq_thread()
  as well.
- Don't make IORING_MAX_ENTRIES user visible
- Address various review comments
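
As a minimal illustration of the READ/WRITE_ONCE items above, condensed
from the io_get_sqring()/io_commit_sqring() hunks in patch 05:

	/* read the shared SQ tail and the sqe index the app published */
	if (ctx->cached_sq_head == READ_ONCE(ring->r.tail))
		return false;
	head = READ_ONCE(ring->array[ctx->cached_sq_head & ctx->sq_mask]);

	/* later, publish the kernel side head update back to the app */
	WRITE_ONCE(ring->r.head, ctx->cached_sq_head);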

 Documentation/filesystems/vfs.txt      |    3 +
 arch/x86/entry/syscalls/syscall_32.tbl |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   59 +-
 fs/Makefile                            |    1 +
 fs/block_dev.c                         |   19 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/gfs2/file.c                         |    2 +
 fs/io_uring.c                          | 2599 ++++++++++++++++++++++++
 fs/iomap.c                             |   48 +-
 fs/xfs/xfs_file.c                      |    1 +
 include/linux/bio.h                    |   14 +
 include/linux/blk_types.h              |    1 +
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |    6 +-
 include/linux/iomap.h                  |    1 +
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    8 +
 include/uapi/asm-generic/unistd.h      |    8 +-
 include/uapi/linux/io_uring.h          |  141 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 23 files changed, 2916 insertions(+), 41 deletions(-)

-- 
Jens Axboe




* [PATCH 01/18] fs: add an iopoll method to struct file_operations
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

This new method is used to explicitly poll for I/O completion for an
iocb.  It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.
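
As a rough sketch of how a submitter of a HIPRI iocb is expected to reap
it (illustrative only - 'done' stands for whatever completion state the
caller's ki_complete handler maintains, it is not part of this patch):

	/* actively poll until the iocb's completion has been found */
	while (!READ_ONCE(done)) {
		int ret = kiocb->ki_filp->f_op->iopoll(kiocb, true);
		if (ret < 0)
			break;
	}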

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
 
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..ccb0b7a63aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1786,6 +1787,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-- 
2.17.1



* [PATCH 02/18] block: wire up block device iopoll method
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
  2019-01-29 19:26 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 58a4c1217fa8..f18d076a2596 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -293,6 +293,14 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -410,6 +418,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio->bi_opf |= REQ_HIPRI;
 
 			qc = submit_bio(bio);
+			WRITE_ONCE(iocb->ki_cookie, qc);
 			break;
 		}
 
@@ -2076,6 +2085,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
-- 
2.17.1



* [PATCH 03/18] block: add bio_set_polled() helper
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
  2019-01-29 19:26 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
  2019-01-29 19:26 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

For the upcoming async polled IO, we can't sleep when allocating requests.
If we do, we introduce a deadlock where the submitter already has async
polled IO in-flight, but can't wait for it to complete, since polled
requests must be actively found and reaped.

Add a bio_set_polled() helper for this, and utilize it in the blockdev
DIRECT_IO code.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f18d076a2596..392e2bfb636f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -247,7 +247,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);
 
 	qc = submit_bio(&bio);
 	for (;;) {
@@ -415,7 +415,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);
 
 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
-- 
2.17.1



* [PATCH 04/18] iomap: wire up the iopoll method
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (2 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to use bio_set_polled().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            | 43 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_file.c     |  1 +
 include/linux/iomap.h |  1 +
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a2dea5bc0427..58a768e59712 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
@@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..4ee50b76b4a1 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1454,6 +1454,28 @@ struct iomap_dio {
 	};
 };
 
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
+{
+	struct request_queue *q = READ_ONCE(kiocb->private);
+
+	if (!q)
+		return 0;
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
+}
+EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
+
+static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
+		struct bio *bio)
+{
+	atomic_inc(&dio->ref);
+
+	if (dio->iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, dio->iocb);
+
+	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+	dio->submit.cookie = submit_bio(bio);
+}
+
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
@@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }
 
-static blk_qc_t
+static void
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
@@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	if (dio->iocb->ki_flags & IOCB_HIPRI)
-		flags |= REQ_HIPRI;
-
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-
-	atomic_inc(&dio->ref);
-	return submit_bio(bio);
+	iomap_dio_submit_bio(dio, iomap, bio);
 }
 
 static loff_t
@@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 				bio_set_pages_dirty(bio);
 		}
 
-		if (dio->iocb->ki_flags & IOCB_HIPRI)
-			bio->bi_opf |= REQ_HIPRI;
-
 		iov_iter_advance(dio->submit.iter, n);
 
 		dio->size += n;
@@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-		atomic_inc(&dio->ref);
-
-		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-		dio->submit.cookie = submit_bio(bio);
+		iomap_dio_submit_bio(dio, iomap, bio);
 	} while (nr_pages);
 
 	/*
@@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
+	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+	WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!dio->wait_for_completion)
 			return -EIOCBQUEUED;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..60c2da41f0fc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
 	.write_iter	= xfs_file_write_iter,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= xfs_file_compat_ioctl,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..0fefb5455bda 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
 		unsigned flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
 struct file;
-- 
2.17.1



* [PATCH 05/18] Add io_uring IO interface
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (3 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring holds indices into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
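
As a minimal userspace-side sketch of that indirection (illustrative
only - the sqes/sq_array/sq_mask/sq_tail names stand for the mmap'ed
regions and offsets returned by io_uring_setup(2) below, and the
required memory barriers are omitted):

	unsigned index = 0;			/* any free sqe slot will do */
	struct io_uring_sqe *sqe = &sqes[index];

	sqe->opcode = IORING_OP_READV;
	sqe->fd = fd;
	sqe->addr = (unsigned long) iovecs;
	sqe->len = nr_vecs;
	sqe->user_data = my_tag;		/* echoed back in the cqe */

	/* publish the slot index at the current tail, then bump the tail */
	sq_array[tail & sq_mask] = index;
	*sq_tail = ++tail;			/* shared ring tail */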

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time; this allows the
	kernel to return already completed events without waiting
	for them. This is only useful for polled IO, since for IRQ
	driven IO the application can just check the CQ ring
	without entering the kernel.
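
A minimal raw-syscall sketch of that flow (illustrative only - the
425/426 numbers match the x86 syscall tables wired up in this patch,
and mmap/error handling is omitted):

	struct io_uring_params p = { 0 };
	int ring_fd = syscall(425 /* __NR_io_uring_setup */, 8, &p);

	/* mmap the SQ/CQ rings and sqe array using p.sq_off/p.cq_off,
	 * then fill sqes and advance the SQ tail as sketched above */

	/* submit one sqe and wait for at least one completion */
	int ret = syscall(426 /* __NR_io_uring_enter */, ring_fd, 1, 1,
				IORING_ENTER_GETEVENTS, NULL, 0);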

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is served inline. This avoids the slowness
issue of typical threadpools, since cached data is accessed just as
quickly as with a sync interface.
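
Condensed from the io_submit_sqe() hunk below, the punt-on-would-block
pattern is:

	ret = __io_submit_sqe(ctx, req, s, true);	/* force_nonblock */
	if (ret == -EAGAIN) {
		/* would have blocked, hand the sqe off to the workqueue */
		memcpy(&req->submit, s, sizeof(*s));
		INIT_WORK(&req->work, io_sq_wq_submit_work);
		queue_work(ctx->sqo_wq, &req->work);
		ret = 0;
	}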

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/Makefile                            |    1 +
 fs/io_uring.c                          | 1118 ++++++++++++++++++++++++
 include/linux/syscalls.h               |    6 +
 include/uapi/asm-generic/unistd.h      |    6 +-
 include/uapi/linux/io_uring.h          |   94 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    2 +
 9 files changed, 1239 insertions(+), 1 deletion(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..481c126259e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,5 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
+426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..6a32a430c8e0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,8 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+425	common	io_uring_setup		__x64_sys_io_uring_setup
+426	common	io_uring_enter		__x64_sys_io_uring_enter
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..8e15d6fc4340 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c
new file mode 100644
index 000000000000..b41fed1c056b
--- /dev/null
+++ b/fs/io_uring.c
@@ -0,0 +1,1118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared application/kernel submission and completion ring pairs, for
+ * supporting fast/efficient IO.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/refcount.h>
+#include <linux/uio.h>
+
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_context.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/blkdev.h>
+#include <linux/anon_inodes.h>
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+#include <linux/nospec.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "internal.h"
+
+#define IORING_MAX_ENTRIES	4096
+
+struct io_uring {
+	u32 head ____cacheline_aligned_in_smp;
+	u32 tail ____cacheline_aligned_in_smp;
+};
+
+struct io_sq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			dropped;
+	u32			flags;
+	u32			array[];
+};
+
+struct io_cq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			overflow;
+	struct io_uring_cqe	cqes[];
+};
+
+struct io_ring_ctx {
+	struct {
+		struct percpu_ref	refs;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned int		flags;
+		bool			compat;
+
+		/* SQ ring */
+		struct io_sq_ring	*sq_ring;
+		unsigned		cached_sq_head;
+		unsigned		sq_entries;
+		unsigned		sq_mask;
+		unsigned		sq_thread_cpu;
+		struct io_uring_sqe	*sq_sqes;
+	} ____cacheline_aligned_in_smp;
+
+	/* IO offload */
+	struct workqueue_struct	*sqo_wq;
+	struct mm_struct	*sqo_mm;
+	struct files_struct	*sqo_files;
+
+	struct {
+		/* CQ ring */
+		struct io_cq_ring	*cq_ring;
+		unsigned		cached_cq_tail;
+		unsigned		cq_entries;
+		unsigned		cq_mask;
+		struct wait_queue_head	cq_wait;
+		struct fasync_struct	*cq_fasync;
+	} ____cacheline_aligned_in_smp;
+
+	struct user_struct	*user;
+
+	struct completion	ctx_done;
+
+	struct {
+		struct mutex		uring_lock;
+		wait_queue_head_t	wait;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		spinlock_t		completion_lock;
+	} ____cacheline_aligned_in_smp;
+};
+
+struct sqe_submit {
+	const struct io_uring_sqe	*sqe;
+	unsigned			index;
+};
+
+struct io_kiocb {
+	union {
+		struct kiocb		rw;
+		struct sqe_submit	submit;
+	};
+
+	struct io_ring_ctx	*ctx;
+	struct list_head	list;
+	unsigned int		flags;
+#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+	u64			user_data;
+
+	struct work_struct	work;
+};
+
+#define IO_PLUG_THRESHOLD		2
+
+static struct kmem_cache *req_cachep;
+
+static const struct file_operations io_uring_fops;
+
+static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+{
+	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
+
+	complete(&ctx->ctx_done);
+}
+
+static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	ctx->flags = p->flags;
+	init_waitqueue_head(&ctx->cq_wait);
+	init_completion(&ctx->ctx_done);
+	mutex_init(&ctx->uring_lock);
+	init_waitqueue_head(&ctx->wait);
+	spin_lock_init(&ctx->completion_lock);
+	return ctx;
+}
+
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+
+	if (ctx->cached_cq_tail != ring->r.tail) {
+		/* order cqe stores with ring update */
+		smp_wmb();
+		WRITE_ONCE(ring->r.tail, ctx->cached_cq_tail);
+		/* write side barrier of tail update, app has read side */
+		smp_wmb();
+
+		if (wq_has_sleeper(&ctx->cq_wait)) {
+			wake_up_interruptible(&ctx->cq_wait);
+			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+		}
+	}
+}
+
+static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	unsigned tail;
+
+	tail = ctx->cached_cq_tail;
+	smp_rmb();
+	if (tail + 1 == READ_ONCE(ring->r.head))
+		return NULL;
+
+	ctx->cached_cq_tail++;
+	return &ring->cqes[tail & ctx->cq_mask];
+}
+
+static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				 long res, unsigned ev_flags)
+{
+	struct io_uring_cqe *cqe;
+
+	/*
+	 * If we can't get a cq entry, userspace overflowed the
+	 * submission (by quite a lot). Increment the overflow count in
+	 * the ring.
+	 */
+	cqe = io_get_cqring(ctx);
+	if (cqe) {
+		cqe->user_data = ki_user_data;
+		cqe->res = res;
+		cqe->flags = ev_flags;
+	} else
+		ctx->cq_ring->overflow++;
+}
+
+static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				long res, unsigned ev_flags)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	io_cqring_fill_event(ctx, ki_user_data, res, ev_flags);
+	io_commit_cqring(ctx);
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
+{
+	percpu_ref_put_many(&ctx->refs, refs);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	if (!percpu_ref_tryget(&ctx->refs))
+		return NULL;
+
+	req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
+	if (req) {
+		req->ctx = ctx;
+		req->flags = 0;
+		return req;
+	}
+
+	io_ring_drop_ctx_refs(ctx, 1);
+	return NULL;
+}
+
+static void io_free_req(struct io_kiocb *req)
+{
+	io_ring_drop_ctx_refs(req->ctx, 1);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void kiocb_end_write(struct kiocb *kiocb)
+{
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct inode *inode = file_inode(kiocb->ki_filp);
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		if (S_ISREG(inode->i_mode))
+			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+		file_end_write(kiocb->ki_filp);
+	}
+}
+
+static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	fput(kiocb->ki_filp);
+	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	io_free_req(req);
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		      bool force_nonblock)
+{
+	struct kiocb *kiocb = &req->rw;
+	unsigned ioprio;
+	int fd, ret;
+
+	fd = READ_ONCE(sqe->fd);
+	kiocb->ki_filp = fget(fd);
+	if (unlikely(!kiocb->ki_filp))
+		return -EBADF;
+	kiocb->ki_pos = READ_ONCE(sqe->off);
+	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
+	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
+		if (ret)
+			goto out_fput;
+
+		kiocb->ki_ioprio = ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		goto out_fput;
+	if (force_nonblock) {
+		kiocb->ki_flags |= IOCB_NOWAIT;
+		req->flags |= REQ_F_FORCE_NONBLOCK;
+	}
+	if (kiocb->ki_flags & IOCB_HIPRI) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	kiocb->ki_complete = io_complete_rw;
+	return 0;
+out_fput:
+	fput(kiocb->ki_filp);
+	return ret;
+}
+
+static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
+{
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/*
+		 * We can't just restart the syscall, since previously
+		 * submitted sqes may already be in progress. Just fail this
+		 * IO with EINTR.
+		 */
+		ret = -EINTR;
+		/* fall through */
+	default:
+		kiocb->ki_complete(kiocb, ret, 0);
+	}
+}
+
+static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe, struct iovec **iovec,
+			   struct iov_iter *iter)
+{
+	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	size_t sqe_len = READ_ONCE(sqe->len);
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat)
+		return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
+						iovec, iter);
+#endif
+
+	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
+}
+
+static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		       bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->read_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	if (!ret) {
+		ssize_t ret2;
+
+		/* Catch -EAGAIN return for forced non-blocking submission */
+		ret2 = call_read_iter(file, kiocb, &iter);
+		if (!force_nonblock || ret2 != -EAGAIN)
+			io_rw_done(kiocb, ret2);
+		else
+			ret = -EAGAIN;
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
+		goto out_fput;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->write_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, WRITE, sqe, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
+				iov_iter_count(&iter));
+	if (!ret) {
+		/*
+		 * Open-code file_start_write here to grab freeze protection,
+		 * which will be released by another thread in
+		 * io_complete_rw().  Fool lockdep by telling it the lock got
+		 * released so that it doesn't complain about the held lock when
+		 * we return to userspace.
+		 */
+		if (S_ISREG(file_inode(file)->i_mode)) {
+			__sb_start_write(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE, true);
+			__sb_writers_release(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE);
+		}
+		kiocb->ki_flags |= IOCB_WRITE;
+		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+/*
+ * IORING_OP_NOP just posts a completion event, nothing else.
+ */
+static int io_nop(struct io_kiocb *req, u64 user_data)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	io_cqring_add_event(ctx, user_data, 0, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			   const struct sqe_submit *s, bool force_nonblock)
+{
+	ssize_t ret;
+	int opcode;
+
+	if (unlikely(s->index >= ctx->sq_entries))
+		return -EINVAL;
+	req->user_data = READ_ONCE(s->sqe->user_data);
+
+	ret = -EINVAL;
+	opcode = READ_ONCE(s->sqe->opcode);
+	switch (opcode) {
+	case IORING_OP_NOP:
+		ret = io_nop(req, req->user_data);
+		break;
+	case IORING_OP_READV:
+		ret = io_read(req, s->sqe, force_nonblock);
+		break;
+	case IORING_OP_WRITEV:
+		ret = io_write(req, s->sqe, force_nonblock);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static void io_sq_wq_submit_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct sqe_submit *s = &req->submit;
+	u64 user_data = READ_ONCE(s->sqe->user_data);
+	struct io_ring_ctx *ctx = req->ctx;
+	mm_segment_t old_fs = get_fs();
+	struct files_struct *old_files;
+	int ret;
+
+	 /* Ensure we clear previously set forced non-block flag */
+	req->flags &= ~REQ_F_FORCE_NONBLOCK;
+
+	task_lock(current);
+	old_files = current->files;
+	current->files = ctx->sqo_files;
+	task_unlock(current);
+
+	if (!mmget_not_zero(ctx->sqo_mm)) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	use_mm(ctx->sqo_mm);
+	set_fs(USER_DS);
+
+	ret = __io_submit_sqe(ctx, req, s, false);
+
+	set_fs(old_fs);
+	unuse_mm(ctx->sqo_mm);
+	mmput(ctx->sqo_mm);
+err:
+	if (ret) {
+		io_cqring_add_event(ctx, user_data, ret, 0);
+		io_free_req(req);
+	}
+
+	task_lock(current);
+	current->files = old_files;
+	task_unlock(current);
+}
+
+static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s)
+{
+	struct io_kiocb *req;
+	ssize_t ret;
+
+	/* enforce forwards compatibility on users */
+	if (unlikely(s->sqe->flags))
+		return -EINVAL;
+
+	req = io_get_req(ctx);
+	if (unlikely(!req))
+		return -EAGAIN;
+
+	ret = __io_submit_sqe(ctx, req, s, true);
+	if (ret == -EAGAIN) {
+		memcpy(&req->submit, s, sizeof(*s));
+		INIT_WORK(&req->work, io_sq_wq_submit_work);
+		queue_work(ctx->sqo_wq, &req->work);
+		ret = 0;
+	}
+	if (ret)
+		io_free_req(req);
+
+	return ret;
+}
+
+static void io_commit_sqring(struct io_ring_ctx *ctx)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+
+	if (ctx->cached_sq_head != ring->r.head) {
+		WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
+		/* write side barrier of head update, app has read side */
+		smp_wmb();
+	}
+}
+
+/*
+ * Undo last io_get_sqring()
+ */
+static void io_drop_sqring(struct io_ring_ctx *ctx)
+{
+	ctx->cached_sq_head--;
+}
+
+/*
+ * Fetch an sqe, if one is available. Note that s->sqe will point to memory
+ * that is mapped by userspace. This means that care needs to be taken to
+ * ensure that reads are stable, as we cannot rely on userspace always
+ * being a good citizen. If members of the sqe are validated and then later
+ * used, it's important that those reads are done through READ_ONCE() to
+ * prevent a re-load down the line.
+ */
+static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+	unsigned head;
+
+	/*
+	 * The cached sq head (or cq tail) serves two purposes:
+	 *
+	 * 1) allows us to batch the cost of updating the user visible
+	 *    head updates.
+	 * 2) allows the kernel side to track the head on its own, even
+	 *    though the application is the one updating it.
+	 */
+	head = ctx->cached_sq_head;
+	smp_rmb();
+	if (head == READ_ONCE(ring->r.tail))
+		return false;
+
+	head = READ_ONCE(ring->array[head & ctx->sq_mask]);
+	if (head < ctx->sq_entries) {
+		s->index = head;
+		s->sqe = &ctx->sq_sqes[head];
+		ctx->cached_sq_head++;
+		return true;
+	}
+
+	/* drop invalid entries */
+	ctx->cached_sq_head++;
+	ring->dropped++;
+	smp_wmb();
+	return false;
+}
+
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+{
+	int i, ret = 0, submit = 0;
+	struct blk_plug plug;
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
+
+	for (i = 0; i < to_submit; i++) {
+		struct sqe_submit s;
+
+		if (!io_get_sqring(ctx, &s))
+			break;
+
+		ret = io_submit_sqe(ctx, &s);
+		if (ret) {
+			io_drop_sqring(ctx);
+			break;
+		}
+
+		submit++;
+	}
+	io_commit_sqring(ctx);
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
+
+	return submit ? submit : ret;
+}
+
+/*
+ * Wait until events become available, if we don't already have some. The
+ * application must reap them itself, as they reside on the shared cq ring.
+ */
+static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
+			  const sigset_t __user *sig, size_t sigsz)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	sigset_t ksigmask, sigsaved;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	smp_rmb();
+	if (ring->r.head != ring->r.tail)
+		return 0;
+	if (!min_events)
+		return 0;
+
+	if (sig) {
+		ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz);
+		if (ret)
+			return ret;
+	}
+
+	do {
+		prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
+
+		ret = 0;
+		smp_rmb();
+		if (ring->r.head != ring->r.tail)
+			break;
+
+		schedule();
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+	} while (1);
+
+	finish_wait(&ctx->wait, &wait);
+
+	if (sig)
+		restore_user_sigmask(sig, &sigsaved);
+
+	return ring->r.head == ring->r.tail ? ret : 0;
+}
+
+static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
+			    unsigned min_complete, unsigned flags,
+			    const sigset_t __user *sig, size_t sigsz)
+{
+	int submitted, ret;
+
+	submitted = ret = 0;
+	if (to_submit) {
+		to_submit = min(to_submit, ctx->sq_entries);
+
+		submitted = io_ring_submit(ctx, to_submit);
+		if (submitted < 0)
+			return submitted;
+	}
+	if (flags & IORING_ENTER_GETEVENTS) {
+		/*
+		 * The application could have included the 'to_submit' count
+		 * in how many events it wanted to wait for. If we failed to
+		 * submit the desired count, we may need to adjust the number
+		 * of events to poll/wait for.
+		 */
+		if (submitted < to_submit)
+			min_complete = min_t(unsigned, submitted, min_complete);
+
+		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+	}
+
+	return submitted ? submitted : ret;
+}
+
+static int io_sq_offload_start(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	mmgrab(current->mm);
+	ctx->sqo_mm = current->mm;
+
+	ret = -EBADF;
+	ctx->sqo_files = get_files_struct(current);
+	if (!ctx->sqo_files)
+		goto err;
+
+	/* Do QD, or 2 * CPUS, whatever is smallest */
+	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
+	if (!ctx->sqo_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+err:
+	if (ctx->sqo_files) {
+		put_files_struct(ctx->sqo_files);
+		ctx->sqo_files = NULL;
+	}
+	mmdrop(ctx->sqo_mm);
+	ctx->sqo_mm = NULL;
+	return ret;
+}
+
+static void __io_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		__io_unaccount_mem(ctx->user, nr_pages);
+}
+
+static int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+					new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void io_mem_free(void *ptr)
+{
+	struct page *page = virt_to_head_page(ptr);
+
+	if (put_page_testzero(page))
+		free_compound_page(page);
+}
+
+static void *io_mem_alloc(size_t size)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
+				__GFP_NORETRY;
+
+	return (void *) __get_free_pages(gfp_flags, get_order(size));
+}
+
+static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t bytes;
+
+	bytes = struct_size(sq_ring, array, sq_entries);
+	bytes += array_size(sizeof(struct io_uring_sqe), sq_entries);
+	bytes += struct_size(cq_ring, cqes, cq_entries);
+
+	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+}
+
+static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+{
+	destroy_workqueue(ctx->sqo_wq);
+	mmdrop(ctx->sqo_mm);
+	put_files_struct(ctx->sqo_files);
+
+	io_mem_free(ctx->sq_ring);
+	io_mem_free(ctx->sq_sqes);
+	io_mem_free(ctx->cq_ring);
+
+	percpu_ref_exit(&ctx->refs);
+	io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries));
+	kfree(ctx);
+}
+
+static __poll_t io_uring_poll(struct file *file, poll_table *wait)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ctx->cq_wait, wait);
+	smp_rmb();
+	if (ctx->sq_ring->r.tail + 1 != ctx->cached_sq_head)
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	if (ctx->cq_ring->r.head != ctx->cached_cq_tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int io_uring_fasync(int fd, struct file *file, int on)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	return fasync_helper(fd, file, on, &ctx->cq_fasync);
+}
+
+static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	percpu_ref_kill(&ctx->refs);
+	mutex_unlock(&ctx->uring_lock);
+
+	wait_for_completion(&ctx->ctx_done);
+	io_ring_ctx_free(ctx);
+}
+
+static int io_uring_release(struct inode *inode, struct file *file)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	file->private_data = NULL;
+	io_ring_ctx_wait_and_kill(ctx);
+	return 0;
+}
+
+static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long sz = vma->vm_end - vma->vm_start;
+	struct io_ring_ctx *ctx = file->private_data;
+	unsigned long pfn;
+	struct page *page;
+	void *ptr;
+
+	switch (offset) {
+	case IORING_OFF_SQ_RING:
+		ptr = ctx->sq_ring;
+		break;
+	case IORING_OFF_SQES:
+		ptr = ctx->sq_sqes;
+		break;
+	case IORING_OFF_CQ_RING:
+		ptr = ctx->cq_ring;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > (PAGE_SIZE << compound_order(page)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
+		u32, min_complete, u32, flags, const sigset_t __user *, sig,
+		size_t, sigsz)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	if (flags & ~IORING_ENTER_GETEVENTS)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	ret = -EBUSY;
+	if (!mutex_trylock(&ctx->uring_lock))
+		goto out_ctx;
+
+	ret = __io_uring_enter(ctx, to_submit, min_complete, flags, sig, sigsz);
+	mutex_unlock(&ctx->uring_lock);
+out_ctx:
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static const struct file_operations io_uring_fops = {
+	.release	= io_uring_release,
+	.mmap		= io_uring_mmap,
+	.poll		= io_uring_poll,
+	.fasync		= io_uring_fasync,
+};
+
+static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+				  struct io_uring_params *p)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t size;
+
+	sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries));
+	if (!sq_ring)
+		return -ENOMEM;
+
+	ctx->sq_ring = sq_ring;
+	sq_ring->ring_mask = p->sq_entries - 1;
+	sq_ring->ring_entries = p->sq_entries;
+	ctx->sq_mask = sq_ring->ring_mask;
+	ctx->sq_entries = sq_ring->ring_entries;
+
+	size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+	if (size == SIZE_MAX)
+		return -EOVERFLOW;
+
+	ctx->sq_sqes = io_mem_alloc(size);
+	if (!ctx->sq_sqes) {
+		io_mem_free(ctx->sq_ring);
+		return -ENOMEM;
+	}
+
+	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
+	if (!cq_ring) {
+		io_mem_free(ctx->sq_ring);
+		io_mem_free(ctx->sq_sqes);
+		return -ENOMEM;
+	}
+
+	ctx->cq_ring = cq_ring;
+	cq_ring->ring_mask = p->cq_entries - 1;
+	cq_ring->ring_entries = p->cq_entries;
+	ctx->cq_mask = cq_ring->ring_mask;
+	ctx->cq_entries = cq_ring->ring_entries;
+	return 0;
+}
+
+static int io_uring_create(unsigned entries, struct io_uring_params *p)
+{
+	struct user_struct *user = NULL;
+	struct io_ring_ctx *ctx;
+	int ret;
+
+	if (!entries || entries > IORING_MAX_ENTRIES)
+		return -EINVAL;
+
+	/*
+	 * Use twice as many entries for the CQ ring. It's possible for the
+	 * application to drive a higher depth than the size of the SQ ring,
+	 * since the sqes are only used at submission time. This allows for
+	 * some flexibility in overcommitting a bit.
+	 */
+	p->sq_entries = roundup_pow_of_two(entries);
+	p->cq_entries = 2 * p->sq_entries;
+
+	if (!capable(CAP_IPC_LOCK)) {
+		user = get_uid(current_user());
+		ret = __io_account_mem(user, ring_pages(p->sq_entries,
+							p->cq_entries));
+		if (ret) {
+			free_uid(user);
+			return ret;
+		}
+	}
+
+	ctx = io_ring_ctx_alloc(p);
+	if (!ctx) {
+		__io_unaccount_mem(user, ring_pages(p->sq_entries,
+							p->cq_entries));
+		free_uid(user);
+		return -ENOMEM;
+	}
+	ctx->compat = in_compat_syscall();
+	ctx->user = user;
+
+	ret = io_allocate_scq_urings(ctx, p);
+	if (ret)
+		goto err;
+
+	ret = io_sq_offload_start(ctx);
+	if (ret)
+		goto err;
+
+	ret = anon_inode_getfd("[io_uring]", &io_uring_fops, ctx,
+				O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto err;
+
+	memset(&p->sq_off, 0, sizeof(p->sq_off));
+	p->sq_off.head = offsetof(struct io_sq_ring, r.head);
+	p->sq_off.tail = offsetof(struct io_sq_ring, r.tail);
+	p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask);
+	p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries);
+	p->sq_off.flags = offsetof(struct io_sq_ring, flags);
+	p->sq_off.dropped = offsetof(struct io_sq_ring, dropped);
+	p->sq_off.array = offsetof(struct io_sq_ring, array);
+
+	memset(&p->cq_off, 0, sizeof(p->cq_off));
+	p->cq_off.head = offsetof(struct io_cq_ring, r.head);
+	p->cq_off.tail = offsetof(struct io_cq_ring, r.tail);
+	p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask);
+	p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries);
+	p->cq_off.overflow = offsetof(struct io_cq_ring, overflow);
+	p->cq_off.cqes = offsetof(struct io_cq_ring, cqes);
+	return ret;
+err:
+	io_ring_ctx_wait_and_kill(ctx);
+	return ret;
+}
+
+/*
+ * Sets up an aio uring context, and returns the fd. The application asks for a
+ * ring size, we return the actual sq/cq ring sizes (among other things) in the
+ * params structure passed in.
+ */
+static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
+{
+	struct io_uring_params p;
+	long ret;
+	int i;
+
+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
+	for (i = 0; i < ARRAY_SIZE(p.resv); i++) {
+		if (p.resv[i])
+			return -EINVAL;
+	}
+
+	if (p.flags)
+		return -EINVAL;
+
+	ret = io_uring_create(entries, &p);
+	if (ret < 0)
+		return ret;
+
+	if (copy_to_user(params, &p, sizeof(p)))
+		return -EFAULT;
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(io_uring_setup, u32, entries,
+		struct io_uring_params __user *, params)
+{
+	return io_uring_setup(entries, params);
+}
+
+static int __init io_uring_init(void)
+{
+	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	return 0;
+};
+__initcall(io_uring_init);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 257cccba3062..3072dbaa7869 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct file_handle;
 struct sigaltstack;
 struct rseq;
 union bpf_attr;
+struct io_uring_params;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -309,6 +310,11 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
 				struct io_event __user *events,
 				struct old_timespec32 __user *timeout,
 				const struct __aio_sigset *sig);
+asmlinkage long sys_io_uring_setup(u32 entries,
+				struct io_uring_params __user *p);
+asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
+				u32 min_complete, u32 flags,
+				const sigset_t __user *sig, size_t sigsz);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d90127298f12..87871e7b7ea7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -740,9 +740,13 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
 __SYSCALL(__NR_rseq, sys_rseq)
 #define __NR_kexec_file_load 294
 __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
+#define __NR_io_uring_setup 425
+__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
+#define __NR_io_uring_enter 426
+__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 
 #undef __NR_syscalls
-#define __NR_syscalls 295
+#define __NR_syscalls 427
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
new file mode 100644
index 000000000000..bd47e86701ea
--- /dev/null
+++ b/include/uapi/linux/io_uring.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Header file for the io_uring interface.
+ *
+ * Copyright (C) 2019 Jens Axboe
+ * Copyright (C) 2019 Christoph Hellwig
+ */
+#ifndef LINUX_IO_URING_H
+#define LINUX_IO_URING_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+/*
+ * IO submission data structure (Submission Queue Entry)
+ */
+struct io_uring_sqe {
+	__u8	opcode;		/* type of operation for this sqe */
+	__u8	flags;		/* as of now unused */
+	__u16	ioprio;		/* ioprio for the request */
+	__s32	fd;		/* file descriptor to do IO on */
+	__u64	off;		/* offset into file */
+	__u64	addr;		/* pointer to buffer or iovecs */
+	__u32	len;		/* buffer size or number of iovecs */
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		__resv;
+	};
+	__u64	user_data;	/* data to be passed back at completion time */
+	__u64	__pad2[3];
+};
+
+#define IORING_OP_NOP		0
+#define IORING_OP_READV		1
+#define IORING_OP_WRITEV	2
+
+/*
+ * IO completion data structure (Completion Queue Entry)
+ */
+struct io_uring_cqe {
+	__u64	user_data;	/* sqe->data submission passed back */
+	__s32	res;		/* result code for this event */
+	__u32	flags;
+};
+
+/*
+ * Magic offsets for the application to mmap the data it needs
+ */
+#define IORING_OFF_SQ_RING		0ULL
+#define IORING_OFF_CQ_RING		0x8000000ULL
+#define IORING_OFF_SQES			0x10000000ULL
+
+/*
+ * Filled with the offset for mmap(2)
+ */
+struct io_sqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 flags;
+	__u32 dropped;
+	__u32 array;
+	__u32 resv[3];
+};
+
+struct io_cqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 overflow;
+	__u32 cqes;
+	__u32 resv[4];
+};
+
+/*
+ * io_uring_enter(2) flags
+ */
+#define IORING_ENTER_GETEVENTS	(1U << 0)
+
+/*
+ * Passed in for io_uring_setup(2). Copied back with updated info on success
+ */
+struct io_uring_params {
+	__u32 sq_entries;
+	__u32 cq_entries;
+	__u32 flags;
+	__u32 resv[7];
+	struct io_sqring_offsets sq_off;
+	struct io_cqring_offsets cq_off;
+};
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 513fa544a134..0cf723867e69 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1403,6 +1403,15 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.
 
+config IO_URING
+	bool "Enable IO uring support" if EXPERT
+	select ANON_INODES
+	default y
+	help
+	  This option enables support for the io_uring interface, enabling
+	  applications to submit and complete IO through submission and
+	  completion rings that are shared between the kernel and application.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ab9d0e3c6d50..ee5e523564bb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents);
 COND_SYSCALL(io_pgetevents);
 COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
+COND_SYSCALL(io_uring_setup);
+COND_SYSCALL(io_uring_enter);
 
 /* fs/xattr.c */
 
-- 
2.17.1



* [PATCH 06/18] io_uring: add fsync support
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (4 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Add a new fsync opcode, which either syncs a range if one is passed,
or the whole file if the offset and length fields are both cleared
to zero.  A flag is provided to use fdatasync semantics, that is, only
force out the metadata required to retrieve the file data, but not
other metadata (such as timestamps).
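
A minimal sketch of filling an sqe for this opcode (illustrative only -
'sqe' points at a free submission queue entry obtained as in the earlier
patch, and 'my_tag' is just an example completion tag):

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FSYNC;
	sqe->fd = fd;
	sqe->off = 0;				/* off/len == 0 syncs the whole file */
	sqe->len = 0;
	sqe->fsync_flags = IORING_FSYNC_DATASYNC;	/* fdatasync semantics */
	sqe->user_data = my_tag;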

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 40 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  8 ++++++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b41fed1c056b..c75a3e197ed5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4,6 +4,7 @@
  * supporting fast/efficient IO.
  *
  * Copyright (C) 2018-2019 Jens Axboe
+ * Copyright (c) 2018-2019 Christoph Hellwig
  */
 #include <linux/kernel.h>
 #include <linux/init.h>
@@ -473,6 +474,42 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	return 0;
 }
 
+static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		    bool force_nonblock)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	loff_t sqe_off = READ_ONCE(sqe->off);
+	loff_t sqe_len = READ_ONCE(sqe->len);
+	loff_t end = sqe_off + sqe_len;
+	unsigned fsync_flags;
+	struct file *file;
+	int ret, fd;
+
+	/* fsync always requires a blocking context */
+	if (force_nonblock)
+		return -EAGAIN;
+
+	if (unlikely(sqe->addr || sqe->ioprio))
+		return -EINVAL;
+
+	fsync_flags = READ_ONCE(sqe->fsync_flags);
+	if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC))
+		return -EINVAL;
+
+	fd = READ_ONCE(sqe->fd);
+	file = fget(fd);
+	if (unlikely(!file))
+		return -EBADF;
+
+	ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX,
+				fsync_flags & IORING_FSYNC_DATASYNC);
+
+	fput(file);
+	io_cqring_add_event(ctx, sqe->user_data, ret, 0);
+	io_free_req(req);
+	return 0;
+}
+
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   const struct sqe_submit *s, bool force_nonblock)
 {
@@ -495,6 +532,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_WRITEV:
 		ret = io_write(req, s->sqe, force_nonblock);
 		break;
+	case IORING_OP_FSYNC:
+		ret = io_fsync(req, s->sqe, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index bd47e86701ea..0fca46f8fc37 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -24,7 +24,7 @@ struct io_uring_sqe {
 	__u32	len;		/* buffer size or number of iovecs */
 	union {
 		__kernel_rwf_t	rw_flags;
-		__u32		__resv;
+		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	__u64	__pad2[3];
@@ -33,6 +33,12 @@ struct io_uring_sqe {
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
+#define IORING_OP_FSYNC		3
+
+/*
+ * sqe->fsync_flags
+ */
+#define IORING_FSYNC_DATASYNC	(1U << 0)
 
 /*
  * IO completion data structure (Completion Queue Entry)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 07/18] io_uring: support for IO polling
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (5 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 20:47   ` Jann Horn
  2019-01-29 19:26 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

Add support for a polled io_uring context. When a read or write is
submitted to a polled context, the application must poll for completions
on the CQ ring through io_uring_enter(2). Polled IO may not generate
IRQ completions, hence completions need to be actively found and reaped
by the application itself.

To use polling, io_uring_setup() must be used with the
IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
polled and non-polled IO on an io_uring.
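
For reference, a minimal userspace flow for a polled ring could look
roughly like the below (error handling omitted; QD, to_submit and
min_complete are placeholders, and io_uring_setup()/io_uring_enter()
are assumed to be thin syscall(2) wrappers):

	struct io_uring_params p;
	int ring_fd;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_IOPOLL;
	ring_fd = io_uring_setup(QD, &p);

	/* ... mmap the SQ/CQ rings, queue O_DIRECT reads/writes ... */

	/* completions are only found by polling, so always ask for events */
	io_uring_enter(ring_fd, to_submit, min_complete,
		       IORING_ENTER_GETEVENTS, NULL, 0);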

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 243 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/io_uring.h |   5 +
 2 files changed, 240 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c75a3e197ed5..a4f4d75609d5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -102,6 +102,8 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
+		bool			poll_multi_file;
+		struct list_head	poll_list;
 	} ____cacheline_aligned_in_smp;
 };
 
@@ -120,12 +122,15 @@ struct io_kiocb {
 	struct list_head	list;
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+#define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 	u64			user_data;
+	u64			error;
 
 	struct work_struct	work;
 };
 
 #define IO_PLUG_THRESHOLD		2
+#define IO_IOPOLL_BATCH			8
 
 static struct kmem_cache *req_cachep;
 
@@ -157,6 +162,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
+	INIT_LIST_HEAD(&ctx->poll_list);
 	return ctx;
 }
 
@@ -251,12 +257,154 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
 	return NULL;
 }
 
+static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
+{
+	if (*nr) {
+		kmem_cache_free_bulk(req_cachep, *nr, reqs);
+		io_ring_drop_ctx_refs(ctx, *nr);
+		*nr = 0;
+	}
+}
+
 static void io_free_req(struct io_kiocb *req)
 {
 	io_ring_drop_ctx_refs(req->ctx, 1);
 	kmem_cache_free(req_cachep, req);
 }
 
+/*
+ * Find and free completed poll iocbs
+ */
+static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
+			       struct list_head *done)
+{
+	void *reqs[IO_IOPOLL_BATCH];
+	struct io_kiocb *req;
+	int to_free = 0;
+
+	while (!list_empty(done)) {
+		req = list_first_entry(done, struct io_kiocb, list);
+		list_del(&req->list);
+
+		io_cqring_fill_event(ctx, req->user_data, req->error, 0);
+
+		reqs[to_free++] = req;
+		(*nr_events)++;
+
+		fput(req->rw.ki_filp);
+		if (to_free == ARRAY_SIZE(reqs))
+			io_free_req_many(ctx, reqs, &to_free);
+	}
+	io_commit_cqring(ctx);
+
+	if (to_free)
+		io_free_req_many(ctx, reqs, &to_free);
+}
+
+static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
+			long min)
+{
+	struct io_kiocb *req, *tmp;
+	LIST_HEAD(done);
+	bool spin;
+	int ret;
+
+	/*
+	 * Only spin for completions if we don't have multiple devices hanging
+	 * off our complete list, and we're under the requested amount.
+	 */
+	spin = !ctx->poll_multi_file && *nr_events < min;
+
+	ret = 0;
+	list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) {
+		struct kiocb *kiocb = &req->rw;
+
+		/*
+		 * Move completed entries to our local list. If we find a
+		 * request that requires polling, break out and complete
+		 * the done list first, if we have entries there.
+		 */
+		if (req->flags & REQ_F_IOPOLL_COMPLETED) {
+			list_move_tail(&req->list, &done);
+			continue;
+		}
+		if (!list_empty(&done))
+			break;
+
+		ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin);
+		if (ret < 0)
+			break;
+
+		if (ret && spin)
+			spin = false;
+		ret = 0;
+	}
+
+	if (!list_empty(&done))
+		io_iopoll_complete(ctx, nr_events, &done);
+
+	return ret;
+}
+
+/*
+ * Poll for a minimum of 'min' events. Note that if min == 0 we consider that a
+ * non-spinning poll check - we'll still enter the driver poll loop, but only
+ * as a non-spinning completion check.
+ */
+static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events,
+				long min)
+{
+	while (!list_empty(&ctx->poll_list)) {
+		int ret;
+
+		ret = io_do_iopoll(ctx, nr_events, min);
+		if (ret < 0)
+			return ret;
+		if (!min || *nr_events >= min)
+			return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * We can't just wait for polled events to come to us, we have to actively
+ * find and complete them.
+ */
+static void io_iopoll_reap_events(struct io_ring_ctx *ctx)
+{
+	if (!(ctx->flags & IORING_SETUP_IOPOLL))
+		return;
+
+	mutex_lock(&ctx->uring_lock);
+	while (!list_empty(&ctx->poll_list)) {
+		unsigned int nr_events = 0;
+
+		io_iopoll_getevents(ctx, &nr_events, 1);
+	}
+	mutex_unlock(&ctx->uring_lock);
+}
+
+static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
+			   long min)
+{
+	int ret = 0;
+
+	do {
+		int tmin = 0;
+
+		if (*nr_events < min)
+			tmin = min - *nr_events;
+
+		ret = io_iopoll_getevents(ctx, nr_events, tmin);
+		if (ret <= 0)
+			break;
+		ret = 0;
+	} while (!*nr_events || !need_resched());
+
+	return ret;
+}
+
 static void kiocb_end_write(struct kiocb *kiocb)
 {
 	if (kiocb->ki_flags & IOCB_WRITE) {
@@ -283,9 +431,57 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 	io_free_req(req);
 }
 
+static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	req->error = res;
+	if (res != -EAGAIN)
+		req->flags |= REQ_F_IOPOLL_COMPLETED;
+}
+
+/*
+ * After the iocb has been issued, it's safe to be found on the poll list.
+ * Adding the kiocb to the list AFTER submission ensures that we don't
+ * find it from an io_iopoll_getevents() thread before the issuer is done
+ * accessing the kiocb cookie.
+ */
+static void io_iopoll_req_issued(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	/*
+	 * Track whether we have multiple files in our lists. This will impact
+	 * how we do polling eventually, not spinning if we're on potentially
+	 * different devices.
+	 */
+	if (list_empty(&ctx->poll_list)) {
+		ctx->poll_multi_file = false;
+	} else if (!ctx->poll_multi_file) {
+		struct io_kiocb *list_req;
+
+		list_req = list_first_entry(&ctx->poll_list, struct io_kiocb,
+						list);
+		if (list_req->rw.ki_filp != req->rw.ki_filp)
+			ctx->poll_multi_file = true;
+	}
+
+	/*
+	 * For fast devices, IO may have already completed. If it has, add
+	 * it to the front so we find it first.
+	 */
+	if (req->flags & REQ_F_IOPOLL_COMPLETED)
+		list_add(&req->list, &ctx->poll_list);
+	else
+		list_add_tail(&req->list, &ctx->poll_list);
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      bool force_nonblock)
 {
+	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
 	unsigned ioprio;
 	int fd, ret;
@@ -315,12 +511,21 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		kiocb->ki_flags |= IOCB_NOWAIT;
 		req->flags |= REQ_F_FORCE_NONBLOCK;
 	}
-	if (kiocb->ki_flags & IOCB_HIPRI) {
-		ret = -EINVAL;
-		goto out_fput;
-	}
+	if (ctx->flags & IORING_SETUP_IOPOLL) {
+		ret = -EOPNOTSUPP;
+		if (!(kiocb->ki_flags & IOCB_DIRECT) ||
+		    !kiocb->ki_filp->f_op->iopoll)
+			goto out_fput;
 
-	kiocb->ki_complete = io_complete_rw;
+		kiocb->ki_flags |= IOCB_HIPRI;
+		kiocb->ki_complete = io_complete_rw_iopoll;
+	} else {
+		if (kiocb->ki_flags & IOCB_HIPRI) {
+			ret = -EINVAL;
+			goto out_fput;
+		}
+		kiocb->ki_complete = io_complete_rw;
+	}
 	return 0;
 out_fput:
 	fput(kiocb->ki_filp);
@@ -469,6 +674,9 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 
+	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+
 	io_cqring_add_event(ctx, user_data, 0, 0);
 	io_free_req(req);
 	return 0;
@@ -489,6 +697,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (force_nonblock)
 		return -EAGAIN;
 
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio))
 		return -EINVAL;
 
@@ -540,7 +750,16 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		break;
 	}
 
-	return ret;
+	if (ret)
+		return ret;
+
+	if (ctx->flags & IORING_SETUP_IOPOLL) {
+		if (req->error == -EAGAIN)
+			return -EAGAIN;
+		io_iopoll_req_issued(req);
+	}
+
+	return 0;
 }
 
 static void io_sq_wq_submit_work(struct work_struct *work)
@@ -763,6 +982,8 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
 			return submitted;
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
+		unsigned nr_events = 0;
+
 		/*
 		 * The application could have included the 'to_submit' count
 		 * in how many events it wanted to wait for. If we failed to
@@ -772,7 +993,10 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
 		if (submitted < to_submit)
 			min_complete = min_t(unsigned, submitted, min_complete);
 
-		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+		if (ctx->flags & IORING_SETUP_IOPOLL)
+			ret = io_iopoll_check(ctx, &nr_events, min_complete);
+		else
+			ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
 	}
 
 	return submitted ? submitted : ret;
@@ -873,6 +1097,8 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	mmdrop(ctx->sqo_mm);
 	put_files_struct(ctx->sqo_files);
 
+	io_iopoll_reap_events(ctx);
+
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
 	io_mem_free(ctx->cq_ring);
@@ -910,6 +1136,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);
 
+	io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
 	io_ring_ctx_free(ctx);
 }
@@ -1131,7 +1358,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			return -EINVAL;
 	}
 
-	if (p.flags)
+	if (p.flags & ~IORING_SETUP_IOPOLL)
 		return -EINVAL;
 
 	ret = io_uring_create(entries, &p);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0fca46f8fc37..4952fc921866 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -30,6 +30,11 @@ struct io_uring_sqe {
 	__u64	__pad2[3];
 };
 
+/*
+ * io_uring_setup() flags
+ */
+#define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
+
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 08/18] fs: add fget_many() and fput_many()
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (6 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

Some use cases repeatedly get and put references to the same file, but
the only exposed interface is doing these one at a time. As each of
these entails an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.
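
As a sketch of the intended usage (nr_ios is a placeholder for however
many operations the caller knows it will issue against the same fd):

	struct file *file;

	file = fget_many(fd, nr_ios);	/* one atomic add instead of nr_ios incs */
	if (!file)
		return -EBADF;

	/* ... issue nr_ios operations against 'file' ... */

	fput_many(file, nr_ios);	/* drop all references in one atomic sub */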

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/file.c            | 15 ++++++++++-----
 fs/file_table.c      |  9 +++++++--
 include/linux/file.h |  2 ++
 include/linux/fs.h   |  4 +++-
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 3209ee271c41..97df385d6ab0 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
@@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 		 */
 		if (file->f_mode & mask)
 			file = NULL;
-		else if (!get_file_rcu(file))
+		else if (!get_file_rcu_many(file, refs))
 			goto loop;
 	}
 	rcu_read_unlock();
@@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 	return file;
 }
 
+struct file *fget_many(unsigned int fd, unsigned int refs)
+{
+	return __fget(fd, FMODE_PATH, refs);
+}
+
 struct file *fget(unsigned int fd)
 {
-	return __fget(fd, FMODE_PATH);
+	return __fget(fd, FMODE_PATH, 1);
 }
 EXPORT_SYMBOL(fget);
 
 struct file *fget_raw(unsigned int fd)
 {
-	return __fget(fd, 0);
+	return __fget(fd, 0, 1);
 }
 EXPORT_SYMBOL(fget_raw);
 
@@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget(fd, mask, 1);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/fs/file_table.c b/fs/file_table.c
index 5679e7fcb6b0..155d7514a094 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -326,9 +326,9 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
-void fput(struct file *file)
+void fput_many(struct file *file, unsigned int refs)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (atomic_long_sub_and_test(refs, &file->f_count)) {
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@@ -347,6 +347,11 @@ void fput(struct file *file)
 	}
 }
 
+void fput(struct file *file)
+{
+	fput_many(file, 1);
+}
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
diff --git a/include/linux/file.h b/include/linux/file.h
index 6b2fb032416c..3fcddff56bc4 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -13,6 +13,7 @@
 struct file;
 
 extern void fput(struct file *);
+extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
 struct vfsmount;
@@ -44,6 +45,7 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
+extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ccb0b7a63aa5..acaad78b6781 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f)
 	atomic_long_inc(&f->f_count);
 	return f;
 }
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+#define get_file_rcu_many(x, cnt)	\
+	atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+#define get_file_rcu(x) get_file_rcu_many((x), 1)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
 #define file_count(x)	atomic_long_read(&(x)->f_count)
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 09/18] io_uring: use fget/fput_many() for file references
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (7 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 23:31   ` Jann Horn
  2019-01-29 19:26 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

Add a separate io_submit_state structure, to cache some of the things
we need for IO submission.

One such example is file reference batching: we get as many references
as the number of sqes we are submitting, and drop unused ones if we end
up switching files. The assumption here is that we're usually only
dealing with one fd, and if there are multiple, hopefully they are at
least somewhat ordered. This could trivially be extended to cover
multiple fds, if needed.

On the completion side we do the same thing, except this is trivially
done just locally in io_iopoll_complete().
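
Paraphrasing io_ring_submit() from the diff below (error handling and
the submit count elided), the submission side flow becomes:

	struct io_submit_state state;
	struct sqe_submit s;
	int i;

	io_submit_state_start(&state, ctx, to_submit);
	for (i = 0; i < to_submit; i++) {
		if (!io_get_sqring(ctx, &s))
			break;
		/*
		 * io_prep_rw() -> io_file_get(&state, fd) reuses the cached
		 * file reference as long as the fd doesn't change
		 */
		io_submit_sqe(ctx, &s, &state);
	}
	io_submit_state_end(&state);	/* drops any unused file references */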

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 139 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 118 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a4f4d75609d5..a4c3c91abc76 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -132,6 +132,19 @@ struct io_kiocb {
 #define IO_PLUG_THRESHOLD		2
 #define IO_IOPOLL_BATCH			8
 
+struct io_submit_state {
+	struct blk_plug plug;
+
+	/*
+	 * File reference cache
+	 */
+	struct file *file;
+	unsigned int fd;
+	unsigned int has_refs;
+	unsigned int used_refs;
+	unsigned int ios_left;
+};
+
 static struct kmem_cache *req_cachep;
 
 static const struct file_operations io_uring_fops;
@@ -279,9 +292,11 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 			       struct list_head *done)
 {
 	void *reqs[IO_IOPOLL_BATCH];
+	int file_count, to_free;
+	struct file *file = NULL;
 	struct io_kiocb *req;
-	int to_free = 0;
 
+	file_count = to_free = 0;
 	while (!list_empty(done)) {
 		req = list_first_entry(done, struct io_kiocb, list);
 		list_del(&req->list);
@@ -291,12 +306,28 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		reqs[to_free++] = req;
 		(*nr_events)++;
 
-		fput(req->rw.ki_filp);
+		/*
+		 * Batched puts of the same file, to avoid dirtying the
+		 * file usage count multiple times, if avoidable.
+		 */
+		if (!file) {
+			file = req->rw.ki_filp;
+			file_count = 1;
+		} else if (file == req->rw.ki_filp) {
+			file_count++;
+		} else {
+			fput_many(file, file_count);
+			file = req->rw.ki_filp;
+			file_count = 1;
+		}
+
 		if (to_free == ARRAY_SIZE(reqs))
 			io_free_req_many(ctx, reqs, &to_free);
 	}
 	io_commit_cqring(ctx);
 
+	if (file)
+		fput_many(file, file_count);
 	if (to_free)
 		io_free_req_many(ctx, reqs, &to_free);
 }
@@ -478,8 +509,50 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
 		list_add_tail(&req->list, &ctx->poll_list);
 }
 
+static void io_file_put(struct io_submit_state *state, struct file *file)
+{
+	if (!state) {
+		fput(file);
+	} else if (state->file) {
+		int diff = state->has_refs - state->used_refs;
+
+		if (diff)
+			fput_many(state->file, diff);
+		state->file = NULL;
+	}
+}
+
+/*
+ * Get as many references to a file as we have IOs left in this submission,
+ * assuming most submissions are for one file, or at least that each file
+ * has more than one submission.
+ */
+static struct file *io_file_get(struct io_submit_state *state, int fd)
+{
+	if (!state)
+		return fget(fd);
+
+	if (state->file) {
+		if (state->fd == fd) {
+			state->used_refs++;
+			state->ios_left--;
+			return state->file;
+		}
+		io_file_put(state, NULL);
+	}
+	state->file = fget_many(fd, state->ios_left);
+	if (!state->file)
+		return NULL;
+
+	state->fd = fd;
+	state->has_refs = state->ios_left;
+	state->used_refs = 1;
+	state->ios_left--;
+	return state->file;
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      bool force_nonblock)
+		      bool force_nonblock, struct io_submit_state *state)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
@@ -487,7 +560,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	int fd, ret;
 
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = fget(fd);
+	kiocb->ki_filp = io_file_get(state, fd);
 	if (unlikely(!kiocb->ki_filp))
 		return -EBADF;
 	kiocb->ki_pos = READ_ONCE(sqe->off);
@@ -528,7 +601,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	fput(kiocb->ki_filp);
+	io_file_put(state, kiocb->ki_filp);
 	return ret;
 }
 
@@ -570,7 +643,7 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 }
 
 static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		       bool force_nonblock)
+		       bool force_nonblock, struct io_submit_state *state)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw;
@@ -578,7 +651,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct file *file;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, sqe, force_nonblock);
+	ret = io_prep_rw(req, sqe, force_nonblock, state);
 	if (ret)
 		return ret;
 	file = kiocb->ki_filp;
@@ -613,7 +686,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 }
 
 static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-			bool force_nonblock)
+			bool force_nonblock, struct io_submit_state *state)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw;
@@ -621,7 +694,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct file *file;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, sqe, force_nonblock);
+	ret = io_prep_rw(req, sqe, force_nonblock, state);
 	if (ret)
 		return ret;
 	file = kiocb->ki_filp;
@@ -721,7 +794,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 }
 
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			   const struct sqe_submit *s, bool force_nonblock)
+			   const struct sqe_submit *s, bool force_nonblock,
+			   struct io_submit_state *state)
 {
 	ssize_t ret;
 	int opcode;
@@ -737,10 +811,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
-		ret = io_read(req, s->sqe, force_nonblock);
+		ret = io_read(req, s->sqe, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
-		ret = io_write(req, s->sqe, force_nonblock);
+		ret = io_write(req, s->sqe, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
 		ret = io_fsync(req, s->sqe, force_nonblock);
@@ -788,7 +862,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	use_mm(ctx->sqo_mm);
 	set_fs(USER_DS);
 
-	ret = __io_submit_sqe(ctx, req, s, false);
+	ret = __io_submit_sqe(ctx, req, s, false, NULL);
 
 	set_fs(old_fs);
 	unuse_mm(ctx->sqo_mm);
@@ -804,7 +878,8 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	task_unlock(current);
 }
 
-static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s)
+static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
+			 struct io_submit_state *state)
 {
 	struct io_kiocb *req;
 	ssize_t ret;
@@ -817,7 +892,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s)
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	ret = __io_submit_sqe(ctx, req, s, true);
+	ret = __io_submit_sqe(ctx, req, s, true, state);
 	if (ret == -EAGAIN) {
 		memcpy(&req->submit, s, sizeof(*s));
 		INIT_WORK(&req->work, io_sq_wq_submit_work);
@@ -830,6 +905,26 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s)
 	return ret;
 }
 
+/*
+ * Batched submission is done, ensure local IO is flushed out.
+ */
+static void io_submit_state_end(struct io_submit_state *state)
+{
+	blk_finish_plug(&state->plug);
+	io_file_put(state, NULL);
+}
+
+/*
+ * Start submission side cache.
+ */
+static void io_submit_state_start(struct io_submit_state *state,
+				  struct io_ring_ctx *ctx, unsigned max_ios)
+{
+	blk_start_plug(&state->plug);
+	state->file = NULL;
+	state->ios_left = max_ios;
+}
+
 static void io_commit_sqring(struct io_ring_ctx *ctx)
 {
 	struct io_sq_ring *ring = ctx->sq_ring;
@@ -892,11 +987,13 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
+	struct io_submit_state state, *statep = NULL;
 	int i, ret = 0, submit = 0;
-	struct blk_plug plug;
 
-	if (to_submit > IO_PLUG_THRESHOLD)
-		blk_start_plug(&plug);
+	if (to_submit > IO_PLUG_THRESHOLD) {
+		io_submit_state_start(&state, ctx, to_submit);
+		statep = &state;
+	}
 
 	for (i = 0; i < to_submit; i++) {
 		struct sqe_submit s;
@@ -904,7 +1001,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		if (!io_get_sqring(ctx, &s))
 			break;
 
-		ret = io_submit_sqe(ctx, &s);
+		ret = io_submit_sqe(ctx, &s, statep);
 		if (ret) {
 			io_drop_sqring(ctx);
 			break;
@@ -914,8 +1011,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 	}
 	io_commit_sqring(ctx);
 
-	if (to_submit > IO_PLUG_THRESHOLD)
-		blk_finish_plug(&plug);
+	if (statep)
+		io_submit_state_end(statep);
 
 	return submit ? submit : ret;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 10/18] io_uring: batch io_kiocb allocation
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (8 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

Similarly to how we use the state->ios_left to know how many references
to get to a file, we can use it to allocate the io_kiocb's we need in
bulk.
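
For reference, the slab bulk API this builds on is used roughly like the
following (sketch only; 'nr_left' is a placeholder for however many
objects are still unused when the batch ends):

	void *reqs[IO_IOPOLL_BATCH];
	int nr;

	/* grab up to ARRAY_SIZE(reqs) objects in one call, returns how many */
	nr = kmem_cache_alloc_bulk(req_cachep, __GFP_NOWARN, ARRAY_SIZE(reqs), reqs);
	if (nr <= 0)
		return NULL;		/* fall back to kmem_cache_alloc() */

	/* ... hand out reqs[0..nr-1] one submission at a time ... */

	/* give back whatever wasn't consumed */
	kmem_cache_free_bulk(req_cachep, nr_left, &reqs[nr - nr_left]);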

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 45 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a4c3c91abc76..eb7deca41cf7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -135,6 +135,13 @@ struct io_kiocb {
 struct io_submit_state {
 	struct blk_plug plug;
 
+	/*
+	 * io_kiocb alloc cache
+	 */
+	void *reqs[IO_IOPOLL_BATCH];
+	unsigned int free_reqs;
+	unsigned int cur_req;
+
 	/*
 	 * File reference cache
 	 */
@@ -252,20 +259,40 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
 		wake_up(&ctx->wait);
 }
 
-static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
+				   struct io_submit_state *state)
 {
 	struct io_kiocb *req;
 
 	if (!percpu_ref_tryget(&ctx->refs))
 		return NULL;
 
-	req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
-	if (req) {
-		req->ctx = ctx;
-		req->flags = 0;
-		return req;
+	if (!state) {
+		req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
+		if (unlikely(!req))
+			goto out;
+	} else if (!state->free_reqs) {
+		size_t sz;
+		int ret;
+
+		sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->reqs));
+		ret = kmem_cache_alloc_bulk(req_cachep, __GFP_NOWARN, sz,
+						state->reqs);
+		if (unlikely(ret <= 0))
+			goto out;
+		state->free_reqs = ret - 1;
+		state->cur_req = 1;
+		req = state->reqs[0];
+	} else {
+		req = state->reqs[state->cur_req];
+		state->free_reqs--;
+		state->cur_req++;
 	}
 
+	req->ctx = ctx;
+	req->flags = 0;
+	return req;
+out:
 	io_ring_drop_ctx_refs(ctx, 1);
 	return NULL;
 }
@@ -888,7 +915,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	if (unlikely(s->sqe->flags))
 		return -EINVAL;
 
-	req = io_get_req(ctx);
+	req = io_get_req(ctx, state);
 	if (unlikely(!req))
 		return -EAGAIN;
 
@@ -912,6 +939,9 @@ static void io_submit_state_end(struct io_submit_state *state)
 {
 	blk_finish_plug(&state->plug);
 	io_file_put(state, NULL);
+	if (state->free_reqs)
+		kmem_cache_free_bulk(req_cachep, state->free_reqs,
+					&state->reqs[state->cur_req]);
 }
 
 /*
@@ -921,6 +951,7 @@ static void io_submit_state_start(struct io_submit_state *state,
 				  struct io_ring_ctx *ctx, unsigned max_ios)
 {
 	blk_start_plug(&state->plug);
+	state->free_reqs = 0;
 	state->file = NULL;
 	state->ios_left = max_ios;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (9 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. This requires that the caller doesn't release
the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that.

The current two callers of bio_iov_iter_get_pages() are updated to
check if they need to release pages on completion. This makes them
work with bvecs that contain kernel mapped pages already.
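
A hedged sketch of both halves of that contract ('bvecs', 'nr_bvecs' and
'total_len' are placeholders describing pages the submitter already holds
references to):

	struct iov_iter iter;
	struct bio_vec *bvec;
	int ret, i;

	/* submission: feed already-referenced kernel pages to the bio */
	iov_iter_bvec(&iter, WRITE, bvecs, nr_bvecs, total_len);
	ret = bio_iov_iter_get_pages(bio, &iter);

	/* completion: only drop page references if the bio took them */
	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
		bio_for_each_segment_all(bvec, bio, i)
			put_page(bvec->bv_page);
	bio_put(bio);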

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
 fs/block_dev.c            |  5 ++--
 fs/iomap.c                |  5 ++--
 include/linux/blk_types.h |  1 +
 4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..330df572cfb8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	len = min_t(size_t, bv->bv_len, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
@@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 
 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
+ *
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. If we're adding kernel pages, we just have to add the pages to the
+ * bio directly. We don't grab an extra reference to those pages (the user
+ * should already have that), and we don't put the page on IO completion.
+ * The caller needs to check if the bio is flagged BIO_NO_PAGE_REF on IO
+ * completion. If it isn't, then pages should be released.
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
  * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;
 
+	/*
+	 * If this is a BVEC iter, then the pages are kernel pages. Don't
+	 * release them on IO completion.
+	 */
+	if (is_bvec)
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+
 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);
 
 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
@@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work)
 		next = bio->bi_private;
 
 		bio_set_pages_dirty(bio);
-		bio_release_pages(bio);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_release_pages(bio);
 		bio_put(bio);
 	}
 }
@@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio)
 			goto defer;
 	}
 
-	bio_release_pages(bio);
+	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+		bio_release_pages(bio);
 	bio_put(bio);
 	return;
 defer:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 392e2bfb636f..051ab41d1c61 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -338,8 +338,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 4ee50b76b4a1..e5c48a0b20e0 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..791fee35df88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -215,6 +215,7 @@ struct bio {
 /*
  * bio flags
  */
+#define BIO_NO_PAGE_REF	0	/* don't put/release bvec pages */
 #define BIO_SEG_VALID	1	/* bi_phys_segments valid */
 #define BIO_CLONED	2	/* doesn't own data */
 #define BIO_BOUNCED	3	/* bio is a bounce bio */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (10 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 22:44   ` Jann Horn
  2019-01-29 19:26 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having setup an io_uring context, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
to an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. The range
sqe->addr..sqe->addr+sqe->len must point to somewhere inside the
indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring context. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring context.

It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary per-buffer size limit of 1G is also imposed.
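
Putting it together, a hypothetical userspace sequence (placeholder
names, no error handling, io_uring_register() assumed to be a thin
syscall(2) wrapper) could look like:

	struct iovec iov = {
		.iov_base = buf,	/* anonymous memory, e.g. from malloc() */
		.iov_len  = BUF_SIZE,
	};

	io_uring_register(ring_fd, IORING_REGISTER_BUFFERS, &iov, 1);

	/* per IO: read into a slice of the registered buffer */
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = file_fd;
	sqe->off = file_offset;
	sqe->addr = (unsigned long) buf;	/* must fall inside the iovec */
	sqe->len = 4096;
	sqe->buf_index = 0;			/* index into the registered array */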

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 365 ++++++++++++++++++++++++-
 include/linux/sched/user.h             |   2 +-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 8 files changed, 373 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index eb7deca41cf7..17c869f3ea2f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -25,10 +25,12 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
+#include <linux/sizes.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -58,6 +60,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -91,6 +100,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -653,12 +666,56 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	size_t len = READ_ONCE(sqe->len);
+	struct io_mapped_ubuf *imu;
+	int buf_index, index;
+	size_t offset;
+	u64 buf_addr;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	buf_addr = READ_ONCE(sqe->addr);
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = buf_addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct io_uring_sqe *sqe, struct iovec **iovec,
 			   struct iov_iter *iter)
 {
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
+	int opcode;
+
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
 
 #ifdef CONFIG_COMPAT
 	if (ctx->compat)
@@ -799,7 +856,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fsync_flags = READ_ONCE(sqe->fsync_flags);
@@ -838,9 +895,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, s->sqe, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, s->sqe, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, s->sqe, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, s->sqe, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -863,16 +930,30 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	return !(sqe->opcode == IORING_OP_READ_FIXED ||
+		 sqe->opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
-	u64 user_data = READ_ONCE(s->sqe->user_data);
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
 	struct files_struct *old_files;
+	struct io_uring_sqe sqe;
+	mm_segment_t old_fs;
+	bool needs_user;
 	int ret;
 
+	/*
+	 * Ensure sqe is stable between checking if we need user access,
+	 * and actually importing the iovec further down the stack.
+	 */
+	memcpy(&sqe, s->sqe, sizeof(sqe));
+	s->sqe = &sqe;
+
 	 /* Ensure we clear previously set forced non-block flag */
 	req->flags &= ~REQ_F_FORCE_NONBLOCK;
 
@@ -881,22 +962,31 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	current->files = ctx->sqo_files;
 	task_unlock(current);
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(&sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
 	}
 
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-
 	ret = __io_submit_sqe(ctx, req, s, false, NULL);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
-		io_cqring_add_event(ctx, user_data, ret, 0);
+		io_cqring_add_event(ctx, sqe.user_data, ret, 0);
 		io_free_req(req);
 	}
 
@@ -1206,6 +1296,14 @@ static void *io_mem_alloc(size_t size)
 	return (void *) __get_free_pages(gfp_flags, get_order(size));
 }
 
+static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		return __io_account_mem(ctx->user, nr_pages);
+
+	return 0;
+}
+
 static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 {
 	struct io_sq_ring *sq_ring;
@@ -1219,6 +1317,190 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->sq_entries; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		io_unaccount_mem(ctx, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	free_uid(ctx->user);
+	ctx->user = NULL;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall()) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct vm_area_struct **vmas = NULL;
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	if (!capable(CAP_IPC_LOCK))
+		ctx->user = get_uid(current_user());
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		ret = io_account_mem(ctx, nr_pages);
+		if (ret)
+			goto err;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(vmas);
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			vmas = kmalloc_array(nr_pages,
+					sizeof(struct vm_area_struct *),
+					GFP_KERNEL);
+			if (!pages || !vmas) {
+				io_unaccount_mem(ctx, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		if (!imu->bvec) {
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		down_write(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
+						pages, vmas);
+		if (pret == nr_pages) {
+			/* don't support file backed memory */
+			for (j = 0; j < nr_pages; j++) {
+				struct vm_area_struct *vma = vmas[j];
+
+				if (vma->vm_file) {
+					ret = -EOPNOTSUPP;
+					break;
+				}
+			}
+		} else {
+			ret = pret < 0 ? pret : -EFAULT;
+		}
+		up_write(&current->mm->mmap_sem);
+		if (ret) {
+			/*
+			 * if we did partial map, or found file backed vmas,
+			 * release any pages we did get
+			 */
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	kfree(vmas);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	kfree(vmas);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	destroy_workqueue(ctx->sqo_wq);
@@ -1226,6 +1508,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	put_files_struct(ctx->sqo_files);
 
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
@@ -1505,6 +1788,62 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_resurrect(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ctx = f.file->private_data;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&ctx->uring_lock)) {
+		ret = __io_uring_register(ctx, opcode, arg, nr_args);
+		mutex_unlock(&ctx->uring_lock);
+	}
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4952fc921866..16c423d74f2e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,7 +27,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -39,6 +42,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -102,4 +107,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 13/18] io_uring: add file set registration
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (11 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-30  1:29   ` Jann Horn
  2019-01-29 19:26 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The arguments passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring context (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring context is
torn down. An application need only unregister if it wishes to
register a new set of fds.
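
A hypothetical userspace usage (placeholder names, no error handling):

	__s32 fds[2] = { fd_a, fd_b };

	io_uring_register(ring_fd, IORING_REGISTER_FILES, fds, 2);

	/* per IO: sqe->fd is now an index into 'fds', not a descriptor */
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;
	sqe->fd = 1;				/* i.e. fd_b */
	sqe->addr = (unsigned long) iovecs;
	sqe->len = nr_iovecs;
	sqe->off = offset;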

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 138 +++++++++++++++++++++++++++++-----
 include/uapi/linux/io_uring.h |   9 ++-
 2 files changed, 127 insertions(+), 20 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 17c869f3ea2f..13c3f8212815 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -100,6 +100,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -136,6 +144,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -350,15 +359,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -491,13 +502,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -596,11 +613,22 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+	}
+
 	if (unlikely(!kiocb->ki_filp))
 		return -EBADF;
 	kiocb->ki_pos = READ_ONCE(sqe->off);
@@ -641,7 +669,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE))
+		io_file_put(state, kiocb->ki_filp);
 	return ret;
 }
 
@@ -765,7 +794,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -820,7 +849,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -846,7 +875,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	loff_t sqe_off = READ_ONCE(sqe->off);
 	loff_t sqe_len = READ_ONCE(sqe->len);
 	loff_t end = sqe_off + sqe_len;
-	unsigned fsync_flags;
+	unsigned fsync_flags, flags;
 	struct file *file;
 	int ret, fd;
 
@@ -864,14 +893,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	file = fget(fd);
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		file = ctx->user_files[fd];
+	} else {
+		file = fget(fd);
+	}
 	if (unlikely(!file))
 		return -EBADF;
 
 	ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(file);
+	if (!(flags & IOSQE_FIXED_FILE))
+		fput(file);
 	io_cqring_add_event(ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1002,7 +1040,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1220,6 +1258,58 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
 	return submitted ? submitted : ret;
 }
 
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	int i;
+
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	ctx->nr_user_files = 0;
+	return 0;
+}
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1509,6 +1599,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 
 	io_iopoll_reap_events(ctx);
 	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
 
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
@@ -1806,6 +1897,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 16c423d74f2e..3e79feb34a9c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -112,5 +117,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 14/18] io_uring: add submission polling
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (12 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:26 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

This enables an application to do IO without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.

By default, we allow 1 second of active spinning. This can be changed
by passing in a different grace period (sq_thread_idle, in msecs) at
io_uring_setup(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:

sq_ring->flags |= IORING_SQ_NEED_WAKEUP.

The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard its
io_uring_enter(2) call with:

read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

instead of calling it unconditionally.
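
Setup for this mode goes through the new io_uring_params fields added
below; a rough sketch, with the queue depth and the io_uring_setup()
wrapper assumed:

struct io_uring_params p = { 0 };

p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
p.sq_thread_cpu = 3;		/* example: pin io_uring-sq to CPU 3 */
p.sq_thread_idle = 2000;	/* example: idle for 2000 msec before sleeping */

ring_fd = io_uring_setup(QD, &p);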

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 243 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |  12 +-
 2 files changed, 247 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 13c3f8212815..328fb35b4df7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
+#include <linux/kthread.h>
 #include <linux/blkdev.h>
 #include <linux/bvec.h>
 #include <linux/anon_inodes.h>
@@ -81,14 +82,16 @@ struct io_ring_ctx {
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
 		unsigned		sq_mask;
-		unsigned		sq_thread_cpu;
+		unsigned		sq_thread_idle;
 		struct io_uring_sqe	*sq_sqes;
 	} ____cacheline_aligned_in_smp;
 
 	/* IO offload */
 	struct workqueue_struct	*sqo_wq;
+	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
 	struct files_struct	*sqo_files;
+	wait_queue_head_t	sqo_wait;
 
 	struct {
 		/* CQ ring */
@@ -271,6 +274,8 @@ static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
 
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->sqo_wait))
+		wake_up(&ctx->sqo_wait);
 }
 
 static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
@@ -1144,6 +1149,173 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	return false;
 }
 
+static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
+			  unsigned int nr, bool mm_fault)
+{
+	struct io_submit_state state, *statep = NULL;
+	int ret, i, submitted = 0;
+
+	if (nr > IO_PLUG_THRESHOLD) {
+		io_submit_state_start(&state, ctx, nr);
+		statep = &state;
+	}
+
+	for (i = 0; i < nr; i++) {
+		if (unlikely(mm_fault))
+			ret = -EFAULT;
+		else
+			ret = io_submit_sqe(ctx, &sqes[i], statep);
+		if (!ret) {
+			submitted++;
+			continue;
+		}
+
+		io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0);
+	}
+
+	if (statep)
+		io_submit_state_end(&state);
+
+	return submitted;
+}
+
+static int io_sq_thread(void *data)
+{
+	struct sqe_submit sqes[IO_IOPOLL_BATCH];
+	struct io_ring_ctx *ctx = data;
+	struct mm_struct *cur_mm = NULL;
+	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	DEFINE_WAIT(wait);
+	unsigned inflight;
+	unsigned long timeout;
+
+	task_lock(current);
+	old_files = current->files;
+	current->files = ctx->sqo_files;
+	task_unlock(current);
+
+	old_fs = get_fs();
+	set_fs(USER_DS);
+
+	timeout = inflight = 0;
+	while (!kthread_should_stop()) {
+		bool all_fixed, mm_fault = false;
+		int i;
+
+		if (inflight) {
+			unsigned nr_events = 0;
+
+			if (ctx->flags & IORING_SETUP_IOPOLL) {
+				/*
+				 * We disallow the app entering submit/complete
+				 * with polling, so no need to lock the ring.
+				 */
+				io_iopoll_check(ctx, &nr_events, 0);
+			} else {
+				/*
+				 * Normal IO, just pretend everything completed.
+				 * We don't have to poll completions for that.
+				 */
+				nr_events = inflight;
+			}
+
+			inflight -= nr_events;
+			if (!inflight)
+				timeout = jiffies + ctx->sq_thread_idle;
+		}
+
+		if (!io_get_sqring(ctx, &sqes[0])) {
+			/*
+			 * We're polling. If we're within the defined idle
+			 * period, then let us spin without work before going
+			 * to sleep.
+			 */
+			if (inflight || !time_after(jiffies, timeout)) {
+				cpu_relax();
+				continue;
+			}
+
+			/*
+			 * Drop cur_mm before scheduling, we can't hold it for
+			 * long periods (or over schedule()). Do this before
+			 * adding ourselves to the waitqueue, as the unuse/drop
+			 * may sleep.
+			 */
+			if (cur_mm) {
+				unuse_mm(cur_mm);
+				mmput(cur_mm);
+				cur_mm = NULL;
+			}
+
+			prepare_to_wait(&ctx->sqo_wait, &wait,
+						TASK_INTERRUPTIBLE);
+
+			/* Tell userspace we may need a wakeup call */
+			ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+
+			if (!io_get_sqring(ctx, &sqes[0])) {
+				if (kthread_should_park())
+					kthread_parkme();
+				if (kthread_should_stop()) {
+					finish_wait(&ctx->sqo_wait, &wait);
+					break;
+				}
+				if (signal_pending(current))
+					flush_signals(current);
+				schedule();
+				finish_wait(&ctx->sqo_wait, &wait);
+
+				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+				smp_wmb();
+				continue;
+			}
+			finish_wait(&ctx->sqo_wait, &wait);
+
+			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+		}
+
+		i = 0;
+		all_fixed = true;
+		do {
+			if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
+				all_fixed = false;
+
+			i++;
+			if (i == ARRAY_SIZE(sqes))
+				break;
+		} while (io_get_sqring(ctx, &sqes[i]));
+
+		io_commit_sqring(ctx);
+
+		/* Unless all new commands are FIXED regions, grab mm */
+		if (!all_fixed && !cur_mm) {
+			mm_fault = !mmget_not_zero(ctx->sqo_mm);
+			if (!mm_fault) {
+				use_mm(ctx->sqo_mm);
+				cur_mm = ctx->sqo_mm;
+			}
+		}
+
+		inflight += io_submit_sqes(ctx, sqes, i, mm_fault);
+	}
+
+	io_iopoll_reap_events(ctx);
+
+	task_lock(current);
+	current->files = old_files;
+	task_unlock(current);
+
+	set_fs(old_fs);
+	if (cur_mm) {
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
+	}
+	return 0;
+}
+
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
@@ -1229,6 +1401,17 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
 {
 	int submitted, ret;
 
+	/*
+	 * For SQ polling, the thread will do all submissions and completions.
+	 * Just return the requested submit count, and wake the thread if
+	 * we were asked to.
+	 */
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (flags & IORING_ENTER_SQ_WAKEUP)
+			wake_up(&ctx->sqo_wait);
+		return to_submit;
+	}
+
 	submitted = ret = 0;
 	if (to_submit) {
 		to_submit = min(to_submit, ctx->sq_entries);
@@ -1310,10 +1493,12 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-static int io_sq_offload_start(struct io_ring_ctx *ctx)
+static int io_sq_offload_start(struct io_ring_ctx *ctx,
+			       struct io_uring_params *p)
 {
 	int ret;
 
+	init_waitqueue_head(&ctx->sqo_wait);
 	mmgrab(current->mm);
 	ctx->sqo_mm = current->mm;
 
@@ -1322,6 +1507,38 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 	if (!ctx->sqo_files)
 		goto err;
 
+	ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle);
+	if (!ctx->sq_thread_idle)
+		ctx->sq_thread_idle = HZ;
+
+	ret = -EINVAL;
+	if (!cpu_possible(p->sq_thread_cpu))
+		goto err;
+
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (p->flags & IORING_SETUP_SQ_AFF) {
+			int cpu;
+
+			cpu = array_index_nospec(p->sq_thread_cpu, NR_CPUS);
+			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
+							ctx, cpu,
+							"io_uring-sq");
+		} else {
+			ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
+							"io_uring-sq");
+		}
+		if (IS_ERR(ctx->sqo_thread)) {
+			ret = PTR_ERR(ctx->sqo_thread);
+			ctx->sqo_thread = NULL;
+			goto err;
+		}
+		wake_up_process(ctx->sqo_thread);
+	} else if (p->flags & IORING_SETUP_SQ_AFF) {
+		/* Can't have SQ_AFF without SQPOLL */
+		ret = -EINVAL;
+		goto err;
+	}
+
 	/* Do QD, or 2 * CPUS, whatever is smallest */
 	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
 			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
@@ -1332,6 +1549,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 
 	return 0;
 err:
+	if (ctx->sqo_thread) {
+		kthread_park(ctx->sqo_thread);
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	if (ctx->sqo_files) {
 		put_files_struct(ctx->sqo_files);
 		ctx->sqo_files = NULL;
@@ -1593,11 +1815,16 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
+	if (ctx->sqo_thread) {
+		kthread_park(ctx->sqo_thread);
+		kthread_stop(ctx->sqo_thread);
+	}
 	destroy_workqueue(ctx->sqo_wq);
 	mmdrop(ctx->sqo_mm);
 	put_files_struct(ctx->sqo_files);
 
-	io_iopoll_reap_events(ctx);
+	if (!(ctx->flags & IORING_SETUP_SQPOLL))
+		io_iopoll_reap_events(ctx);
 	io_sqe_buffer_unregister(ctx);
 	io_sqe_files_unregister(ctx);
 
@@ -1638,7 +1865,8 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);
 
-	io_iopoll_reap_events(ctx);
+	if (!(ctx->flags & IORING_SETUP_SQPOLL))
+		io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
 	io_ring_ctx_free(ctx);
 }
@@ -1691,7 +1919,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	long ret = -EBADF;
 	struct fd f;
 
-	if (flags & ~IORING_ENTER_GETEVENTS)
+	if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP))
 		return -EINVAL;
 
 	f = fdget(fd);
@@ -1811,7 +2039,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
 	if (ret)
 		goto err;
 
-	ret = io_sq_offload_start(ctx);
+	ret = io_sq_offload_start(ctx, p);
 	if (ret)
 		goto err;
 
@@ -1860,7 +2088,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			return -EINVAL;
 	}
 
-	if (p.flags & ~IORING_SETUP_IOPOLL)
+	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
+			IORING_SETUP_SQ_AFF))
 		return -EINVAL;
 
 	ret = io_uring_create(entries, &p);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3e79feb34a9c..0d85da31e260 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -42,6 +42,8 @@ struct io_uring_sqe {
  * io_uring_setup() flags
  */
 #define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
+#define IORING_SETUP_SQPOLL	(1U << 1)	/* SQ poll thread */
+#define IORING_SETUP_SQ_AFF	(1U << 2)	/* sq_thread_cpu is valid */
 
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
@@ -85,6 +87,11 @@ struct io_sqring_offsets {
 	__u32 resv[3];
 };
 
+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
+
 struct io_cqring_offsets {
 	__u32 head;
 	__u32 tail;
@@ -99,6 +106,7 @@ struct io_cqring_offsets {
  * io_uring_enter(2) flags
  */
 #define IORING_ENTER_GETEVENTS	(1U << 0)
+#define IORING_ENTER_SQ_WAKEUP	(1U << 1)
 
 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
@@ -107,7 +115,9 @@ struct io_uring_params {
 	__u32 sq_entries;
 	__u32 cq_entries;
 	__u32 flags;
-	__u32 resv[7];
+	__u32 sq_thread_cpu;
+	__u32 sq_thread_idle;
+	__u32 resv[5];
 	struct io_sqring_offsets sq_off;
 	struct io_cqring_offsets cq_off;
 };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 15/18] io_uring: add io_kiocb ref count
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (13 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
@ 2019-01-29 19:26 ` Jens Axboe
  2019-01-29 19:27 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:26 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

We'll use this for the POLL implementation. Regular requests will
NOT be using references, so initialize it to 0. Any real use of
the io_kiocb ref will initialize it to at least 2.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 328fb35b4df7..bb215fe494f7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -145,6 +145,7 @@ struct io_kiocb {
 	struct io_ring_ctx	*ctx;
 	struct list_head	list;
 	unsigned int		flags;
+	refcount_t		refs;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
@@ -318,6 +319,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 
 	req->ctx = ctx;
 	req->flags = 0;
+	refcount_set(&req->refs, 0);
 	return req;
 out:
 	io_ring_drop_ctx_refs(ctx, 1);
@@ -335,8 +337,10 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
 
 static void io_free_req(struct io_kiocb *req)
 {
-	io_ring_drop_ctx_refs(req->ctx, 1);
-	kmem_cache_free(req_cachep, req);
+	if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) {
+		io_ring_drop_ctx_refs(req->ctx, 1);
+		kmem_cache_free(req_cachep, req);
+	}
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 16/18] io_uring: add support for IORING_OP_POLL
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (14 preceding siblings ...)
  2019-01-29 19:26 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
@ 2019-01-29 19:27 ` Jens Axboe
  2019-01-29 19:27 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
  2019-01-29 19:27 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:27 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

This is basically a direct port of bfe4037e722e, which implements a
one-shot poll command through aio. Description below is based on that
commit as well. However, instead of adding a POLL command and relying
on io_cancel(2) to remove it, we mimic the epoll(2) interface of
having a command to add a poll notification, IORING_OP_POLL_ADD,
and one to remove it again, IORING_OP_POLL_REMOVE.

To poll for a file descriptor the application should submit an sqe of
type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the
poll_events field.

Unlike poll or epoll without EPOLLONESHOT, this interface always works in
one-shot mode; that is, once the sqe has completed, it has to be
resubmitted.
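
A rough sketch of the submission side, with the sqe setup and the
user_data cookie being the application's own:

/* arm a one-shot poll for readability on 'fd' */
sqe->opcode = IORING_OP_POLL_ADD;
sqe->fd = fd;
sqe->poll_events = POLLIN;
sqe->user_data = 0xcafe;	/* echoed back in the cqe */

/* cancel an armed poll by submitting a remove keyed on that cookie */
sqe->opcode = IORING_OP_POLL_REMOVE;
sqe->addr = 0xcafe;

On completion, cqe->res holds the triggered event mask (or a negative
error for the remove case).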

Based-on-code-from: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 252 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |   3 +
 2 files changed, 255 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bb215fe494f7..f8c0bcc6d299 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -128,6 +128,7 @@ struct io_ring_ctx {
 		spinlock_t		completion_lock;
 		bool			poll_multi_file;
 		struct list_head	poll_list;
+		struct list_head	cancel_list;
 	} ____cacheline_aligned_in_smp;
 };
 
@@ -136,9 +137,19 @@ struct sqe_submit {
 	unsigned			index;
 };
 
+struct io_poll_iocb {
+	struct file *file;
+	struct wait_queue_head *head;
+	__poll_t events;
+	bool woken;
+	bool canceled;
+	struct wait_queue_entry wait;
+};
+
 struct io_kiocb {
 	union {
 		struct kiocb		rw;
+		struct io_poll_iocb	poll;
 		struct sqe_submit	submit;
 	};
 
@@ -209,6 +220,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
+	INIT_LIST_HEAD(&ctx->cancel_list);
 	return ctx;
 }
 
@@ -924,6 +936,239 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
+static void io_poll_remove_one(struct io_kiocb *req)
+{
+	struct io_poll_iocb *poll = &req->poll;
+
+	spin_lock(&poll->head->lock);
+	WRITE_ONCE(poll->canceled, true);
+	if (!list_empty(&poll->wait.entry)) {
+		list_del_init(&poll->wait.entry);
+		queue_work(req->ctx->sqo_wq, &req->work);
+	}
+	spin_unlock(&poll->head->lock);
+
+	list_del_init(&req->list);
+}
+
+static void io_poll_remove_all(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	spin_lock_irq(&ctx->completion_lock);
+	while (!list_empty(&ctx->cancel_list)) {
+		req = list_first_entry(&ctx->cancel_list, struct io_kiocb,list);
+		io_poll_remove_one(req);
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+}
+
+/*
+ * Find a running poll command that matches one specified in sqe->addr,
+ * and remove it if found.
+ */
+static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *poll_req, *next;
+	int ret = -ENOENT;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index ||
+	    sqe->poll_events)
+		return -EINVAL;
+
+	spin_lock_irq(&ctx->completion_lock);
+	list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) {
+		if (READ_ONCE(sqe->addr) == poll_req->user_data) {
+			io_poll_remove_one(poll_req);
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static void io_poll_complete(struct io_kiocb *req, __poll_t mask)
+{
+	io_cqring_add_event(req->ctx, req->user_data, mangle_poll(mask), 0);
+	io_fput(req);
+	io_free_req(req);
+}
+
+static void io_poll_complete_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct io_poll_iocb *poll = &req->poll;
+	struct poll_table_struct pt = { ._key = poll->events };
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = 0;
+
+	if (!READ_ONCE(poll->canceled))
+		mask = vfs_poll(poll->file, &pt) & poll->events;
+
+	/*
+	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
+	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
+	 * synchronize with them.  In the cancellation case the list_del_init
+	 * itself is not actually needed, but harmless so we keep it in to
+	 * avoid further branches in the fast path.
+	 */
+	spin_lock_irq(&ctx->completion_lock);
+	if (!mask && !READ_ONCE(poll->canceled)) {
+		add_wait_queue(poll->head, &poll->wait);
+		spin_unlock_irq(&ctx->completion_lock);
+		return;
+	}
+	list_del_init(&req->list);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_poll_complete(req, mask);
+}
+
+static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb,
+							wait);
+	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = key_to_poll(key);
+
+	poll->woken = true;
+
+	/* for instances that support it check for an event match first: */
+	if (mask) {
+		if (!(mask & poll->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&ctx->completion_lock)) {
+			list_del(&req->list);
+			spin_unlock(&ctx->completion_lock);
+
+			list_del_init(&poll->wait.entry);
+			io_poll_complete(req, mask);
+			return 1;
+		}
+	}
+
+	list_del_init(&poll->wait.entry);
+	queue_work(ctx->sqo_wq, &req->work);
+	return 1;
+}
+
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	if (unlikely(pt->req->poll.head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	pt->req->poll.head = head;
+	add_wait_queue(head, &pt->req->poll.wait);
+}
+
+static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_poll_iocb *poll = &req->poll;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_poll_table ipt;
+	unsigned flags;
+	__poll_t mask;
+	u16 events;
+	int fd;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index)
+		return -EINVAL;
+
+	INIT_WORK(&req->work, io_poll_complete_work);
+	events = READ_ONCE(sqe->poll_events);
+	poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
+
+	flags = READ_ONCE(sqe->flags);
+	fd = READ_ONCE(sqe->fd);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		poll->file = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		poll->file = fget(fd);
+	}
+	if (unlikely(!poll->file))
+		return -EBADF;
+
+	poll->head = NULL;
+	poll->woken = false;
+	poll->canceled = false;
+
+	ipt.pt._qproc = io_poll_queue_proc;
+	ipt.pt._key = poll->events;
+	ipt.req = req;
+	ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
+
+	/* initialized the list so that we can do list_empty checks */
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, io_poll_wake);
+
+	/* one for removal from waitqueue, one for this function */
+	refcount_set(&req->refs, 2);
+
+	mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
+	if (unlikely(!poll->head)) {
+		/* we did not manage to set up a waitqueue, done */
+		goto out;
+	}
+
+	spin_lock_irq(&ctx->completion_lock);
+	spin_lock(&poll->head->lock);
+	if (poll->woken) {
+		/* wake_up context handles the rest */
+		mask = 0;
+		ipt.error = 0;
+	} else if (mask || ipt.error) {
+		/* if we get an error or a mask we are done */
+		WARN_ON_ONCE(list_empty(&poll->wait.entry));
+		list_del_init(&poll->wait.entry);
+	} else {
+		/* actually waiting for an event */
+		list_add_tail(&req->list, &ctx->cancel_list);
+	}
+	spin_unlock(&poll->head->lock);
+	spin_unlock_irq(&ctx->completion_lock);
+
+out:
+	if (unlikely(ipt.error)) {
+		if (!(flags & IOSQE_FIXED_FILE))
+			fput(poll->file);
+		return ipt.error;
+	}
+
+	if (mask)
+		io_poll_complete(req, mask);
+	io_free_req(req);
+	return 0;
+}
+
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   const struct sqe_submit *s, bool force_nonblock,
 			   struct io_submit_state *state)
@@ -960,6 +1205,12 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_FSYNC:
 		ret = io_fsync(req, s->sqe, force_nonblock);
 		break;
+	case IORING_OP_POLL_ADD:
+		ret = io_poll_add(req, s->sqe);
+		break;
+	case IORING_OP_POLL_REMOVE:
+		ret = io_poll_remove(req, s->sqe);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1869,6 +2120,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);
 
+	io_poll_remove_all(ctx);
 	if (!(ctx->flags & IORING_SETUP_SQPOLL))
 		io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0d85da31e260..d319b2cd6319 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -25,6 +25,7 @@ struct io_uring_sqe {
 	union {
 		__kernel_rwf_t	rw_flags;
 		__u32		fsync_flags;
+		__u16		poll_events;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -51,6 +52,8 @@ struct io_uring_sqe {
 #define IORING_OP_FSYNC		3
 #define IORING_OP_READ_FIXED	4
 #define IORING_OP_WRITE_FIXED	5
+#define IORING_OP_POLL_ADD	6
+#define IORING_OP_POLL_REMOVE	7
 
 /*
  * sqe->fsync_flags
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (15 preceding siblings ...)
  2019-01-29 19:27 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
@ 2019-01-29 19:27 ` Jens Axboe
  2019-01-29 19:27 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:27 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

Right now we punt any buffered request that ends up triggering an
-EAGAIN to an async workqueue. This works fine in terms of providing
async execution of them, but it also can create quite a lot of work
queue items. For sequential buffered IO, it's advantageous to
serialize their issue. For reads, the first one will trigger a
read-ahead, and subsequent requests merely end up waiting on later pages
to complete. For writes, devices usually respond better to streamed
sequential writes.

Add state to track the last buffered request we punted to a work queue,
and if the next one is sequential to the previous, attempt to get the
previous work item to handle it. We limit the number of sequential
add-ons to a multiple (8) of the max read-ahead size of the file.
This should be a good number for both reads and writes, as it defines the
max IO size the device can do directly.

This drastically cuts down on the number of context switches we need to
handle buffered sequential IO, and a basic test case of copying a big
file with io_uring sees a 5x speedup.
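
The submission pattern that benefits is plain back-to-back sequential
buffered IO, roughly as below (sketch only; the sqe helper, the fd, and
the iovecs are assumed):

/* queue a window of sequential buffered reads; -EAGAIN punts for
   adjacent ranges now end up chained onto a single work item */
for (i = 0; i < nr_sqes; i++) {
	sqe = next_sqe(ring);		/* assumed helper */
	sqe->opcode = IORING_OP_READV;
	sqe->fd = src_fd;
	sqe->off = off;
	sqe->addr = (unsigned long) &iov[i];
	sqe->len = 1;
	off += iov[i].iov_len;
}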

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 245 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 201 insertions(+), 44 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f8c0bcc6d299..af2519d3e434 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -68,6 +68,16 @@ struct io_mapped_ubuf {
 	unsigned int	nr_bvecs;
 };
 
+struct async_list {
+	spinlock_t		lock;
+	atomic_t		cnt;
+	struct list_head	list;
+
+	struct file		*file;
+	off_t			io_end;
+	size_t			io_pages;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -130,6 +140,8 @@ struct io_ring_ctx {
 		struct list_head	poll_list;
 		struct list_head	cancel_list;
 	} ____cacheline_aligned_in_smp;
+
+	struct async_list	pending_async[2];
 };
 
 struct sqe_submit {
@@ -160,6 +172,7 @@ struct io_kiocb {
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
+#define REQ_F_SEQ_PREV		8	/* sequential with previous */
 	u64			user_data;
 	u64			error;
 
@@ -203,6 +216,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
 static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 {
 	struct io_ring_ctx *ctx;
+	int i;
 
 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
@@ -218,6 +232,11 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_completion(&ctx->ctx_done);
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
+	for (i = 0; i < ARRAY_SIZE(ctx->pending_async); i++) {
+		spin_lock_init(&ctx->pending_async[i].lock);
+		INIT_LIST_HEAD(&ctx->pending_async[i].list);
+		atomic_set(&ctx->pending_async[i].cnt, 0);
+	}
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->cancel_list);
@@ -776,6 +795,39 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
 }
 
+static void io_async_list_note(int rw, struct io_kiocb *req, size_t len)
+{
+	struct async_list *async_list = &req->ctx->pending_async[rw];
+	struct kiocb *kiocb = &req->rw;
+	struct file *filp = kiocb->ki_filp;
+	off_t io_end = kiocb->ki_pos + len;
+
+	if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) {
+		unsigned long max_pages;
+
+		/* Use 8x RA size as a decent limiter for both reads/writes */
+		max_pages = filp->f_ra.ra_pages;
+		if (!max_pages)
+			max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10);
+		max_pages *= 8;
+
+		len >>= PAGE_SHIFT;
+		if (async_list->io_pages + len <= max_pages) {
+			req->flags |= REQ_F_SEQ_PREV;
+			async_list->io_pages += len;
+		} else {
+			io_end = 0;
+			async_list->io_pages = 0;
+		}
+	}
+
+	if (async_list->file != filp) {
+		async_list->io_pages = 0;
+		async_list->file = filp;
+	}
+	async_list->io_end = io_end;
+}
+
 static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		       bool force_nonblock, struct io_submit_state *state)
 {
@@ -783,6 +835,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct kiocb *kiocb = &req->rw;
 	struct iov_iter iter;
 	struct file *file;
+	size_t iov_count;
 	ssize_t ret;
 
 	ret = io_prep_rw(req, sqe, force_nonblock, state);
@@ -801,16 +854,19 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (ret)
 		goto out_fput;
 
-	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	iov_count = iov_iter_count(&iter);
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
 		ssize_t ret2;
 
 		/* Catch -EAGAIN return for forced non-blocking submission */
 		ret2 = call_read_iter(file, kiocb, &iter);
-		if (!force_nonblock || ret2 != -EAGAIN)
+		if (!force_nonblock || ret2 != -EAGAIN) {
 			io_rw_done(kiocb, ret2);
-		else
+		} else {
+			io_async_list_note(READ, req, iov_count);
 			ret = -EAGAIN;
+		}
 	}
 	kfree(iovec);
 out_fput:
@@ -826,6 +882,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct kiocb *kiocb = &req->rw;
 	struct iov_iter iter;
 	struct file *file;
+	size_t iov_count;
 	ssize_t ret;
 
 	ret = io_prep_rw(req, sqe, force_nonblock, state);
@@ -833,10 +890,6 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return ret;
 	file = kiocb->ki_filp;
 
-	ret = -EAGAIN;
-	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
-		goto out_fput;
-
 	ret = -EBADF;
 	if (unlikely(!(file->f_mode & FMODE_WRITE)))
 		goto out_fput;
@@ -848,8 +901,15 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (ret)
 		goto out_fput;
 
-	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
-				iov_iter_count(&iter));
+	iov_count = iov_iter_count(&iter);
+
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) {
+		io_async_list_note(WRITE, req, iov_count);
+		goto out_free;
+	}
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
 		/*
 		 * Open-code file_start_write here to grab freeze protection,
@@ -867,6 +927,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		kiocb->ki_flags |= IOCB_WRITE;
 		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
 	}
+out_free:
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
@@ -1228,6 +1289,21 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx,
+						 const struct io_uring_sqe *sqe)
+{
+	switch (sqe->opcode) {
+	case IORING_OP_READV:
+	case IORING_OP_READ_FIXED:
+		return &ctx->pending_async[READ];
+	case IORING_OP_WRITEV:
+	case IORING_OP_WRITE_FIXED:
+		return &ctx->pending_async[WRITE];
+	default:
+		return NULL;
+	}
+}
+
 static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 {
 	return !(sqe->opcode == IORING_OP_READ_FIXED ||
@@ -1237,55 +1313,102 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct sqe_submit *s = &req->submit;
 	struct io_ring_ctx *ctx = req->ctx;
+	struct mm_struct *cur_mm = NULL;
 	struct files_struct *old_files;
-	struct io_uring_sqe sqe;
+	struct async_list *async_list;
+	LIST_HEAD(req_list);
 	mm_segment_t old_fs;
-	bool needs_user;
 	int ret;
 
-	/*
-	 * Ensure sqe is stable between checking if we need user access,
-	 * and actually importing the iovec further down the stack.
-	 */
-	memcpy(&sqe, s->sqe, sizeof(sqe));
-	s->sqe = &sqe;
-
-	 /* Ensure we clear previously set forced non-block flag */
-	req->flags &= ~REQ_F_FORCE_NONBLOCK;
-
 	task_lock(current);
 	old_files = current->files;
 	current->files = ctx->sqo_files;
 	task_unlock(current);
 
+	async_list = io_async_list_from_sqe(ctx, req->submit.sqe);
+restart:
+	do {
+		struct sqe_submit *s = &req->submit;
+		struct io_uring_sqe sqe;
+
+		/*
+		 * Ensure sqe is stable between checking if we need user access,
+		 * and actually importing the iovec further down the stack.
+		 */
+		memcpy(&sqe, s->sqe, sizeof(sqe));
+		s->sqe = &sqe;
+
+		/* Ensure we clear previously set forced non-block flag */
+		req->flags &= ~REQ_F_FORCE_NONBLOCK;
+
+		ret = 0;
+		if (io_sqe_needs_user(&sqe) && !cur_mm) {
+			if (!mmget_not_zero(ctx->sqo_mm)) {
+				ret = -EFAULT;
+			} else {
+				cur_mm = ctx->sqo_mm;
+				use_mm(ctx->sqo_mm);
+				old_fs = get_fs();
+				set_fs(USER_DS);
+			}
+		}
+
+		if (!ret)
+			ret = __io_submit_sqe(ctx, req, s, false, NULL);
+		if (ret) {
+			io_cqring_add_event(ctx, sqe.user_data, ret, 0);
+			io_free_req(req);
+		}
+		if (!async_list)
+			break;
+		if (!list_empty(&req_list)) {
+			req = list_first_entry(&req_list, struct io_kiocb,
+						list);
+			list_del(&req->list);
+			continue;
+		}
+		if (list_empty(&async_list->list))
+			break;
+
+		req = NULL;
+		spin_lock(&async_list->lock);
+		if (list_empty(&async_list->list)) {
+			spin_unlock(&async_list->lock);
+			break;
+		}
+		list_splice_init(&async_list->list, &req_list);
+		spin_unlock(&async_list->lock);
+
+		req = list_first_entry(&req_list, struct io_kiocb, list);
+		list_del(&req->list);
+	} while (req);
+
 	/*
-	 * If we're doing IO to fixed buffers, we don't need to get/set
-	 * user context
+	 * Rare case of racing with a submitter. If we find the count has
+	 * dropped to zero AND we have pending work items, then restart
+	 * the processing. This is a tiny race window.
 	 */
-	needs_user = io_sqe_needs_user(&sqe);
-	if (needs_user) {
-		if (!mmget_not_zero(ctx->sqo_mm)) {
-			ret = -EFAULT;
-			goto err;
+	ret = atomic_dec_return(&async_list->cnt);
+	while (!ret && !list_empty(&async_list->list)) {
+		spin_lock(&async_list->lock);
+		atomic_inc(&async_list->cnt);
+		list_splice_init(&async_list->list, &req_list);
+		spin_unlock(&async_list->lock);
+
+		if (!list_empty(&req_list)) {
+			req = list_first_entry(&req_list, struct io_kiocb,
+						list);
+			list_del(&req->list);
+			goto restart;
 		}
-		use_mm(ctx->sqo_mm);
-		old_fs = get_fs();
-		set_fs(USER_DS);
+		ret = atomic_dec_return(&async_list->cnt);
 	}
 
-	ret = __io_submit_sqe(ctx, req, s, false, NULL);
-
-	if (needs_user) {
+	if (cur_mm) {
 		set_fs(old_fs);
-		unuse_mm(ctx->sqo_mm);
-		mmput(ctx->sqo_mm);
-	}
-err:
-	if (ret) {
-		io_cqring_add_event(ctx, sqe.user_data, ret, 0);
-		io_free_req(req);
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
 	}
 
 	task_lock(current);
@@ -1293,6 +1416,33 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	task_unlock(current);
 }
 
+/*
+ * See if we can piggy back onto previously submitted work, that is still
+ * running. We currently only allow this if the new request is sequential
+ * to the previous one we punted.
+ */
+static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req)
+{
+	bool ret = false;
+
+	if (!list)
+		return false;
+	if (!(req->flags & REQ_F_SEQ_PREV))
+		return false;
+	if (!atomic_read(&list->cnt))
+		return false;
+
+	ret = true;
+	spin_lock(&list->lock);
+	list_add_tail(&req->list, &list->list);
+	if (!atomic_read(&list->cnt)) {
+		list_del_init(&req->list);
+		ret = false;
+	}
+	spin_unlock(&list->lock);
+	return ret;
+}
+
 static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 			 struct io_submit_state *state)
 {
@@ -1309,9 +1459,16 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 
 	ret = __io_submit_sqe(ctx, req, s, true, state);
 	if (ret == -EAGAIN) {
+		struct async_list *list;
+
+		list = io_async_list_from_sqe(ctx, s->sqe);
 		memcpy(&req->submit, s, sizeof(*s));
-		INIT_WORK(&req->work, io_sq_wq_submit_work);
-		queue_work(ctx->sqo_wq, &req->work);
+		if (!io_add_to_prev_work(list, req)) {
+			if (list)
+				atomic_inc(&list->cnt);
+			INIT_WORK(&req->work, io_sq_wq_submit_work);
+			queue_work(ctx->sqo_wq, &req->work);
+		}
 		ret = 0;
 	}
 	if (ret)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 18/18] io_uring: add io_uring_event cache hit information
  2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
                   ` (16 preceding siblings ...)
  2019-01-29 19:27 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
@ 2019-01-29 19:27 ` Jens Axboe
  17 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 19:27 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

Add a hint on whether a read was served out of the page cache, or if it
hit media. This is useful for buffered async IO; O_DIRECT reads would
never have this set (for obvious reasons).

If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.
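
Consuming the hint on the application side is just a flag check on the
completion, e.g. (sketch):

if (cqe->res > 0 && (cqe->flags & IOCQE_FLAG_CACHEHIT))
	stats.cached++;		/* read was served from the page cache */
else if (cqe->res > 0)
	stats.uncached++;	/* read had to go to the device */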

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 7 ++++++-
 include/uapi/linux/io_uring.h | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index af2519d3e434..c6187a8f62a9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -551,11 +551,16 @@ static void io_fput(struct io_kiocb *req)
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+	unsigned ev_flags = 0;
 
 	kiocb_end_write(kiocb);
 
 	io_fput(req);
-	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+
+	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
+		ev_flags = IOCQE_FLAG_CACHEHIT;
+
+	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
 	io_free_req(req);
 }
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d319b2cd6319..6782e4e0464b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -69,6 +69,11 @@ struct io_uring_cqe {
 	__u32	flags;
 };
 
+/*
+ * io_uring_event->flags
+ */
+#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* IO did not hit media */
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/18] io_uring: support for IO polling
  2019-01-29 19:26 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
@ 2019-01-29 20:47   ` Jann Horn
  2019-01-29 20:56     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 20:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> Add support for a polled io_uring context. When a read or write is
> submitted to a polled context, the application must poll for completions
> on the CQ ring through io_uring_enter(2). Polled IO may not generate
> IRQ completions, hence they need to be actively found by the application
> itself.
>
> To use polling, io_uring_setup() must be used with the
> IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
> polled and non-polled IO on an io_uring.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
[...]
> @@ -102,6 +102,8 @@ struct io_ring_ctx {
>
>         struct {
>                 spinlock_t              completion_lock;
> +               bool                    poll_multi_file;
> +               struct list_head        poll_list;

Please add a comment explaining what protects poll_list against
concurrent modification, and ideally also put lockdep asserts in the
functions that access the list to allow the kernel to sanity-check the
locking at runtime.
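
Sketch of what that could look like, for the paths that are supposed to
run under the ring mutex (non-SQPOLL case only):

static void io_iopoll_req_issued(struct io_kiocb *req)
{
	lockdep_assert_held(&req->ctx->uring_lock);
	/* ... existing list manipulation ... */
}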

As far as I understand:
Elements are added by io_iopoll_req_issued(). io_iopoll_req_issued()
can't race with itself because, depending on IORING_SETUP_SQPOLL,
either you have to come through sys_io_uring_enter() (which takes the
uring_lock), or you have to come from the single-threaded
io_sq_thread().
io_do_iopoll() iterates over the list and removes completed items.
io_do_iopoll() is called through io_iopoll_getevents(), which can be
invoked in two ways during normal operation:
 - sys_io_uring_enter -> __io_uring_enter -> io_iopoll_check
->io_iopoll_getevents; this is only protected by the uring_lock
 - io_sq_thread -> io_iopoll_check ->io_iopoll_getevents; this doesn't
hold any locks
Additionally, the following exit paths:
 - io_sq_thread -> io_iopoll_reap_events -> io_iopoll_getevents
 - io_uring_release -> io_ring_ctx_wait_and_kill ->
io_iopoll_reap_events -> io_iopoll_getevents
 - io_uring_release -> io_ring_ctx_wait_and_kill -> io_ring_ctx_free
-> io_iopoll_reap_events -> io_iopoll_getevents

So as far as I can tell, you can have various races around access to
the poll_list.

[...]
> +static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
> +{
> +       if (*nr) {
> +               kmem_cache_free_bulk(req_cachep, *nr, reqs);
> +               io_ring_drop_ctx_refs(ctx, *nr);
> +               *nr = 0;
> +       }
> +}
[...]
> +static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
> +                              struct list_head *done)
> +{
> +       void *reqs[IO_IOPOLL_BATCH];
> +       struct io_kiocb *req;
> +       int to_free = 0;
> +
> +       while (!list_empty(done)) {
> +               req = list_first_entry(done, struct io_kiocb, list);
> +               list_del(&req->list);
> +
> +               io_cqring_fill_event(ctx, req->user_data, req->error, 0);
> +
> +               reqs[to_free++] = req;
> +               (*nr_events)++;
> +
> +               fput(req->rw.ki_filp);
> +               if (to_free == ARRAY_SIZE(reqs))
> +                       io_free_req_many(ctx, reqs, &to_free);
> +       }
> +       io_commit_cqring(ctx);
> +
> +       if (to_free)
> +               io_free_req_many(ctx, reqs, &to_free);

Nit: You check here whether to_free==0, and then io_free_req_many()
does that again. You can delete one of those checks; I'd probably
delete this one.

> +}
[...]

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/18] io_uring: support for IO polling
  2019-01-29 20:47   ` Jann Horn
@ 2019-01-29 20:56     ` Jens Axboe
  2019-01-29 21:10       ` Jann Horn
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 20:56 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 1:47 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>> Add support for a polled io_uring context. When a read or write is
>> submitted to a polled context, the application must poll for completions
>> on the CQ ring through io_uring_enter(2). Polled IO may not generate
>> IRQ completions, hence they need to be actively found by the application
>> itself.
>>
>> To use polling, io_uring_setup() must be used with the
>> IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
>> polled and non-polled IO on an io_uring.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> [...]
>> @@ -102,6 +102,8 @@ struct io_ring_ctx {
>>
>>         struct {
>>                 spinlock_t              completion_lock;
>> +               bool                    poll_multi_file;
>> +               struct list_head        poll_list;
> 
> Please add a comment explaining what protects poll_list against
> concurrent modification, and ideally also put lockdep asserts in the
> functions that access the list to allow the kernel to sanity-check the
> locking at runtime.

Not sure that's needed, and it would be a bit difficult with the SQPOLL
thread and non-thread being different cases.

But comments I can definitely add.

> As far as I understand:
> Elements are added by io_iopoll_req_issued(). io_iopoll_req_issued()
> can't race with itself because, depending on IORING_SETUP_SQPOLL,
> either you have to come through sys_io_uring_enter() (which takes the
> uring_lock), or you have to come from the single-threaded
> io_sq_thread().
> io_do_iopoll() iterates over the list and removes completed items.
> io_do_iopoll() is called through io_iopoll_getevents(), which can be
> invoked in two ways during normal operation:
>  - sys_io_uring_enter -> __io_uring_enter -> io_iopoll_check
> ->io_iopoll_getevents; this is only protected by the uring_lock
>  - io_sq_thread -> io_iopoll_check ->io_iopoll_getevents; this doesn't
> hold any locks
> Additionally, the following exit paths:
>  - io_sq_thread -> io_iopoll_reap_events -> io_iopoll_getevents
>  - io_uring_release -> io_ring_ctx_wait_and_kill ->
> io_iopoll_reap_events -> io_iopoll_getevents
>  - io_uring_release -> io_ring_ctx_wait_and_kill -> io_ring_ctx_free
> -> io_iopoll_reap_events -> io_iopoll_getevents

Yes, your understanding is correct. But of important note, those two
cases don't co-exist. If you are using SQPOLL, then only the thread
itself is the one that modifies the list. The only valid call of
io_uring_enter(2) is to wakeup the thread, the task itself will NOT be
doing any issues. If you are NOT using SQPOLL, then any access is inside
the ->uring_lock.

For the reap cases, we don't enter those at shutdown for SQPOLL, we
expect the thread to do it. Hence we wait for the thread to exit before
we do our final release.

> So as far as I can tell, you can have various races around access to
> the poll_list.

How did you make that leap?

>> +static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>> +{
>> +       if (*nr) {
>> +               kmem_cache_free_bulk(req_cachep, *nr, reqs);
>> +               io_ring_drop_ctx_refs(ctx, *nr);
>> +               *nr = 0;
>> +       }
>> +}
> [...]
>> +static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
>> +                              struct list_head *done)
>> +{
>> +       void *reqs[IO_IOPOLL_BATCH];
>> +       struct io_kiocb *req;
>> +       int to_free = 0;
>> +
>> +       while (!list_empty(done)) {
>> +               req = list_first_entry(done, struct io_kiocb, list);
>> +               list_del(&req->list);
>> +
>> +               io_cqring_fill_event(ctx, req->user_data, req->error, 0);
>> +
>> +               reqs[to_free++] = req;
>> +               (*nr_events)++;
>> +
>> +               fput(req->rw.ki_filp);
>> +               if (to_free == ARRAY_SIZE(reqs))
>> +                       io_free_req_many(ctx, reqs, &to_free);
>> +       }
>> +       io_commit_cqring(ctx);
>> +
>> +       if (to_free)
>> +               io_free_req_many(ctx, reqs, &to_free);
> 
> Nit: You check here whether to_free==0, and then io_free_req_many()
> does that again. You can delete one of those checks; I'd probably
> delete this one.

Agree, I'll kill it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/18] io_uring: support for IO polling
  2019-01-29 20:56     ` Jens Axboe
@ 2019-01-29 21:10       ` Jann Horn
  2019-01-29 21:33         ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 21:10 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 9:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/29/19 1:47 PM, Jann Horn wrote:
> > On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> Add support for a polled io_uring context. When a read or write is
> >> submitted to a polled context, the application must poll for completions
> >> on the CQ ring through io_uring_enter(2). Polled IO may not generate
> >> IRQ completions, hence they need to be actively found by the application
> >> itself.
> >>
> >> To use polling, io_uring_setup() must be used with the
> >> IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
> >> polled and non-polled IO on an io_uring.
> >>
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > [...]
> >> @@ -102,6 +102,8 @@ struct io_ring_ctx {
> >>
> >>         struct {
> >>                 spinlock_t              completion_lock;
> >> +               bool                    poll_multi_file;
> >> +               struct list_head        poll_list;
> >
> > Please add a comment explaining what protects poll_list against
> > concurrent modification, and ideally also put lockdep asserts in the
> > functions that access the list to allow the kernel to sanity-check the
> > locking at runtime.
>
> Not sure that's needed, and it would be a bit difficult with the SQPOLL
> thread and non-thread being different cases.
>
> But comments I can definitely add.
>
> > As far as I understand:
> > Elements are added by io_iopoll_req_issued(). io_iopoll_req_issued()
> > can't race with itself because, depending on IORING_SETUP_SQPOLL,
> > either you have to come through sys_io_uring_enter() (which takes the
> > uring_lock), or you have to come from the single-threaded
> > io_sq_thread().
> > io_do_iopoll() iterates over the list and removes completed items.
> > io_do_iopoll() is called through io_iopoll_getevents(), which can be
> > invoked in two ways during normal operation:
> >  - sys_io_uring_enter -> __io_uring_enter -> io_iopoll_check
> > ->io_iopoll_getevents; this is only protected by the uring_lock
> >  - io_sq_thread -> io_iopoll_check ->io_iopoll_getevents; this doesn't
> > hold any locks
> > Additionally, the following exit paths:
> >  - io_sq_thread -> io_iopoll_reap_events -> io_iopoll_getevents
> >  - io_uring_release -> io_ring_ctx_wait_and_kill ->
> > io_iopoll_reap_events -> io_iopoll_getevents
> >  - io_uring_release -> io_ring_ctx_wait_and_kill -> io_ring_ctx_free
> > -> io_iopoll_reap_events -> io_iopoll_getevents
>
> Yes, your understanding is correct. But of important note, those two
> cases don't co-exist. If you are using SQPOLL, then only the thread
> itself is the one that modifies the list. The only valid call of
> io_uring_enter(2) is to wakeup the thread, the task itself will NOT be
> doing any issues. If you are NOT using SQPOLL, then any access is inside
> the ->uring_lock.
>
> For the reap cases, we don't enter those at shutdown for SQPOLL, we
> expect the thread to do it. Hence we wait for the thread to exit before
> we do our final release.
>
> > So as far as I can tell, you can have various races around access to
> > the poll_list.
>
> How did you make that leap?

Ah, you're right, I missed a check when going through
__io_uring_enter(), never mind.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 07/18] io_uring: support for IO polling
  2019-01-29 21:10       ` Jann Horn
@ 2019-01-29 21:33         ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 21:33 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 2:10 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 9:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/29/19 1:47 PM, Jann Horn wrote:
>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> Add support for a polled io_uring context. When a read or write is
>>>> submitted to a polled context, the application must poll for completions
>>>> on the CQ ring through io_uring_enter(2). Polled IO may not generate
>>>> IRQ completions, hence they need to be actively found by the application
>>>> itself.
>>>>
>>>> To use polling, io_uring_setup() must be used with the
>>>> IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
>>>> polled and non-polled IO on an io_uring.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> [...]
>>>> @@ -102,6 +102,8 @@ struct io_ring_ctx {
>>>>
>>>>         struct {
>>>>                 spinlock_t              completion_lock;
>>>> +               bool                    poll_multi_file;
>>>> +               struct list_head        poll_list;
>>>
>>> Please add a comment explaining what protects poll_list against
>>> concurrent modification, and ideally also put lockdep asserts in the
>>> functions that access the list to allow the kernel to sanity-check the
>>> locking at runtime.
>>
>> Not sure that's needed, and it would be a bit difficult with the SQPOLL
>> thread and non-thread being different cases.
>>
>> But comments I can definitely add.
>>
>>> As far as I understand:
>>> Elements are added by io_iopoll_req_issued(). io_iopoll_req_issued()
>>> can't race with itself because, depending on IORING_SETUP_SQPOLL,
>>> either you have to come through sys_io_uring_enter() (which takes the
>>> uring_lock), or you have to come from the single-threaded
>>> io_sq_thread().
>>> io_do_iopoll() iterates over the list and removes completed items.
>>> io_do_iopoll() is called through io_iopoll_getevents(), which can be
>>> invoked in two ways during normal operation:
>>>  - sys_io_uring_enter -> __io_uring_enter -> io_iopoll_check
>>> ->io_iopoll_getevents; this is only protected by the uring_lock
>>>  - io_sq_thread -> io_iopoll_check ->io_iopoll_getevents; this doesn't
>>> hold any locks
>>> Additionally, the following exit paths:
>>>  - io_sq_thread -> io_iopoll_reap_events -> io_iopoll_getevents
>>>  - io_uring_release -> io_ring_ctx_wait_and_kill ->
>>> io_iopoll_reap_events -> io_iopoll_getevents
>>>  - io_uring_release -> io_ring_ctx_wait_and_kill -> io_ring_ctx_free
>>> -> io_iopoll_reap_events -> io_iopoll_getevents
>>
>> Yes, your understanding is correct. But of important note, those two
>> cases don't co-exist. If you are using SQPOLL, then only the thread
>> itself is the one that modifies the list. The only valid call of
>> io_uring_enter(2) is to wakeup the thread, the task itself will NOT be
>> doing any issues. If you are NOT using SQPOLL, then any access is inside
>> the ->uring_lock.
>>
>> For the reap cases, we don't enter those at shutdown for SQPOLL, we
>> expect the thread to do it. Hence we wait for the thread to exit before
>> we do our final release.
>>
>>> So as far as I can tell, you can have various races around access to
>>> the poll_list.
>>
>> How did you make that leap?
> 
> Ah, you're right, I missed a check when going through
> __io_uring_enter(), never mind.

OK good, thanks for confirming, was afraid I was starting to lose my
mind.
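
Something next to the field itself is probably enough to capture the
rule we just went through (sketch only, not the actual patch):

        /*
         * poll_list is only ever modified from one context: the
         * io_sq_thread() task if IORING_SETUP_SQPOLL is set, otherwise
         * the submitting task under ->uring_lock.
         */
        struct list_head        poll_list;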

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 19:26 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
@ 2019-01-29 22:44   ` Jann Horn
  2019-01-29 22:56     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 22:44 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> If we have fixed user buffers, we can map them into the kernel when we
> setup the io_context. That avoids the need to do get_user_pages() for
> each and every IO.
>
> To utilize this feature, the application must call io_uring_register()
> after having setup an io_uring context, passing in
> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
> to an iovec array, and the nr_args should contain how many iovecs the
> application wishes to map.
>
> If successful, these buffers are now mapped into the kernel, eligible
> for IO. To use these fixed buffers, the application must use the
> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
> must point to somewhere inside the indexed buffer.
>
> The application may register buffers throughout the lifetime of the
> io_uring context. It can call io_uring_register() with
> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
> buffers, and then register a new set. The application need not
> unregister buffers explicitly before shutting down the io_uring context.
[...]
> +static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
> +                          const struct io_uring_sqe *sqe,
> +                          struct iov_iter *iter)
> +{
> +       size_t len = READ_ONCE(sqe->len);
> +       struct io_mapped_ubuf *imu;
> +       int buf_index, index;
> +       size_t offset;
> +       u64 buf_addr;
> +
> +       /* attempt to use fixed buffers without having provided iovecs */
> +       if (unlikely(!ctx->user_bufs))
> +               return -EFAULT;
> +
> +       buf_index = READ_ONCE(sqe->buf_index);
> +       if (unlikely(buf_index >= ctx->nr_user_bufs))
> +               return -EFAULT;

Nit: If you make the local copy of buf_index unsigned, it is slightly
easier to see that this code is correct. (I know, it has to be
positive anyway because the value in shared memory is a u16.)

> +       index = array_index_nospec(buf_index, ctx->sq_entries);

This looks weird. Did you mean s/ctx->sq_entries/ctx->nr_user_bufs/?

> +       imu = &ctx->user_bufs[index];
> +       buf_addr = READ_ONCE(sqe->addr);
> +       if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)

This can wrap around if `buf_addr` or `len` is very big, right? Then
you e.g. get past the first check because `buf_addr` is sufficiently
big, and get past the second check because `buf_addr + len` wraps
around and becomes small.

> +               return -EFAULT;
> +
> +       /*
> +        * May not be a start of buffer, set size appropriately
> +        * and advance us to the beginning.
> +        */
> +       offset = buf_addr - imu->ubuf;
> +       iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
> +       if (offset)
> +               iov_iter_advance(iter, offset);
> +       return 0;
> +}
> +
>  static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
>                            const struct io_uring_sqe *sqe, struct iovec **iovec,
>                            struct iov_iter *iter)
>  {
>         void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
>         size_t sqe_len = READ_ONCE(sqe->len);
> +       int opcode;
> +
> +       opcode = READ_ONCE(sqe->opcode);
> +       if (opcode == IORING_OP_READ_FIXED ||
> +           opcode == IORING_OP_WRITE_FIXED) {
> +               ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
> +               *iovec = NULL;
> +               return ret;
> +       }
[...]
>
> +static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
> +{
> +       return !(sqe->opcode == IORING_OP_READ_FIXED ||
> +                sqe->opcode == IORING_OP_WRITE_FIXED);
> +}

This still looks racy to me?

>  static void io_sq_wq_submit_work(struct work_struct *work)
>  {
[...]
> -       if (!mmget_not_zero(ctx->sqo_mm)) {
> -               ret = -EFAULT;
> -               goto err;
> +       /*
> +        * If we're doing IO to fixed buffers, we don't need to get/set
> +        * user context
> +        */
> +       needs_user = io_sqe_needs_user(&sqe);
> +       if (needs_user) {
> +               if (!mmget_not_zero(ctx->sqo_mm)) {
> +                       ret = -EFAULT;
> +                       goto err;
> +               }
> +               use_mm(ctx->sqo_mm);
> +               old_fs = get_fs();
> +               set_fs(USER_DS);
>         }
>
> -       use_mm(ctx->sqo_mm);
> -       set_fs(USER_DS);
> -
>         ret = __io_submit_sqe(ctx, req, s, false, NULL);
>
> -       set_fs(old_fs);
> -       unuse_mm(ctx->sqo_mm);
> -       mmput(ctx->sqo_mm);
> +       if (needs_user) {
> +               set_fs(old_fs);
> +               unuse_mm(ctx->sqo_mm);
> +               mmput(ctx->sqo_mm);
> +       }
>  err:
>         if (ret) {
> -               io_cqring_add_event(ctx, user_data, ret, 0);
> +               io_cqring_add_event(ctx, sqe.user_data, ret, 0);
>                 io_free_req(req);
>         }
[...]
> +static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
> +                      void __user *arg, unsigned index)
> +{

This function doesn't actually use the "ctx" parameter, right? You
might want to remove it.

> +       struct iovec __user *src;
> +
> +#ifdef CONFIG_COMPAT
> +       if (in_compat_syscall()) {
> +               struct compat_iovec __user *ciovs;
> +               struct compat_iovec ciov;
> +
> +               ciovs = (struct compat_iovec __user *) arg;
> +               if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
> +                       return -EFAULT;
> +
> +               dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
> +               dst->iov_len = ciov.iov_len;
> +               return 0;
> +       }
> +#endif
> +       src = (struct iovec __user *) arg;
> +       if (copy_from_user(dst, &src[index], sizeof(*dst)))
> +               return -EFAULT;
> +       return 0;
> +}
> +
> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> +                                 unsigned nr_args)
> +{
> +       struct vm_area_struct **vmas = NULL;
> +       struct page **pages = NULL;
> +       int i, j, got_pages = 0;
> +       int ret = -EINVAL;
> +
> +       if (ctx->user_bufs)
> +               return -EBUSY;
> +       if (!nr_args || nr_args > UIO_MAXIOV)
> +               return -EINVAL;
> +
> +       ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
> +                                       GFP_KERNEL);
> +       if (!ctx->user_bufs)
> +               return -ENOMEM;
> +
> +       if (!capable(CAP_IPC_LOCK))
> +               ctx->user = get_uid(current_user());
> +
> +       for (i = 0; i < nr_args; i++) {
> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> +               unsigned long off, start, end, ubuf;
> +               int pret, nr_pages;
> +               struct iovec iov;
> +               size_t size;
> +
> +               ret = io_copy_iov(ctx, &iov, arg, i);
> +               if (ret)
> +                       break;
> +
> +               /*
> +                * Don't impose further limits on the size and buffer
> +                * constraints here, we'll -EINVAL later when IO is
> +                * submitted if they are wrong.
> +                */
> +               ret = -EFAULT;
> +               if (!iov.iov_base)
> +                       goto err;
> +
> +               /* arbitrary limit, but we need something */
> +               if (iov.iov_len > SZ_1G)
> +                       goto err;

You might also want to check for iov_len==0? Otherwise, if iov_base
isn't page-aligned, the following code might grab a reference to one
page even though the iov covers zero pages, that'd be kinda weird.

> +               ubuf = (unsigned long) iov.iov_base;
> +               end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +               start = ubuf >> PAGE_SHIFT;
> +               nr_pages = end - start;
> +
> +               ret = io_account_mem(ctx, nr_pages);
> +               if (ret)
> +                       goto err;
> +
> +               if (!pages || nr_pages > got_pages) {
> +                       kfree(vmas);
> +                       kfree(pages);
> +                       pages = kmalloc_array(nr_pages, sizeof(struct page *),
> +                                               GFP_KERNEL);
> +                       vmas = kmalloc_array(nr_pages,
> +                                       sizeof(struct vma_area_struct *),
> +                                       GFP_KERNEL);
> +                       if (!pages || !vmas) {
> +                               io_unaccount_mem(ctx, nr_pages);
> +                               goto err;
> +                       }
> +                       got_pages = nr_pages;
> +               }
> +
> +               imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
> +                                               GFP_KERNEL);
> +               if (!imu->bvec) {
> +                       io_unaccount_mem(ctx, nr_pages);
> +                       goto err;
> +               }
> +
> +               down_write(&current->mm->mmap_sem);

Is there a reason why you're using down_write() and not down_read()?
As far as I can tell, down_read() is all you need...

> +               pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
> +                                               pages, vmas);
> +               if (pret == nr_pages) {
> +                       /* don't support file backed memory */
> +                       for (j = 0; j < nr_pages; j++) {
> +                               struct vm_area_struct *vma = vmas[j];
> +
> +                               if (vma->vm_file) {
> +                                       ret = -EOPNOTSUPP;
> +                                       break;
> +                               }
> +                       }

Are you intentionally doing the check for vma->vm_file instead of
calling GUP with FOLL_ANON, which would automatically verify
vma->vm_ops==NULL for you using vma_is_anonymous()? FOLL_ANON is what
procfs uses to avoid blocking on page faults when reading remote
process memory via /proc/*/{cmdline,environ}. I don't entirely
understand the motivation for this check, so I can't really tell
whether FOLL_ANON would do the job.

> +               } else {
> +                       ret = pret < 0 ? pret : -EFAULT;
> +               }
> +               up_write(&current->mm->mmap_sem);
> +               if (ret) {
> +                       /*
> +                        * if we did partial map, or found file backed vmas,
> +                        * release any pages we did get
> +                        */
> +                       if (pret > 0) {
> +                               for (j = 0; j < pret; j++)
> +                                       put_page(pages[j]);
> +                       }
> +                       io_unaccount_mem(ctx, nr_pages);
> +                       goto err;
> +               }
> +
> +               off = ubuf & ~PAGE_MASK;
> +               size = iov.iov_len;
> +               for (j = 0; j < nr_pages; j++) {
> +                       size_t vec_len;
> +
> +                       vec_len = min_t(size_t, size, PAGE_SIZE - off);
> +                       imu->bvec[j].bv_page = pages[j];
> +                       imu->bvec[j].bv_len = vec_len;
> +                       imu->bvec[j].bv_offset = off;
> +                       off = 0;
> +                       size -= vec_len;
> +               }
> +               /* store original address for later verification */
> +               imu->ubuf = ubuf;
> +               imu->len = iov.iov_len;
> +               imu->nr_bvecs = nr_pages;
> +       }
> +       kfree(pages);
> +       kfree(vmas);
> +       ctx->nr_user_bufs = nr_args;
> +       return 0;
> +err:
> +       kfree(pages);
> +       kfree(vmas);
> +       io_sqe_buffer_unregister(ctx);
> +       return ret;
> +}

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 22:44   ` Jann Horn
@ 2019-01-29 22:56     ` Jens Axboe
  2019-01-29 23:03       ` Jann Horn
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 22:56 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 3:44 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>> If we have fixed user buffers, we can map them into the kernel when we
>> setup the io_context. That avoids the need to do get_user_pages() for
>> each and every IO.
>>
>> To utilize this feature, the application must call io_uring_register()
>> after having setup an io_uring context, passing in
>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
>> to an iovec array, and the nr_args should contain how many iovecs the
>> application wishes to map.
>>
>> If successful, these buffers are now mapped into the kernel, eligible
>> for IO. To use these fixed buffers, the application must use the
>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
>> must point to somewhere inside the indexed buffer.
>>
>> The application may register buffers throughout the lifetime of the
>> io_uring context. It can call io_uring_register() with
>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
>> buffers, and then register a new set. The application need not
>> unregister buffers explicitly before shutting down the io_uring context.
> [...]
>> +static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
>> +                          const struct io_uring_sqe *sqe,
>> +                          struct iov_iter *iter)
>> +{
>> +       size_t len = READ_ONCE(sqe->len);
>> +       struct io_mapped_ubuf *imu;
>> +       int buf_index, index;
>> +       size_t offset;
>> +       u64 buf_addr;
>> +
>> +       /* attempt to use fixed buffers without having provided iovecs */
>> +       if (unlikely(!ctx->user_bufs))
>> +               return -EFAULT;
>> +
>> +       buf_index = READ_ONCE(sqe->buf_index);
>> +       if (unlikely(buf_index >= ctx->nr_user_bufs))
>> +               return -EFAULT;
> 
> Nit: If you make the local copy of buf_index unsigned, it is slightly
> easier to see that this code is correct. (I know, it has to be
> positive anyway because the value in shared memory is a u16.)

It'll definitely fit, but I can make it unsigned.

>> +       index = array_index_nospec(buf_index, ctx->sq_entries);
> 
> This looks weird. Did you mean s/ctx->sq_entries/ctx->nr_user_bufs/?

I did, too much copy/paste for that one. Fixed.

>> +       imu = &ctx->user_bufs[index];
>> +       buf_addr = READ_ONCE(sqe->addr);
>> +       if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
> 
> This can wrap around if `buf_addr` or `len` is very big, right? Then
> you e.g. get past the first check because `buf_addr` is sufficiently
> big, and get past the second check because `buf_addr + len` wraps
> around and becomes small.

Good point. I wonder if we have a verification helper for something like
this?

>> +               return -EFAULT;
>> +
>> +       /*
>> +        * May not be a start of buffer, set size appropriately
>> +        * and advance us to the beginning.
>> +        */
>> +       offset = buf_addr - imu->ubuf;
>> +       iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
>> +       if (offset)
>> +               iov_iter_advance(iter, offset);
>> +       return 0;
>> +}
>> +
>>  static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
>>                            const struct io_uring_sqe *sqe, struct iovec **iovec,
>>                            struct iov_iter *iter)
>>  {
>>         void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
>>         size_t sqe_len = READ_ONCE(sqe->len);
>> +       int opcode;
>> +
>> +       opcode = READ_ONCE(sqe->opcode);
>> +       if (opcode == IORING_OP_READ_FIXED ||
>> +           opcode == IORING_OP_WRITE_FIXED) {
>> +               ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
>> +               *iovec = NULL;
>> +               return ret;
>> +       }
> [...]
>>
>> +static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
>> +{
>> +       return !(sqe->opcode == IORING_OP_READ_FIXED ||
>> +                sqe->opcode == IORING_OP_WRITE_FIXED);
>> +}
> 
> This still looks racy to me?

I didn't change it because the one you quoted below
(io_sq_wq_submit_work()) is using a local copy, but we do need it
for the SQPOLL io_sq_thread() case. I'll get that one fixed up.

I suspect the easiest fix is to ensure that io_sq_thread() copies the
sqe.

>> +static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
>> +                      void __user *arg, unsigned index)
>> +{
> 
> This function doesn't actually use the "ctx" parameter, right? You
> might want to remove it.

It should just use ctx->compat for this check, since we now only call
in_compat_syscall() when we set up the ctx. This keeps it all in one
spot.

>> +       struct iovec __user *src;
>> +
>> +#ifdef CONFIG_COMPAT
>> +       if (in_compat_syscall()) {
>> +               struct compat_iovec __user *ciovs;
>> +               struct compat_iovec ciov;
>> +
>> +               ciovs = (struct compat_iovec __user *) arg;
>> +               if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
>> +                       return -EFAULT;
>> +
>> +               dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
>> +               dst->iov_len = ciov.iov_len;
>> +               return 0;
>> +       }
>> +#endif
>> +       src = (struct iovec __user *) arg;
>> +       if (copy_from_user(dst, &src[index], sizeof(*dst)))
>> +               return -EFAULT;
>> +       return 0;
>> +}
>> +
>> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
>> +                                 unsigned nr_args)
>> +{
>> +       struct vm_area_struct **vmas = NULL;
>> +       struct page **pages = NULL;
>> +       int i, j, got_pages = 0;
>> +       int ret = -EINVAL;
>> +
>> +       if (ctx->user_bufs)
>> +               return -EBUSY;
>> +       if (!nr_args || nr_args > UIO_MAXIOV)
>> +               return -EINVAL;
>> +
>> +       ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
>> +                                       GFP_KERNEL);
>> +       if (!ctx->user_bufs)
>> +               return -ENOMEM;
>> +
>> +       if (!capable(CAP_IPC_LOCK))
>> +               ctx->user = get_uid(current_user());
>> +
>> +       for (i = 0; i < nr_args; i++) {
>> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +               unsigned long off, start, end, ubuf;
>> +               int pret, nr_pages;
>> +               struct iovec iov;
>> +               size_t size;
>> +
>> +               ret = io_copy_iov(ctx, &iov, arg, i);
>> +               if (ret)
>> +                       break;
>> +
>> +               /*
>> +                * Don't impose further limits on the size and buffer
>> +                * constraints here, we'll -EINVAL later when IO is
>> +                * submitted if they are wrong.
>> +                */
>> +               ret = -EFAULT;
>> +               if (!iov.iov_base)
>> +                       goto err;
>> +
>> +               /* arbitrary limit, but we need something */
>> +               if (iov.iov_len > SZ_1G)
>> +                       goto err;
> 
> You might also want to check for iov_len==0? Otherwise, if iov_base
> isn't page-aligned, the following code might grab a reference to one
> page even though the iov covers zero pages, that'd be kinda weird.

Good catch, will do.

> 
>> +               ubuf = (unsigned long) iov.iov_base;
>> +               end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> +               start = ubuf >> PAGE_SHIFT;
>> +               nr_pages = end - start;
>> +
>> +               ret = io_account_mem(ctx, nr_pages);
>> +               if (ret)
>> +                       goto err;
>> +
>> +               if (!pages || nr_pages > got_pages) {
>> +                       kfree(vmas);
>> +                       kfree(pages);
>> +                       pages = kmalloc_array(nr_pages, sizeof(struct page *),
>> +                                               GFP_KERNEL);
>> +                       vmas = kmalloc_array(nr_pages,
>> +                                       sizeof(struct vma_area_struct *),
>> +                                       GFP_KERNEL);
>> +                       if (!pages || !vmas) {
>> +                               io_unaccount_mem(ctx, nr_pages);
>> +                               goto err;
>> +                       }
>> +                       got_pages = nr_pages;
>> +               }
>> +
>> +               imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
>> +                                               GFP_KERNEL);
>> +               if (!imu->bvec) {
>> +                       io_unaccount_mem(ctx, nr_pages);
>> +                       goto err;
>> +               }
>> +
>> +               down_write(&current->mm->mmap_sem);
> 
> Is there a reason why you're using down_write() and not down_read()?
> As far as I can tell, down_read() is all you need...

Looks like you are right, I'll change that.

>> +               pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
>> +                                               pages, vmas);
>> +               if (pret == nr_pages) {
>> +                       /* don't support file backed memory */
>> +                       for (j = 0; j < nr_pages; j++) {
>> +                               struct vm_area_struct *vma = vmas[j];
>> +
>> +                               if (vma->vm_file) {
>> +                                       ret = -EOPNOTSUPP;
>> +                                       break;
>> +                               }
>> +                       }
> 
> Are you intentionally doing the check for vma->vm_file instead of
> calling GUP with FOLL_ANON, which would automatically verify
> vma->vm_ops==NULL for you using vma_is_anonymous()? FOLL_ANON is what
> procfs uses to avoid blocking on page faults when reading remote
> process memory via /proc/*/{cmdline,environ}. I don't entirely
> understand the motivation for this check, so I can't really tell
> whether FOLL_ANON would do the job.

I wasn't aware of FOLL_ANON, but it looks exactly like what I need. All
I care about is the mapping being anon and not file backed. If FOLL_ANON
ensures that (or fails otherwise), then that'll kill this vma checking
code. Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 22:56     ` Jens Axboe
@ 2019-01-29 23:03       ` Jann Horn
  2019-01-29 23:06         ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 23:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/29/19 3:44 PM, Jann Horn wrote:
> > On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> If we have fixed user buffers, we can map them into the kernel when we
> >> setup the io_context. That avoids the need to do get_user_pages() for
> >> each and every IO.
> >>
> >> To utilize this feature, the application must call io_uring_register()
> >> after having setup an io_uring context, passing in
> >> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
> >> to an iovec array, and the nr_args should contain how many iovecs the
> >> application wishes to map.
> >>
> >> If successful, these buffers are now mapped into the kernel, eligible
> >> for IO. To use these fixed buffers, the application must use the
> >> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
> >> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
> >> must point to somewhere inside the indexed buffer.
> >>
> >> The application may register buffers throughout the lifetime of the
> >> io_uring context. It can call io_uring_register() with
> >> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
> >> buffers, and then register a new set. The application need not
> >> unregister buffers explicitly before shutting down the io_uring context.
[...]
> >> +       imu = &ctx->user_bufs[index];
> >> +       buf_addr = READ_ONCE(sqe->addr);
> >> +       if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
> >
> > This can wrap around if `buf_addr` or `len` is very big, right? Then
> > you e.g. get past the first check because `buf_addr` is sufficiently
> > big, and get past the second check because `buf_addr + len` wraps
> > around and becomes small.
>
> Good point. I wonder if we have a verification helper for something like
> this?

check_add_overflow() exists, I guess that might help a bit. I don't
think I've seen a more specific helper for this situation.
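
For illustration, used here it would look roughly like this (untested
sketch; buf_end is just a local I'm making up):

        u64 buf_end;

        if (check_add_overflow(buf_addr, (u64)len, &buf_end))
                return -EFAULT;
        if (buf_addr < imu->ubuf || buf_end > imu->ubuf + imu->len)
                return -EFAULT;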

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 23:03       ` Jann Horn
@ 2019-01-29 23:06         ` Jens Axboe
  2019-01-29 23:08           ` Jann Horn
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 23:06 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 4:03 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/29/19 3:44 PM, Jann Horn wrote:
>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> If we have fixed user buffers, we can map them into the kernel when we
>>>> setup the io_context. That avoids the need to do get_user_pages() for
>>>> each and every IO.
>>>>
>>>> To utilize this feature, the application must call io_uring_register()
>>>> after having setup an io_uring context, passing in
>>>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
>>>> to an iovec array, and the nr_args should contain how many iovecs the
>>>> application wishes to map.
>>>>
>>>> If successful, these buffers are now mapped into the kernel, eligible
>>>> for IO. To use these fixed buffers, the application must use the
>>>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
>>>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
>>>> must point to somewhere inside the indexed buffer.
>>>>
>>>> The application may register buffers throughout the lifetime of the
>>>> io_uring context. It can call io_uring_register() with
>>>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
>>>> buffers, and then register a new set. The application need not
>>>> unregister buffers explicitly before shutting down the io_uring context.
> [...]
>>>> +       imu = &ctx->user_bufs[index];
>>>> +       buf_addr = READ_ONCE(sqe->addr);
>>>> +       if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
>>>
>>> This can wrap around if `buf_addr` or `len` is very big, right? Then
>>> you e.g. get past the first check because `buf_addr` is sufficiently
>>> big, and get past the second check because `buf_addr + len` wraps
>>> around and becomes small.
>>
>> Good point. I wonder if we have a verification helper for something like
>> this?
> 
> check_add_overflow() exists, I guess that might help a bit. I don't
> think I've seen a more specific helper for this situation.

Hmm, not super appropriate. How about something a la:

if (buf_addr + len < buf_addr)
    ... overflow ...

?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 23:06         ` Jens Axboe
@ 2019-01-29 23:08           ` Jann Horn
  2019-01-29 23:14             ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 23:08 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Wed, Jan 30, 2019 at 12:06 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/29/19 4:03 PM, Jann Horn wrote:
> > On Tue, Jan 29, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 1/29/19 3:44 PM, Jann Horn wrote:
> >>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> If we have fixed user buffers, we can map them into the kernel when we
> >>>> setup the io_context. That avoids the need to do get_user_pages() for
> >>>> each and every IO.
> >>>>
> >>>> To utilize this feature, the application must call io_uring_register()
> >>>> after having setup an io_uring context, passing in
> >>>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
> >>>> to an iovec array, and the nr_args should contain how many iovecs the
> >>>> application wishes to map.
> >>>>
> >>>> If successful, these buffers are now mapped into the kernel, eligible
> >>>> for IO. To use these fixed buffers, the application must use the
> >>>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
> >>>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
> >>>> must point to somewhere inside the indexed buffer.
> >>>>
> >>>> The application may register buffers throughout the lifetime of the
> >>>> io_uring context. It can call io_uring_register() with
> >>>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
> >>>> buffers, and then register a new set. The application need not
> >>>> unregister buffers explicitly before shutting down the io_uring context.
> > [...]
> >>>> +       imu = &ctx->user_bufs[index];
> >>>> +       buf_addr = READ_ONCE(sqe->addr);
> >>>> +       if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
> >>>
> >>> This can wrap around if `buf_addr` or `len` is very big, right? Then
> >>> you e.g. get past the first check because `buf_addr` is sufficiently
> >>> big, and get past the second check because `buf_addr + len` wraps
> >>> around and becomes small.
> >>
> >> Good point. I wonder if we have a verification helper for something like
> >> this?
> >
> > check_add_overflow() exists, I guess that might help a bit. I don't
> > think I've seen a more specific helper for this situation.
>
> Hmm, not super appropriate. How about something ala:
>
> if (buf_addr + len < buf_addr)
>     ... overflow ...
>
> ?

Sure, sounds good.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 23:08           ` Jann Horn
@ 2019-01-29 23:14             ` Jens Axboe
  2019-01-29 23:42               ` Jann Horn
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 23:14 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 4:08 PM, Jann Horn wrote:
> On Wed, Jan 30, 2019 at 12:06 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/29/19 4:03 PM, Jann Horn wrote:
>>> On Tue, Jan 29, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 1/29/19 3:44 PM, Jann Horn wrote:
>>>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> If we have fixed user buffers, we can map them into the kernel when we
>>>>>> setup the io_context. That avoids the need to do get_user_pages() for
>>>>>> each and every IO.
>>>>>>
>>>>>> To utilize this feature, the application must call io_uring_register()
>>>>>> after having setup an io_uring context, passing in
>>>>>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
>>>>>> to an iovec array, and the nr_args should contain how many iovecs the
>>>>>> application wishes to map.
>>>>>>
>>>>>> If successful, these buffers are now mapped into the kernel, eligible
>>>>>> for IO. To use these fixed buffers, the application must use the
>>>>>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
>>>>>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
>>>>>> must point to somewhere inside the indexed buffer.
>>>>>>
>>>>>> The application may register buffers throughout the lifetime of the
>>>>>> io_uring context. It can call io_uring_register() with
>>>>>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
>>>>>> buffers, and then register a new set. The application need not
>>>>>> unregister buffers explicitly before shutting down the io_uring context.
>>> [...]
>>>>>> +       imu = &ctx->user_bufs[index];
>>>>>> +       buf_addr = READ_ONCE(sqe->addr);
>>>>>> +       if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
>>>>>
>>>>> This can wrap around if `buf_addr` or `len` is very big, right? Then
>>>>> you e.g. get past the first check because `buf_addr` is sufficiently
>>>>> big, and get past the second check because `buf_addr + len` wraps
>>>>> around and becomes small.
>>>>
>>>> Good point. I wonder if we have a verification helper for something like
>>>> this?
>>>
>>> check_add_overflow() exists, I guess that might help a bit. I don't
>>> think I've seen a more specific helper for this situation.
>>
>> Hmm, not super appropriate. How about something ala:
>>
>> if (buf_addr + len < buf_addr)
>>     ... overflow ...
>>
>> ?
> 
> Sure, sounds good.

Just folded in this incremental, which should fix all the issues outlined
in your email.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7364feebafed..d42541357969 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -751,7 +751,7 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
 {
 	size_t len = READ_ONCE(sqe->len);
 	struct io_mapped_ubuf *imu;
-	int buf_index, index;
+	unsigned index, buf_index;
 	size_t offset;
 	u64 buf_addr;
 
@@ -763,9 +763,12 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
 	if (unlikely(buf_index >= ctx->nr_user_bufs))
 		return -EFAULT;
 
-	index = array_index_nospec(buf_index, ctx->sq_entries);
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
 	imu = &ctx->user_bufs[index];
 	buf_addr = READ_ONCE(sqe->addr);
+
+	if (buf_addr + len < buf_addr)
+		return -EFAULT;
 	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
 		return -EFAULT;
 
@@ -1602,6 +1605,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
 
 static int io_sq_thread(void *data)
 {
+	struct io_uring_sqe lsqe[IO_IOPOLL_BATCH];
 	struct sqe_submit sqes[IO_IOPOLL_BATCH];
 	struct io_ring_ctx *ctx = data;
 	struct mm_struct *cur_mm = NULL;
@@ -1701,6 +1705,14 @@ static int io_sq_thread(void *data)
 		i = 0;
 		all_fixed = true;
 		do {
+			/*
+			 * Ensure sqe is stable between checking if we need
+			 * user access, and actually importing the iovec
+			 * further down the stack.
+			 */
+			memcpy(&lsqe[i], sqes[i].sqe, sizeof(lsqe[i]));
+			sqes[i].sqe = &lsqe[i];
+
 			if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
 				all_fixed = false;
 
@@ -2081,7 +2093,7 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
 	struct iovec __user *src;
 
 #ifdef CONFIG_COMPAT
-	if (in_compat_syscall()) {
+	if (ctx->compat) {
 		struct compat_iovec __user *ciovs;
 		struct compat_iovec ciov;
 
@@ -2103,7 +2115,6 @@ static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
 static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 				  unsigned nr_args)
 {
-	struct vm_area_struct **vmas = NULL;
 	struct page **pages = NULL;
 	int i, j, got_pages = 0;
 	int ret = -EINVAL;
@@ -2138,7 +2149,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		 * submitted if they are wrong.
 		 */
 		ret = -EFAULT;
-		if (!iov.iov_base)
+		if (!iov.iov_base || !iov.iov_len)
 			goto err;
 
 		/* arbitrary limit, but we need something */
@@ -2155,14 +2166,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			goto err;
 
 		if (!pages || nr_pages > got_pages) {
-			kfree(vmas);
 			kfree(pages);
 			pages = kmalloc_array(nr_pages, sizeof(struct page *),
 						GFP_KERNEL);
-			vmas = kmalloc_array(nr_pages,
-					sizeof(struct vma_area_struct *),
-					GFP_KERNEL);
-			if (!pages || !vmas) {
+			if (!pages) {
 				io_unaccount_mem(ctx, nr_pages);
 				goto err;
 			}
@@ -2176,32 +2183,18 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			goto err;
 		}
 
-		down_write(&current->mm->mmap_sem);
-		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
-						pages, vmas);
-		if (pret == nr_pages) {
-			/* don't support file backed memory */
-			for (j = 0; j < nr_pages; j++) {
-				struct vm_area_struct *vma = vmas[j];
+		down_read(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages,
+						FOLL_WRITE | FOLL_ANON, pages,
+						NULL);
+		up_read(&current->mm->mmap_sem);
 
-				if (vma->vm_file) {
-					ret = -EOPNOTSUPP;
-					break;
-				}
-			}
-		} else {
-			ret = pret < 0 ? pret : -EFAULT;
-		}
-		up_write(&current->mm->mmap_sem);
-		if (ret) {
-			/*
-			 * if we did partial map, or found file backed vmas,
-			 * release any pages we did get
-			 */
+		if (pret != nr_pages) {
 			if (pret > 0) {
 				for (j = 0; j < pret; j++)
 					put_page(pages[j]);
 			}
+			ret = pret < 0 ? pret : -EFAULT;
 			io_unaccount_mem(ctx, nr_pages);
 			goto err;
 		}
@@ -2224,12 +2217,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		imu->nr_bvecs = nr_pages;
 	}
 	kfree(pages);
-	kfree(vmas);
 	ctx->nr_user_bufs = nr_args;
 	return 0;
 err:
 	kfree(pages);
-	kfree(vmas);
 	io_sqe_buffer_unregister(ctx);
 	return ret;
 }

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 09/18] io_uring: use fget/fput_many() for file references
  2019-01-29 19:26 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
@ 2019-01-29 23:31   ` Jann Horn
  2019-01-29 23:44     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 23:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> Add a separate io_submit_state structure, to cache some of the things
> we need for IO submission.
>
> One such example is file reference batching. io_submit_state. We get as
> many references as the number of sqes we are submitting, and drop
> unused ones if we end up switching files. The assumption here is that
> we're usually only dealing with one fd, and if there are multiple,
> hopefuly they are at least somewhat ordered. Could trivially be extended
> to cover multiple fds, if needed.
>
> On the completion side we do the same thing, except this is trivially
> done just locally in io_iopoll_reap().
[...]
> +static void io_file_put(struct io_submit_state *state, struct file *file)
> +{
> +       if (!state) {
> +               fput(file);
> +       } else if (state->file) {
> +               int diff = state->has_refs - state->used_refs;
> +
> +               if (diff)
> +                       fput_many(state->file, diff);
> +               state->file = NULL;
> +       }
> +}

Hmm, this function confuses me.
The state==NULL path works as I'd expect, it calls fput() on the file.
But if `state!=NULL && state->file==NULL`, it does nothing, it never
uses `file`.
And if `state->file!=NULL`, it drops the excess bias on the file's
refcount, but it doesn't drop the current reference - and again
without even looking at `file`.

So when io_prep_rw() uses io_file_get() to grab a reference on a file
it hasn't seen before, it will acquire `ios_left` references and
actually use one of them; then if it goes through the out_fput error
path, it goes through the path for `state->file!=NULL`, drops
`ios_left-1` references (leaving the refcount elevated by 1), and
forgets about the file?
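
Concretely, with made-up numbers: if state->ios_left is 8, io_file_get()
does fget_many(fd, 8) and sets used_refs = 1; the out_fput path then
drops has_refs - used_refs == 7 references, leaving the refcount
elevated by 1 with nothing left pointing at the file.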

> +/*
> + * Get as many references to a file as we have IOs left in this submission,
> + * assuming most submissions are for one file, or at least that each file
> + * has more than one submission.
> + */
> +static struct file *io_file_get(struct io_submit_state *state, int fd)
> +{
> +       if (!state)
> +               return fget(fd);
> +
> +       if (state->file) {
> +               if (state->fd == fd) {
> +                       state->used_refs++;
> +                       state->ios_left--;
> +                       return state->file;
> +               }
> +               io_file_put(state, NULL);
> +       }
> +       state->file = fget_many(fd, state->ios_left);
> +       if (!state->file)
> +               return NULL;
> +
> +       state->fd = fd;
> +       state->has_refs = state->ios_left;
> +       state->used_refs = 1;
> +       state->ios_left--;
> +       return state->file;
> +}
> +
>  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> -                     bool force_nonblock)
> +                     bool force_nonblock, struct io_submit_state *state)
>  {
>         struct io_ring_ctx *ctx = req->ctx;
>         struct kiocb *kiocb = &req->rw;
> @@ -487,7 +560,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>         int fd, ret;
>
>         fd = READ_ONCE(sqe->fd);
> -       kiocb->ki_filp = fget(fd);
> +       kiocb->ki_filp = io_file_get(state, fd);
>         if (unlikely(!kiocb->ki_filp))
>                 return -EBADF;
>         kiocb->ki_pos = READ_ONCE(sqe->off);
> @@ -528,7 +601,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>         }
>         return 0;
>  out_fput:
> -       fput(kiocb->ki_filp);
> +       io_file_put(state, kiocb->ki_filp);
>         return ret;
>  }
[...]
> +static void io_submit_state_start(struct io_submit_state *state,
> +                                 struct io_ring_ctx *ctx, unsigned max_ios)

There are various places in your series where you use raw "unsigned"
instead of "unsigned int"; when I run your tree through checkpatch.pl,
it complains about that and a few other things. Please fix the
checkpatch warnings (except for warnings where you know that they
shouldn't apply here for some reason).

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 23:14             ` Jens Axboe
@ 2019-01-29 23:42               ` Jann Horn
  2019-01-29 23:51                 ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29 23:42 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On Wed, Jan 30, 2019 at 12:14 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/29/19 4:08 PM, Jann Horn wrote:
> > On Wed, Jan 30, 2019 at 12:06 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> On 1/29/19 4:03 PM, Jann Horn wrote:
> >>> On Tue, Jan 29, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 1/29/19 3:44 PM, Jann Horn wrote:
> >>>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>> If we have fixed user buffers, we can map them into the kernel when we
> >>>>>> setup the io_context. That avoids the need to do get_user_pages() for
> >>>>>> each and every IO.
> >>>>>>
> >>>>>> To utilize this feature, the application must call io_uring_register()
> >>>>>> after having setup an io_uring context, passing in
> >>>>>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
> >>>>>> to an iovec array, and the nr_args should contain how many iovecs the
> >>>>>> application wishes to map.
> >>>>>>
> >>>>>> If successful, these buffers are now mapped into the kernel, eligible
> >>>>>> for IO. To use these fixed buffers, the application must use the
> >>>>>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
> >>>>>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len
> >>>>>> must point to somewhere inside the indexed buffer.
> >>>>>>
> >>>>>> The application may register buffers throughout the lifetime of the
> >>>>>> io_uring context. It can call io_uring_register() with
> >>>>>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
> >>>>>> buffers, and then register a new set. The application need not
> >>>>>> unregister buffers explicitly before shutting down the io_uring context.
[...]
> Just folded in this incremental, which should fix all the issues outlined
> in your email.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 7364feebafed..d42541357969 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
[...]
> @@ -1602,6 +1605,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
>
>  static int io_sq_thread(void *data)
>  {
> +       struct io_uring_sqe lsqe[IO_IOPOLL_BATCH];
>         struct sqe_submit sqes[IO_IOPOLL_BATCH];
>         struct io_ring_ctx *ctx = data;
>         struct mm_struct *cur_mm = NULL;
> @@ -1701,6 +1705,14 @@ static int io_sq_thread(void *data)
>                 i = 0;
>                 all_fixed = true;
>                 do {
> +                       /*
> +                        * Ensure sqe is stable between checking if we need
> +                        * user access, and actually importing the iovec
> +                        * further down the stack.
> +                        */
> +                       memcpy(&lsqe[i], sqes[i].sqe, sizeof(lsqe[i]));
> +                       sqes[i].sqe = &lsqe[i];
> +

What if io_submit_sqe() gets an -EAGAIN from __io_submit_sqe() and
queues io_sq_wq_submit_work()? Could that cause io_sq_wq_submit_work()
to read from io_sq_thread()'s lsqe[i] while io_sq_thread() is already
in the next loop iteration of the outer loop and has copied new data
into lsqe[i]?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 09/18] io_uring: use fget/fput_many() for file references
  2019-01-29 23:31   ` Jann Horn
@ 2019-01-29 23:44     ` Jens Axboe
  2019-01-30 15:33       ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 23:44 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 4:31 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>> Add a separate io_submit_state structure, to cache some of the things
>> we need for IO submission.
>>
>> One such example is file reference batching. io_submit_state. We get as
>> many references as the number of sqes we are submitting, and drop
>> unused ones if we end up switching files. The assumption here is that
>> we're usually only dealing with one fd, and if there are multiple,
>> hopefuly they are at least somewhat ordered. Could trivially be extended
>> to cover multiple fds, if needed.
>>
>> On the completion side we do the same thing, except this is trivially
>> done just locally in io_iopoll_reap().
> [...]
>> +static void io_file_put(struct io_submit_state *state, struct file *file)
>> +{
>> +       if (!state) {
>> +               fput(file);
>> +       } else if (state->file) {
>> +               int diff = state->has_refs - state->used_refs;
>> +
>> +               if (diff)
>> +                       fput_many(state->file, diff);
>> +               state->file = NULL;
>> +       }
>> +}
> 
> Hmm, this function confuses me.
> The state==NULL path works as I'd expect, it calls fput() on the file.
> But if `state!=NULL && state->file==NULL`, it does nothing, it never
> uses `file`.
> And if `state->file!=NULL`, it drops the excess bias on the file's
> refcount, but it doesn't drop the current reference - and again
> without even looking at `file`.
> 
> So when io_prep_rw() uses io_file_get() to grab a reference on a file
> it hasn't seen before, it will acquire `ios_left` references and
> actually use one of them; then if it goes through the out_fput error
> path, it goes through the path for `state->file!=NULL`, drops
> `ios_left-1` references (leaving the refcount elevated by 1), and
> forgets about the file?

I'll take a look, it's not impossible there's an off-by-one there.
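
One possible shape for the error path, purely as an untested sketch
(io_file_unget() is a made-up name, and this assumes the leftover batch
references still get dropped via io_file_put(state, NULL) when the
submit state is wound down):

        static void io_file_unget(struct io_submit_state *state,
                                  struct file *file)
        {
                if (state && state->file == file) {
                        /* hand this request's reference back to the batch */
                        state->used_refs--;
                        return;
                }
                fput(file);
        }

with out_fput calling io_file_unget() instead of io_file_put().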

>> +/*
>> + * Get as many references to a file as we have IOs left in this submission,
>> + * assuming most submissions are for one file, or at least that each file
>> + * has more than one submission.
>> + */
>> +static struct file *io_file_get(struct io_submit_state *state, int fd)
>> +{
>> +       if (!state)
>> +               return fget(fd);
>> +
>> +       if (state->file) {
>> +               if (state->fd == fd) {
>> +                       state->used_refs++;
>> +                       state->ios_left--;
>> +                       return state->file;
>> +               }
>> +               io_file_put(state, NULL);
>> +       }
>> +       state->file = fget_many(fd, state->ios_left);
>> +       if (!state->file)
>> +               return NULL;
>> +
>> +       state->fd = fd;
>> +       state->has_refs = state->ios_left;
>> +       state->used_refs = 1;
>> +       state->ios_left--;
>> +       return state->file;
>> +}
>> +
>>  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>> -                     bool force_nonblock)
>> +                     bool force_nonblock, struct io_submit_state *state)
>>  {
>>         struct io_ring_ctx *ctx = req->ctx;
>>         struct kiocb *kiocb = &req->rw;
>> @@ -487,7 +560,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>         int fd, ret;
>>
>>         fd = READ_ONCE(sqe->fd);
>> -       kiocb->ki_filp = fget(fd);
>> +       kiocb->ki_filp = io_file_get(state, fd);
>>         if (unlikely(!kiocb->ki_filp))
>>                 return -EBADF;
>>         kiocb->ki_pos = READ_ONCE(sqe->off);
>> @@ -528,7 +601,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>         }
>>         return 0;
>>  out_fput:
>> -       fput(kiocb->ki_filp);
>> +       io_file_put(state, kiocb->ki_filp);
>>         return ret;
>>  }
> [...]
>> +static void io_submit_state_start(struct io_submit_state *state,
>> +                                 struct io_ring_ctx *ctx, unsigned max_ios)
> 
> There are various places in your series where you use raw "unsigned"
> instead of "unsigned int"; when I run your tree through checkpatch.pl,
> it complains about that and a few other things. Please fix the
> checkpatch warnings (except for warnings where you know that they
> shouldn't apply here for some reason).

Using unsigned is just fine; it's the same thing. checkpatch.pl
complains about a lot of stuff that doesn't matter, and that's one of
them. I don't mind fixing valid warnings, but this particular one is
just noise.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29 23:42               ` Jann Horn
@ 2019-01-29 23:51                 ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29 23:51 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 4:42 PM, Jann Horn wrote:
> On Wed, Jan 30, 2019 at 12:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/29/19 4:08 PM, Jann Horn wrote:
>>> On Wed, Jan 30, 2019 at 12:06 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 1/29/19 4:03 PM, Jann Horn wrote:
>>>>> On Tue, Jan 29, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> On 1/29/19 3:44 PM, Jann Horn wrote:
>>>>>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>> If we have fixed user buffers, we can map them into the kernel when we
>>>>>>>> setup the io_context. That avoids the need to do get_user_pages() for
>>>>>>>> each and every IO.
>>>>>>>>
>>>>>>>> To utilize this feature, the application must call io_uring_register()
>>>>>>>> after having setup an io_uring context, passing in
>>>>>>>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
>>>>>>>> to an iovec array, and the nr_args should contain how many iovecs the
>>>>>>>> application wishes to map.
>>>>>>>>
>>>>>>>> If successful, these buffers are now mapped into the kernel, eligible
>>>>>>>> for IO. To use these fixed buffers, the application must use the
>>>>>>>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
>>>>>>>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
>>>>>>>> must point to somewhere inside the indexed buffer.
>>>>>>>>
>>>>>>>> The application may register buffers throughout the lifetime of the
>>>>>>>> io_uring context. It can call io_uring_register() with
>>>>>>>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
>>>>>>>> buffers, and then register a new set. The application need not
>>>>>>>> unregister buffers explicitly before shutting down the io_uring context.
> [...]
>> Just folded in this incremental, which should fix all the issues outlined
>> in your email.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 7364feebafed..d42541357969 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
> [...]
>> @@ -1602,6 +1605,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
>>
>>  static int io_sq_thread(void *data)
>>  {
>> +       struct io_uring_sqe lsqe[IO_IOPOLL_BATCH];
>>         struct sqe_submit sqes[IO_IOPOLL_BATCH];
>>         struct io_ring_ctx *ctx = data;
>>         struct mm_struct *cur_mm = NULL;
>> @@ -1701,6 +1705,14 @@ static int io_sq_thread(void *data)
>>                 i = 0;
>>                 all_fixed = true;
>>                 do {
>> +                       /*
>> +                        * Ensure sqe is stable between checking if we need
>> +                        * user access, and actually importing the iovec
>> +                        * further down the stack.
>> +                        */
>> +                       memcpy(&lsqe[i], sqes[i].sqe, sizeof(lsqe[i]));
>> +                       sqes[i].sqe = &lsqe[i];
>> +
> 
> What if io_submit_sqe() gets an -EAGAIN from __io_submit_sqe() and
> queues io_sq_wq_submit_work()? Could that cause io_sq_wq_submit_work()
> to read from io_sq_thread()'s lsqe[i] while io_sq_thread() is already
> in the next loop iteration of the outer loop and has copied new data
> into lsqe[i]?

Hmm, yes. I think we'll need to embed both a pointer to an sqe and an
actual sqe copy in sqe_submit. The former can be used in the fast path,
while the latter will be our stable copy for the offload path in
io_sq_thread() and io_sq_wq_submit_work().
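
Something along these lines, as a rough sketch (name and layout invented,
not necessarily the final shape):

struct sqe_submit_sketch {
	const struct io_uring_sqe	*sqe;	/* fast path: points at the SQ ring entry */
	struct io_uring_sqe		copy;	/* offload path: stable copy for io_sq_thread()
						 * and io_sq_wq_submit_work() */
};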

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-01-29 19:26 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
@ 2019-01-30  1:29   ` Jann Horn
  2019-01-30 15:35     ` Jens Axboe
  2019-02-04  2:56     ` Al Viro
  0 siblings, 2 replies; 76+ messages in thread
From: Jann Horn @ 2019-01-30  1:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, avi, Al Viro,
	linux-fsdevel

On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> We normally have to fget/fput for each IO we do on a file. Even with
> the batching we do, the cost of the atomic inc/dec of the file usage
> count adds up.
>
> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> for the io_uring_register(2) system call. The arguments passed in must
> be an array of __s32 holding file descriptors, and nr_args should hold
> the number of file descriptors the application wishes to pin for the
> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
> called).
>
> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> to the index in the array passed in to IORING_REGISTER_FILES.
>
> Files are automatically unregistered when the io_uring context is
> torn down. An application need only unregister if it wishes to
> register a new set of fds.
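
(For reference, a hypothetical userspace sketch of the flow described
above; io_uring_register() here stands in for whatever syscall wrapper
the application uses, and the constants are taken from the uapi header
added by this patch:)

#include <stdint.h>

#define IORING_REGISTER_FILES	2
#define IOSQE_FIXED_FILE	(1U << 0)

/* assumed wrapper around the io_uring_register(2) system call */
extern int io_uring_register(int ring_fd, unsigned int opcode,
			     void *arg, unsigned int nr_args);

static int register_fixed_files(int ring_fd, int fd_a, int fd_b)
{
	int32_t fds[2] = { fd_a, fd_b };

	/*
	 * After this, an sqe with IOSQE_FIXED_FILE set in sqe->flags uses
	 * sqe->fd as an index into this array (0 -> fd_a, 1 -> fd_b)
	 * instead of a real file descriptor.
	 */
	return io_uring_register(ring_fd, IORING_REGISTER_FILES, fds, 2);
}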

Crazy idea:

Taking a step back, at a high level, basically this patch creates sort
of the same difference that you get when you compare the following
scenarios for normal multithreaded I/O in userspace:

===========================================================
~/tests/fdget_perf$ cat fdget_perf.c
#define _GNU_SOURCE
#include <sys/wait.h>
#include <sched.h>
#include <unistd.h>
#include <stdbool.h>
#include <string.h>
#include <err.h>
#include <signal.h>
#include <sys/eventfd.h>
#include <stdio.h>

// two different physical processors on my machine
#define CORE_A 0
#define CORE_B 14

static void pin_to_core(int coreid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(coreid, &set);
  if (sched_setaffinity(0, sizeof(cpu_set_t), &set))
    err(1, "sched_setaffinity");
}

static int fd = -1;

static volatile int time_over = 0;
static void alarm_handler(int sig) { time_over = 1; }
static void run_stuff(void) {
  unsigned long long iterations = 0;
  if (signal(SIGALRM, alarm_handler) == SIG_ERR) err(1, "signal");
  alarm(10);
  while (1) {
    uint64_t val;
    read(fd, &val, sizeof(val));
    if (time_over) {
      printf("iterations = 0x%llx\n", iterations);
      return;
    }
    iterations++;
  }
}

static int child_fn(void *dummy) {
  pin_to_core(CORE_B);
  run_stuff();
  return 0;
}

static char child_stack[1024*1024];

int main(int argc, char **argv) {
  fd = eventfd(0, EFD_NONBLOCK);
  if (fd == -1) err(1, "eventfd");

  if (argc != 2) errx(1, "bad usage");
  int flags = SIGCHLD;
  if (strcmp(argv[1], "shared") == 0) {
    flags |= CLONE_FILES;
  } else if (strcmp(argv[1], "cloned") == 0) {
    /* nothing */
  } else {
    errx(1, "bad usage");
  }
  pid_t child = clone(child_fn, child_stack+sizeof(child_stack), flags, NULL);
  if (child == -1) err(1, "clone");

  pin_to_core(CORE_A);
  run_stuff();
  int status;
  if (wait(&status) != child) err(1, "wait");
  return 0;
}
~/tests/fdget_perf$ gcc -Wall -o fdget_perf fdget_perf.c
~/tests/fdget_perf$ ./fdget_perf shared
iterations = 0x8d3010
iterations = 0x92d894
~/tests/fdget_perf$ ./fdget_perf cloned
iterations = 0xad3bbd
iterations = 0xb08838
~/tests/fdget_perf$ ./fdget_perf shared
iterations = 0x8cc340
iterations = 0x8e4e64
~/tests/fdget_perf$ ./fdget_perf cloned
iterations = 0xada5f3
iterations = 0xb04b6f
===========================================================

This kinda makes me wonder whether this is really something that
should be implemented specifically for the io_uring API, or whether it
would make sense to somehow handle part of this in the generic VFS
code and give the user the ability to prepare a new files_struct that
can then be transferred to the worker thread, or something like
that... I'm not sure whether there's a particularly clean way to do
that though.

Or perhaps you could add a userspace API for marking file descriptor
table entries as "has percpu refcounting" somehow, with one percpu
refcount per files_struct and one bit per fd, allocated when percpu
refcounting is activated for the files_struct the first time, or
something like that...
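
As a rough illustration of that last idea (purely hypothetical, no such
kernel API exists today):

#include <linux/percpu-refcount.h>

/* one instance per files_struct, allocated on first opt-in */
struct files_percpu_sketch {
	struct percpu_ref	refs;		/* shared by all opted-in fds */
	unsigned long		*fast_fd_bits;	/* one bit per fd that opted in */
};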

> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c                 | 138 +++++++++++++++++++++++++++++-----
>  include/uapi/linux/io_uring.h |   9 ++-
>  2 files changed, 127 insertions(+), 20 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 17c869f3ea2f..13c3f8212815 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -100,6 +100,14 @@ struct io_ring_ctx {
>                 struct fasync_struct    *cq_fasync;
>         } ____cacheline_aligned_in_smp;
>
> +       /*
> +        * If used, fixed file set. Writers must ensure that ->refs is dead,
> +        * readers must ensure that ->refs is alive as long as the file* is
> +        * used. Only updated through io_uring_register(2).
> +        */
> +       struct file             **user_files;
> +       unsigned                nr_user_files;
> +
>         /* if used, fixed mapped user buffers */
>         unsigned                nr_user_bufs;
>         struct io_mapped_ubuf   *user_bufs;
> @@ -136,6 +144,7 @@ struct io_kiocb {
>         unsigned int            flags;
>  #define REQ_F_FORCE_NONBLOCK   1       /* inline submission attempt */
>  #define REQ_F_IOPOLL_COMPLETED 2       /* polled IO has completed */
> +#define REQ_F_FIXED_FILE       4       /* ctx owns file */
>         u64                     user_data;
>         u64                     error;
>
> @@ -350,15 +359,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
>                  * Batched puts of the same file, to avoid dirtying the
>                  * file usage count multiple times, if avoidable.
>                  */
> -               if (!file) {
> -                       file = req->rw.ki_filp;
> -                       file_count = 1;
> -               } else if (file == req->rw.ki_filp) {
> -                       file_count++;
> -               } else {
> -                       fput_many(file, file_count);
> -                       file = req->rw.ki_filp;
> -                       file_count = 1;
> +               if (!(req->flags & REQ_F_FIXED_FILE)) {
> +                       if (!file) {
> +                               file = req->rw.ki_filp;
> +                               file_count = 1;
> +                       } else if (file == req->rw.ki_filp) {
> +                               file_count++;
> +                       } else {
> +                               fput_many(file, file_count);
> +                               file = req->rw.ki_filp;
> +                               file_count = 1;
> +                       }
>                 }
>
>                 if (to_free == ARRAY_SIZE(reqs))
> @@ -491,13 +502,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
>         }
>  }
>
> +static void io_fput(struct io_kiocb *req)
> +{
> +       if (!(req->flags & REQ_F_FIXED_FILE))
> +               fput(req->rw.ki_filp);
> +}
> +
>  static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
>  {
>         struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
>
>         kiocb_end_write(kiocb);
>
> -       fput(kiocb->ki_filp);
> +       io_fput(req);
>         io_cqring_add_event(req->ctx, req->user_data, res, 0);
>         io_free_req(req);
>  }
> @@ -596,11 +613,22 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  {
>         struct io_ring_ctx *ctx = req->ctx;
>         struct kiocb *kiocb = &req->rw;
> -       unsigned ioprio;
> +       unsigned ioprio, flags;
>         int fd, ret;
>
> +       flags = READ_ONCE(sqe->flags);
>         fd = READ_ONCE(sqe->fd);
> -       kiocb->ki_filp = io_file_get(state, fd);
> +
> +       if (flags & IOSQE_FIXED_FILE) {
> +               if (unlikely(!ctx->user_files ||
> +                   (unsigned) fd >= ctx->nr_user_files))
> +                       return -EBADF;
> +               kiocb->ki_filp = ctx->user_files[fd];
> +               req->flags |= REQ_F_FIXED_FILE;
> +       } else {
> +               kiocb->ki_filp = io_file_get(state, fd);
> +       }
> +
>         if (unlikely(!kiocb->ki_filp))
>                 return -EBADF;
>         kiocb->ki_pos = READ_ONCE(sqe->off);
> @@ -641,7 +669,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>         }
>         return 0;
>  out_fput:
> -       io_file_put(state, kiocb->ki_filp);
> +       if (!(flags & IOSQE_FIXED_FILE))
> +               io_file_put(state, kiocb->ki_filp);
>         return ret;
>  }
>
> @@ -765,7 +794,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>         kfree(iovec);
>  out_fput:
>         if (unlikely(ret))
> -               fput(file);
> +               io_fput(req);
>         return ret;
>  }
>
> @@ -820,7 +849,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>         kfree(iovec);
>  out_fput:
>         if (unlikely(ret))
> -               fput(file);
> +               io_fput(req);
>         return ret;
>  }
>
> @@ -846,7 +875,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>         loff_t sqe_off = READ_ONCE(sqe->off);
>         loff_t sqe_len = READ_ONCE(sqe->len);
>         loff_t end = sqe_off + sqe_len;
> -       unsigned fsync_flags;
> +       unsigned fsync_flags, flags;
>         struct file *file;
>         int ret, fd;
>
> @@ -864,14 +893,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                 return -EINVAL;
>
>         fd = READ_ONCE(sqe->fd);
> -       file = fget(fd);
> +       flags = READ_ONCE(sqe->flags);
> +
> +       if (flags & IOSQE_FIXED_FILE) {
> +               if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
> +                       return -EBADF;
> +               file = ctx->user_files[fd];
> +       } else {
> +               file = fget(fd);
> +       }
>         if (unlikely(!file))
>                 return -EBADF;
>
>         ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX,
>                                 fsync_flags & IORING_FSYNC_DATASYNC);
>
> -       fput(file);
> +       if (!(flags & IOSQE_FIXED_FILE))
> +               fput(file);
>         io_cqring_add_event(ctx, sqe->user_data, ret, 0);
>         io_free_req(req);
>         return 0;
> @@ -1002,7 +1040,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
>         ssize_t ret;
>
>         /* enforce forwards compatibility on users */
> -       if (unlikely(s->sqe->flags))
> +       if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
>                 return -EINVAL;
>
>         req = io_get_req(ctx, state);
> @@ -1220,6 +1258,58 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
>         return submitted ? submitted : ret;
>  }
>
> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +       int i;
> +
> +       if (!ctx->user_files)
> +               return -ENXIO;
> +
> +       for (i = 0; i < ctx->nr_user_files; i++)
> +               fput(ctx->user_files[i]);
> +
> +       kfree(ctx->user_files);
> +       ctx->user_files = NULL;
> +       ctx->nr_user_files = 0;
> +       return 0;
> +}
> +
> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
> +                                unsigned nr_args)
> +{
> +       __s32 __user *fds = (__s32 __user *) arg;
> +       int fd, ret = 0;
> +       unsigned i;
> +
> +       if (ctx->user_files)
> +               return -EBUSY;
> +       if (!nr_args)
> +               return -EINVAL;
> +
> +       ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
> +       if (!ctx->user_files)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nr_args; i++) {
> +               ret = -EFAULT;
> +               if (copy_from_user(&fd, &fds[i], sizeof(fd)))
> +                       break;
> +
> +               ctx->user_files[i] = fget(fd);
> +
> +               ret = -EBADF;
> +               if (!ctx->user_files[i])
> +                       break;
> +               ctx->nr_user_files++;
> +               ret = 0;
> +       }
> +
> +       if (ret)
> +               io_sqe_files_unregister(ctx);
> +
> +       return ret;
> +}
> +
>  static int io_sq_offload_start(struct io_ring_ctx *ctx)
>  {
>         int ret;
> @@ -1509,6 +1599,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>
>         io_iopoll_reap_events(ctx);
>         io_sqe_buffer_unregister(ctx);
> +       io_sqe_files_unregister(ctx);
>
>         io_mem_free(ctx->sq_ring);
>         io_mem_free(ctx->sq_sqes);
> @@ -1806,6 +1897,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>                         break;
>                 ret = io_sqe_buffer_unregister(ctx);
>                 break;
> +       case IORING_REGISTER_FILES:
> +               ret = io_sqe_files_register(ctx, arg, nr_args);
> +               break;
> +       case IORING_UNREGISTER_FILES:
> +               ret = -EINVAL;
> +               if (arg || nr_args)
> +                       break;
> +               ret = io_sqe_files_unregister(ctx);
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 16c423d74f2e..3e79feb34a9c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -16,7 +16,7 @@
>   */
>  struct io_uring_sqe {
>         __u8    opcode;         /* type of operation for this sqe */
> -       __u8    flags;          /* as of now unused */
> +       __u8    flags;          /* IOSQE_ flags */
>         __u16   ioprio;         /* ioprio for the request */
>         __s32   fd;             /* file descriptor to do IO on */
>         __u64   off;            /* offset into file */
> @@ -33,6 +33,11 @@ struct io_uring_sqe {
>         };
>  };
>
> +/*
> + * sqe->flags
> + */
> +#define IOSQE_FIXED_FILE       (1U << 0)       /* use fixed fileset */
> +
>  /*
>   * io_uring_setup() flags
>   */
> @@ -112,5 +117,7 @@ struct io_uring_params {
>   */
>  #define IORING_REGISTER_BUFFERS                0
>  #define IORING_UNREGISTER_BUFFERS      1
> +#define IORING_REGISTER_FILES          2
> +#define IORING_UNREGISTER_FILES                3
>
>  #endif
> --
> 2.17.1
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 09/18] io_uring: use fget/fput_many() for file references
  2019-01-29 23:44     ` Jens Axboe
@ 2019-01-30 15:33       ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-30 15:33 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity

On 1/29/19 4:44 PM, Jens Axboe wrote:
> On 1/29/19 4:31 PM, Jann Horn wrote:
>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> Add a separate io_submit_state structure, to cache some of the things
>>> we need for IO submission.
>>>
>>> One such example is file reference batching via io_submit_state. We get
>>> as many references as the number of sqes we are submitting, and drop
>>> unused ones if we end up switching files. The assumption here is that
>>> we're usually only dealing with one fd, and if there are multiple,
>>> hopefully they are at least somewhat ordered. Could trivially be extended
>>> to cover multiple fds, if needed.
>>>
>>> On the completion side we do the same thing, except this is trivially
>>> done just locally in io_iopoll_reap().
>> [...]
>>> +static void io_file_put(struct io_submit_state *state, struct file *file)
>>> +{
>>> +       if (!state) {
>>> +               fput(file);
>>> +       } else if (state->file) {
>>> +               int diff = state->has_refs - state->used_refs;
>>> +
>>> +               if (diff)
>>> +                       fput_many(state->file, diff);
>>> +               state->file = NULL;
>>> +       }
>>> +}
>>
>> Hmm, this function confuses me.
>> The state==NULL path works as I'd expect, it calls fput() on the file.
>> But if `state!=NULL && state->file==NULL`, it does nothing, it never
>> uses `file`.
>> And if `state->file!=NULL`, it drops the excess bias on the file's
>> refcount, but it doesn't drop the current reference - and again
>> without even looking at `file`.
>>
>> So when io_prep_rw() uses io_file_get() to grab a reference on a file
>> it hasn't seen before, it will acquire `ios_left` references and
>> actually use one of them; then if it goes through the out_fput error
>> path, it goes through the path for `state->file!=NULL`, drops
>> `ios_left-1` references (leaving the refcount elevated by 1), and
>> forgets about the file?
> 
> I'll take a look, it's not impossible there's an off-by-one there.

Fixed it!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-01-30  1:29   ` Jann Horn
@ 2019-01-30 15:35     ` Jens Axboe
  2019-02-04  2:56     ` Al Viro
  1 sibling, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-30 15:35 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, avi, Al Viro,
	linux-fsdevel

On 1/29/19 6:29 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>> We normally have to fget/fput for each IO we do on a file. Even with
>> the batching we do, the cost of the atomic inc/dec of the file usage
>> count adds up.
>>
>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>> for the io_uring_register(2) system call. The arguments passed in must
>> be an array of __s32 holding file descriptors, and nr_args should hold
>> the number of file descriptors the application wishes to pin for the
>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>> called).
>>
>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>> to the index in the array passed in to IORING_REGISTER_FILES.
>>
>> Files are automatically unregistered when the io_uring context is
>> torn down. An application need only unregister if it wishes to
>> register a new set of fds.
> 
> Crazy idea:
> 
> Taking a step back, at a high level, basically this patch creates sort
> of the same difference that you get when you compare the following
> scenarios for normal multithreaded I/O in userspace:
> 
> ===========================================================
> ~/tests/fdget_perf$ cat fdget_perf.c
> #define _GNU_SOURCE
> #include <sys/wait.h>
> #include <sched.h>
> #include <unistd.h>
> #include <stdbool.h>
> #include <string.h>
> #include <err.h>
> #include <signal.h>
> #include <sys/eventfd.h>
> #include <stdio.h>
> 
> // two different physical processors on my machine
> #define CORE_A 0
> #define CORE_B 14
> 
> static void pin_to_core(int coreid) {
>   cpu_set_t set;
>   CPU_ZERO(&set);
>   CPU_SET(coreid, &set);
>   if (sched_setaffinity(0, sizeof(cpu_set_t), &set))
>     err(1, "sched_setaffinity");
> }
> 
> static int fd = -1;
> 
> static volatile int time_over = 0;
> static void alarm_handler(int sig) { time_over = 1; }
> static void run_stuff(void) {
>   unsigned long long iterations = 0;
>   if (signal(SIGALRM, alarm_handler) == SIG_ERR) err(1, "signal");
>   alarm(10);
>   while (1) {
>     uint64_t val;
>     read(fd, &val, sizeof(val));
>     if (time_over) {
>       printf("iterations = 0x%llx\n", iterations);
>       return;
>     }
>     iterations++;
>   }
> }
> 
> static int child_fn(void *dummy) {
>   pin_to_core(CORE_B);
>   run_stuff();
>   return 0;
> }
> 
> static char child_stack[1024*1024];
> 
> int main(int argc, char **argv) {
>   fd = eventfd(0, EFD_NONBLOCK);
>   if (fd == -1) err(1, "eventfd");
> 
>   if (argc != 2) errx(1, "bad usage");
>   int flags = SIGCHLD;
>   if (strcmp(argv[1], "shared") == 0) {
>     flags |= CLONE_FILES;
>   } else if (strcmp(argv[1], "cloned") == 0) {
>     /* nothing */
>   } else {
>     errx(1, "bad usage");
>   }
>   pid_t child = clone(child_fn, child_stack+sizeof(child_stack), flags, NULL);
>   if (child == -1) err(1, "clone");
> 
>   pin_to_core(CORE_A);
>   run_stuff();
>   int status;
>   if (wait(&status) != child) err(1, "wait");
>   return 0;
> }
> ~/tests/fdget_perf$ gcc -Wall -o fdget_perf fdget_perf.c
> ~/tests/fdget_perf$ ./fdget_perf shared
> iterations = 0x8d3010
> iterations = 0x92d894
> ~/tests/fdget_perf$ ./fdget_perf cloned
> iterations = 0xad3bbd
> iterations = 0xb08838
> ~/tests/fdget_perf$ ./fdget_perf shared
> iterations = 0x8cc340
> iterations = 0x8e4e64
> ~/tests/fdget_perf$ ./fdget_perf cloned
> iterations = 0xada5f3
> iterations = 0xb04b6f
> ===========================================================
> 
> This kinda makes me wonder whether this is really something that
> should be implemented specifically for the io_uring API, or whether it
> would make sense to somehow handle part of this in the generic VFS
> code and give the user the ability to prepare a new files_struct that
> can then be transferred to the worker thread, or something like
> that... I'm not sure whether there's a particularly clean way to do
> that though.
> 
> Or perhaps you could add a userspace API for marking file descriptor
> table entries as "has percpu refcounting" somehow, with one percpu
> refcount per files_struct and one bit per fd, allocated when percpu
> refcounting is activated for the files_struct the first time, or
> something like that...

There's undoubtedly a win from NOT sharing. I'm not sure how to do this
cleanly in a generalized fashion; it's easier (and a better fit) to do
it for specific cases like io_uring here. If others want to go down
that path, io_uring could always be adapted to use that infrastructure.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-01-30  1:29   ` Jann Horn
  2019-01-30 15:35     ` Jens Axboe
@ 2019-02-04  2:56     ` Al Viro
  2019-02-05  2:19       ` Jens Axboe
  1 sibling, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-04  2:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Jens Axboe, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
> > We normally have to fget/fput for each IO we do on a file. Even with
> > the batching we do, the cost of the atomic inc/dec of the file usage
> > count adds up.
> >
> > This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> > for the io_uring_register(2) system call. The arguments passed in must
> > be an array of __s32 holding file descriptors, and nr_args should hold
> > the number of file descriptors the application wishes to pin for the
> > duration of the io_uring context (or until IORING_UNREGISTER_FILES is
> > called).
> >
> > When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> > member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> > to the index in the array passed in to IORING_REGISTER_FILES.
> >
> > Files are automatically unregistered when the io_uring context is
> > torn down. An application need only unregister if it wishes to
> > register a new set of fds.
> 
> Crazy idea:
> 
> Taking a step back, at a high level, basically this patch creates sort
> of the same difference that you get when you compare the following
> scenarios for normal multithreaded I/O in userspace:

> This kinda makes me wonder whether this is really something that
> should be implemented specifically for the io_uring API, or whether it
> would make sense to somehow handle part of this in the generic VFS
> code and give the user the ability to prepare a new files_struct that
> can then be transferred to the worker thread, or something like
> that... I'm not sure whether there's a particularly clean way to do
> that though.

Using files_struct for that opens a can of worms you really don't
want to touch.

Consider the following scenario with any variant of this interface:
	* create io_uring fd.
	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
	* add the descriptor of that AF_UNIX socket to your fd
	* close AF_UNIX fd, close io_uring fd.
Voila - you've got a shiny leak.  No ->release() is called for
anyone (and you really don't want to do that on ->flush(), because
otherwise a library helper doing e.g. system("/bin/date") will tear
down all the io_uring in your process).  The socket is held by
the reference you've stashed into io_uring (whichever way you do
that).  io_uring is held by the reference you've stashed into
SCM_RIGHTS datagram in queue of the socket.

No matter what, you need net/unix/garbage.c to be aware of that stuff.
And getting files_struct lifetime mixed into that would be beyond
any reason.

The only reason for doing that as a descriptor table would be
avoiding the cost of fget() in whatever uses it, right?  Since
those are *not* the normal syscalls (and fdget() really should not
be used anywhere other than the very top of syscall's call chain -
that's another reason why tossing file_struct around like that
is insane) and since the benefit is all due to the fact that it's
*NOT* shared, *NOT* modified in parallel, etc., allowing us to
treat file references as stable... why the hell use the descriptor
tables at all?

All you need is an array of struct file *, explicitly populated.
With net/unix/garbage.c aware of such beasts.  Guess what?  We
do have such an object already.  The one net/unix/garbage.c is
working with.  SCM_RIGHTS datagrams, that is.

IOW, can't we give those io_uring descriptors associated struct
unix_sock?  No socket descriptors, no struct socket (probably),
just the AF_UNIX-specific part thereof.  Then teach
unix_inflight()/unix_notinflight() about getting unix_sock out
of these guys (incidentally, both would seem to benefit from
_not_ touching unix_gc_lock in case when there's no unix_sock
attached to file we are dealing with - I might be missing
something very subtle about barriers there, but it doesn't
look likely).

And make that (i.e. registering the descriptors) mandatory.
Hell, combine that with creating io_uring fd, if we really
care about the syscall count.  Benefits:
	* no file_struct refcount wanking
	* no fget()/fput() (conditional, at that) from kernel
threads
	* no CLOEXEC-dependent anything; just the teardown
on the final fput(), whichever way it comes.
	* no fun with duelling garbage collectors.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-04  2:56     ` Al Viro
@ 2019-02-05  2:19       ` Jens Axboe
  2019-02-05 17:57         ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-02-05  2:19 UTC (permalink / raw)
  To: Al Viro, Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, avi, linux-fsdevel

On 2/3/19 7:56 PM, Al Viro wrote:
> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> We normally have to fget/fput for each IO we do on a file. Even with
>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>> count adds up.
>>>
>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>> for the io_uring_register(2) system call. The arguments passed in must
>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>> the number of file descriptors the application wishes to pin for the
>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>> called).
>>>
>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>
>>> Files are automatically unregistered when the io_uring context is
>>> torn down. An application need only unregister if it wishes to
>>> register a new set of fds.
>>
>> Crazy idea:
>>
>> Taking a step back, at a high level, basically this patch creates sort
>> of the same difference that you get when you compare the following
>> scenarios for normal multithreaded I/O in userspace:
> 
>> This kinda makes me wonder whether this is really something that
>> should be implemented specifically for the io_uring API, or whether it
>> would make sense to somehow handle part of this in the generic VFS
>> code and give the user the ability to prepare a new files_struct that
>> can then be transferred to the worker thread, or something like
>> that... I'm not sure whether there's a particularly clean way to do
>> that though.
> 
> Using files_struct for that opens a can of worms you really don't
> want to touch.
> 
> Consider the following scenario with any variant of this interface:
> 	* create io_uring fd.
> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
> 	* add the descriptor of that AF_UNIX socket to your fd
> 	* close AF_UNIX fd, close io_uring fd.
> Voila - you've got a shiny leak.  No ->release() is called for
> anyone (and you really don't want to do that on ->flush(), because
> otherwise a library helper doing e.g. system("/bin/date") will tear
> down all the io_uring in your process).  The socket is held by
> the reference you've stashed into io_uring (whichever way you do
> that).  io_uring is held by the reference you've stashed into
> SCM_RIGHTS datagram in queue of the socket.
> 
> No matter what, you need net/unix/garbage.c to be aware of that stuff.
> And getting files_struct lifetime mixed into that would be beyond
> any reason.
> 
> The only reason for doing that as a descriptor table would be
> avoiding the cost of fget() in whatever uses it, right?  Since

Right, the only purpose of this patch is to avoid doing fget/fput for
each IO.

> those are *not* the normal syscalls (and fdget() really should not
> be used anywhere other than the very top of syscall's call chain -
> that's another reason why tossing file_struct around like that
> is insane) and since the benefit is all due to the fact that it's
> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
> treat file references as stable... why the hell use the descriptor
> tables at all?

This one is not a regular system call, since we don't do fget, then IO,
then fput. We hang on to it. But for the non-registered case, it's very
much just like a regular read/write system call, where we fget to do IO
on it, then fput when we are done.

> All you need is an array of struct file *, explicitly populated.
> With net/unix/garbage.c aware of such beasts.  Guess what?  We
> do have such an object already.  The one net/unix/garbage.c is
> working with.  SCM_RIGHTS datagrams, that is.
> 
> IOW, can't we give those io_uring descriptors associated struct
> unix_sock?  No socket descriptors, no struct socket (probably),
> just the AF_UNIX-specific part thereof.  Then teach
> unix_inflight()/unix_notinflight() about getting unix_sock out
> of these guys (incidentally, both would seem to benefit from
> _not_ touching unix_gc_lock in case when there's no unix_sock
> attached to file we are dealing with - I might be missing
> something very subtle about barriers there, but it doesn't
> look likely).

That might be workable, though I'm not sure we currently have helpers to
just explicitly create a unix_sock by itself. Not familiar with the
networking bits at all, I'll take a look.

> And make that (i.e. registering the descriptors) mandatory.

I don't want to make it mandatory, that's very inflexible for managing
tons of files. The registration is useful for specific cases where we
have high frequency of operations on a set of files. Besides, it'd make
the use of the API cumbersome as well for the basic case of just wanting
to do async IO.

> Hell, combine that with creating io_uring fd, if we really
> care about the syscall count.  Benefits:

We don't care about syscall count for setup as much. If you're doing
registration of a file set, you're expected to do a LOT of IO to those
files. Hence having an extra one for setup is not a concern. My concern
is just making it mandatory to do registration, I don't think that's a
workable alternative.

> 	* no file_struct refcount wanking
> 	* no fget()/fput() (conditional, at that) from kernel
> threads
> 	* no CLOEXEC-dependent anything; just the teardown
> on the final fput(), whichever way it comes.
> 	* no fun with duelling garbage collectors.

The fget/fput from a kernel thread can be solved by just hanging on to
the struct file * when we punt the IO. Right now we don't, which is a
little silly, that should be changed.

Getting rid of the files_struct{} is doable.
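
Roughly along these lines (names invented):

#include <linux/fs.h>
#include <linux/workqueue.h>

/*
 * Sketch only: take the file reference at submit time and keep it with
 * the punted work item, so the async worker never has to fget()/fput()
 * against the submitting task's file table.
 */
struct punted_rw_sketch {
	struct work_struct	work;
	struct file		*file;	/* reference pinned before punting */
	/* per-request details (sqe copy, iovec, etc.) would live here */
};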

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-05  2:19       ` Jens Axboe
@ 2019-02-05 17:57         ` Jens Axboe
  2019-02-05 19:08           ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-02-05 17:57 UTC (permalink / raw)
  To: Al Viro, Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, avi, linux-fsdevel

On 2/4/19 7:19 PM, Jens Axboe wrote:
> On 2/3/19 7:56 PM, Al Viro wrote:
>> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> We normally have to fget/fput for each IO we do on a file. Even with
>>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>>> count adds up.
>>>>
>>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>>> for the io_uring_register(2) system call. The arguments passed in must
>>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>>> the number of file descriptors the application wishes to pin for the
>>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>>> called).
>>>>
>>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>>
>>>> Files are automatically unregistered when the io_uring context is
>>>> torn down. An application need only unregister if it wishes to
>>>> register a new set of fds.
>>>
>>> Crazy idea:
>>>
>>> Taking a step back, at a high level, basically this patch creates sort
>>> of the same difference that you get when you compare the following
>>> scenarios for normal multithreaded I/O in userspace:
>>
>>> This kinda makes me wonder whether this is really something that
>>> should be implemented specifically for the io_uring API, or whether it
>>> would make sense to somehow handle part of this in the generic VFS
>>> code and give the user the ability to prepare a new files_struct that
>>> can then be transferred to the worker thread, or something like
>>> that... I'm not sure whether there's a particularly clean way to do
>>> that though.
>>
>> Using files_struct for that opens a can of worms you really don't
>> want to touch.
>>
>> Consider the following scenario with any variant of this interface:
>> 	* create io_uring fd.
>> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
>> 	* add the descriptor of that AF_UNIX socket to your fd
>> 	* close AF_UNIX fd, close io_uring fd.
>> Voila - you've got a shiny leak.  No ->release() is called for
>> anyone (and you really don't want to do that on ->flush(), because
>> otherwise a library helper doing e.g. system("/bin/date") will tear
>> down all the io_uring in your process).  The socket is held by
>> the reference you've stashed into io_uring (whichever way you do
>> that).  io_uring is held by the reference you've stashed into
>> SCM_RIGHTS datagram in queue of the socket.
>>
>> No matter what, you need net/unix/garbage.c to be aware of that stuff.
>> And getting files_struct lifetime mixed into that would be beyond
>> any reason.
>>
>> The only reason for doing that as a descriptor table would be
>> avoiding the cost of fget() in whatever uses it, right?  Since
> 
> Right, the only purpose of this patch is to avoid doing fget/fput for
> each IO.
> 
>> those are *not* the normal syscalls (and fdget() really should not
>> be used anywhere other than the very top of syscall's call chain -
>> that's another reason why tossing file_struct around like that
>> is insane) and since the benefit is all due to the fact that it's
>> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
>> treat file references as stable... why the hell use the descriptor
>> tables at all?
> 
> This one is not a regular system call, since we don't do fget, then IO,
> then fput. We hang on to it. But for the non-registered case, it's very
> much just like a regular read/write system call, where we fget to do IO
> on it, then fput when we are done.
> 
>> All you need is an array of struct file *, explicitly populated.
>> With net/unix/garbage.c aware of such beasts.  Guess what?  We
>> do have such an object already.  The one net/unix/garbage.c is
>> working with.  SCM_RIGHTS datagrams, that is.
>>
>> IOW, can't we give those io_uring descriptors associated struct
>> unix_sock?  No socket descriptors, no struct socket (probably),
>> just the AF_UNIX-specific part thereof.  Then teach
>> unix_inflight()/unix_notinflight() about getting unix_sock out
>> of these guys (incidentally, both would seem to benefit from
>> _not_ touching unix_gc_lock in case when there's no unix_sock
>> attached to file we are dealing with - I might be missing
>> something very subtle about barriers there, but it doesn't
>> look likely).
> 
> That might be workable, though I'm not sure we currently have helpers to
> just explicitly create a unix_sock by itself. Not familiar with the
> networking bits at all, I'll take a look.
> 
>> And make that (i.e. registering the descriptors) mandatory.
> 
> I don't want to make it mandatory, that's very inflexible for managing
> tons of files. The registration is useful for specific cases where we
> have high frequency of operations on a set of files. Besides, it'd make
> the use of the API cumbersome as well for the basic case of just wanting
> to do async IO.
> 
>> Hell, combine that with creating io_uring fd, if we really
>> care about the syscall count.  Benefits:
> 
> We don't care about syscall count for setup as much. If you're doing
> registration of a file set, you're expected to do a LOT of IO to those
> files. Hence having an extra one for setup is not a concern. My concern
> is just making it mandatory to do registration, I don't think that's a
> workable alternative.
> 
>> 	* no file_struct refcount wanking
>> 	* no fget()/fput() (conditional, at that) from kernel
>> threads
>> 	* no CLOEXEC-dependent anything; just the teardown
>> on the final fput(), whichever way it comes.
>> 	* no fun with duelling garbage collectors.
> 
> The fget/fput from a kernel thread can be solved by just hanging on to
> the struct file * when we punt the IO. Right now we don't, which is a
> little silly, that should be changed.
> 
> Getting rid of the files_struct{} is doable.

OK, I've reworked the initial parts to wire up the io_uring fd to the
AF_UNIX garbage collection. As I made it to the file registration part,
I wanted to wire up that too. But I don't think there's a need for that
- if we have the io_uring fd appropriately protected, we'll be dropping
our struct file ** array index when the io_uring fd is released. That
should be adequate, we don't need the garbage collection to be aware of
those individually.

The only part I had to drop for now is the sq thread polling, as that
depends on us carrying the files_struct. I'm going to fold that in
shortly, but just make it be dependent on having registered files. That
avoids needing to fget/fput for that case, and needing registered files
for the sq side submission/polling is not a usability issue like it
would be for the "normal" use cases.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-05 17:57         ` Jens Axboe
@ 2019-02-05 19:08           ` Jens Axboe
  2019-02-06  0:27             ` Jens Axboe
  2019-02-06  0:56             ` Al Viro
  0 siblings, 2 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-05 19:08 UTC (permalink / raw)
  To: Al Viro, Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, avi, linux-fsdevel

On 2/5/19 10:57 AM, Jens Axboe wrote:
> On 2/4/19 7:19 PM, Jens Axboe wrote:
>> On 2/3/19 7:56 PM, Al Viro wrote:
>>> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>> We normally have to fget/fput for each IO we do on a file. Even with
>>>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>>>> count adds up.
>>>>>
>>>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>>>> for the io_uring_register(2) system call. The arguments passed in must
>>>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>>>> the number of file descriptors the application wishes to pin for the
>>>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>>>> called).
>>>>>
>>>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>>>
>>>>> Files are automatically unregistered when the io_uring context is
>>>>> torn down. An application need only unregister if it wishes to
>>>>> register a new set of fds.
>>>>
>>>> Crazy idea:
>>>>
>>>> Taking a step back, at a high level, basically this patch creates sort
>>>> of the same difference that you get when you compare the following
>>>> scenarios for normal multithreaded I/O in userspace:
>>>
>>>> This kinda makes me wonder whether this is really something that
>>>> should be implemented specifically for the io_uring API, or whether it
>>>> would make sense to somehow handle part of this in the generic VFS
>>>> code and give the user the ability to prepare a new files_struct that
>>>> can then be transferred to the worker thread, or something like
>>>> that... I'm not sure whether there's a particularly clean way to do
>>>> that though.
>>>
>>> Using files_struct for that opens a can of worms you really don't
>>> want to touch.
>>>
>>> Consider the following scenario with any variant of this interface:
>>> 	* create io_uring fd.
>>> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
>>> 	* add the descriptor of that AF_UNIX socket to your fd
>>> 	* close AF_UNIX fd, close io_uring fd.
>>> Voila - you've got a shiny leak.  No ->release() is called for
>>> anyone (and you really don't want to do that on ->flush(), because
>>> otherwise a library helper doing e.g. system("/bin/date") will tear
>>> down all the io_uring in your process).  The socket is held by
>>> the reference you've stashed into io_uring (whichever way you do
>>> that).  io_uring is held by the reference you've stashed into
>>> SCM_RIGHTS datagram in queue of the socket.
>>>
>>> No matter what, you need net/unix/garbage.c to be aware of that stuff.
>>> And getting files_struct lifetime mixed into that would be beyond
>>> any reason.
>>>
>>> The only reason for doing that as a descriptor table would be
>>> avoiding the cost of fget() in whatever uses it, right?  Since
>>
>> Right, the only purpose of this patch is to avoid doing fget/fput for
>> each IO.
>>
>>> those are *not* the normal syscalls (and fdget() really should not
>>> be used anywhere other than the very top of syscall's call chain -
>>> that's another reason why tossing file_struct around like that
>>> is insane) and since the benefit is all due to the fact that it's
>>> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
>>> treat file references as stable... why the hell use the descriptor
>>> tables at all?
>>
>> This one is not a regular system call, since we don't do fget, then IO,
>> then fput. We hang on to it. But for the non-registered case, it's very
>> much just like a regular read/write system call, where we fget to do IO
>> on it, then fput when we are done.
>>
>>> All you need is an array of struct file *, explicitly populated.
>>> With net/unix/garbage.c aware of such beasts.  Guess what?  We
>>> do have such an object already.  The one net/unix/garbage.c is
>>> working with.  SCM_RIGHTS datagrams, that is.
>>>
>>> IOW, can't we give those io_uring descriptors associated struct
>>> unix_sock?  No socket descriptors, no struct socket (probably),
>>> just the AF_UNIX-specific part thereof.  Then teach
>>> unix_inflight()/unix_notinflight() about getting unix_sock out
>>> of these guys (incidentally, both would seem to benefit from
>>> _not_ touching unix_gc_lock in case when there's no unix_sock
>>> attached to file we are dealing with - I might be missing
>>> something very subtle about barriers there, but it doesn't
>>> look likely).
>>
>> That might be workable, though I'm not sure we currently have helpers to
>> just explicitly create a unix_sock by itself. Not familiar with the
>> networking bits at all, I'll take a look.
>>
>>> And make that (i.e. registering the descriptors) mandatory.
>>
>> I don't want to make it mandatory, that's very inflexible for managing
>> tons of files. The registration is useful for specific cases where we
>> have high frequency of operations on a set of files. Besides, it'd make
>> the use of the API cumbersome as well for the basic case of just wanting
>> to do async IO.
>>
>>> Hell, combine that with creating io_uring fd, if we really
>>> care about the syscall count.  Benefits:
>>
>> We don't care about syscall count for setup as much. If you're doing
>> registration of a file set, you're expected to do a LOT of IO to those
>> files. Hence having an extra one for setup is not a concern. My concern
>> is just making it mandatory to do registration, I don't think that's a
>> workable alternative.
>>
>>> 	* no file_struct refcount wanking
>>> 	* no fget()/fput() (conditional, at that) from kernel
>>> threads
>>> 	* no CLOEXEC-dependent anything; just the teardown
>>> on the final fput(), whichever way it comes.
>>> 	* no fun with duelling garbage collectors.
>>
>> The fget/fput from a kernel thread can be solved by just hanging on to
>> the struct file * when we punt the IO. Right now we don't, which is a
>> little silly, that should be changed.
>>
>> Getting rid of the files_struct{} is doable.
> 
> OK, I've reworked the initial parts to wire up the io_uring fd to the
> AF_UNIX garbage collection. As I made it to the file registration part,
> I wanted to wire up that too. But I don't think there's a need for that
> - if we have the io_uring fd appropriately protected, we'll be dropping
> our struct file ** array index when the io_uring fd is released. That
> should be adequate, we don't need the garbage collection to be aware of
> those individually.
> 
> The only part I had to drop for now is the sq thread polling, as that
> depends on us carrying the files_struct. I'm going to fold that in
> shortly, but just make it be dependent on having registered files. That
> avoids needing to fget/fput for that case, and needing registered files
> for the sq side submission/polling is not a usability issue like it
> would be for the "normal" use cases.

The proof is in the pudding; here's the main commit introducing io_uring,
now wired up to the AF_UNIX garbage collection:

http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5

How does that look? Outside of the inflight hookup, we simply retain
the file * for punting to the workqueue. This means that buffered
retry does NOT need to do fget/fput, so we don't need a files_struct
for that anymore.

In terms of the SQPOLL patch that's further down the series, it doesn't
allow that mode of operation without having fixed files enabled. That
eliminates the need for fget/fput from a kernel thread, and hence the
need to carry a files_struct around for that as well.
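
Roughly, the restriction boils down to something like this (a sketch;
the shape of the check and the error code are assumptions, not quoted
from the series):

/*
 * Sketch only: with an SQ polling kernel thread, require fixed files so
 * the thread never needs fget()/fput() on behalf of the application.
 */
static int sqpoll_requires_fixed_file(bool sq_thread_polling,
				      unsigned sqe_flags)
{
	if (sq_thread_polling && !(sqe_flags & IOSQE_FIXED_FILE))
		return -EBADF;
	return 0;
}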

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-05 19:08           ` Jens Axboe
@ 2019-02-06  0:27             ` Jens Axboe
  2019-02-06  1:01               ` Al Viro
  2019-02-06  0:56             ` Al Viro
  1 sibling, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-02-06  0:27 UTC (permalink / raw)
  To: Al Viro, Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, avi, linux-fsdevel

On 2/5/19 12:08 PM, Jens Axboe wrote:
> On 2/5/19 10:57 AM, Jens Axboe wrote:
>> On 2/4/19 7:19 PM, Jens Axboe wrote:
>>> On 2/3/19 7:56 PM, Al Viro wrote:
>>>> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote:
>>>>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> We normally have to fget/fput for each IO we do on a file. Even with
>>>>>> the batching we do, the cost of the atomic inc/dec of the file usage
>>>>>> count adds up.
>>>>>>
>>>>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>>>>>> for the io_uring_register(2) system call. The arguments passed in must
>>>>>> be an array of __s32 holding file descriptors, and nr_args should hold
>>>>>> the number of file descriptors the application wishes to pin for the
>>>>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is
>>>>>> called).
>>>>>>
>>>>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>>>>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>>>>>> to the index in the array passed in to IORING_REGISTER_FILES.
>>>>>>
>>>>>> Files are automatically unregistered when the io_uring context is
>>>>>> torn down. An application need only unregister if it wishes to
>>>>>> register a new set of fds.
>>>>>
>>>>> Crazy idea:
>>>>>
>>>>> Taking a step back, at a high level, basically this patch creates sort
>>>>> of the same difference that you get when you compare the following
>>>>> scenarios for normal multithreaded I/O in userspace:
>>>>
>>>>> This kinda makes me wonder whether this is really something that
>>>>> should be implemented specifically for the io_uring API, or whether it
>>>>> would make sense to somehow handle part of this in the generic VFS
>>>>> code and give the user the ability to prepare a new files_struct that
>>>>> can then be transferred to the worker thread, or something like
>>>>> that... I'm not sure whether there's a particularly clean way to do
>>>>> that though.
>>>>
>>>> Using files_struct for that opens a can of worms you really don't
>>>> want to touch.
>>>>
>>>> Consider the following scenario with any variant of this interface:
>>>> 	* create io_uring fd.
>>>> 	* send an SCM_RIGHTS with that fd to AF_UNIX socket.
>>>> 	* add the descriptor of that AF_UNIX socket to your fd
>>>> 	* close AF_UNIX fd, close io_uring fd.
>>>> Voila - you've got a shiny leak.  No ->release() is called for
>>>> anyone (and you really don't want to do that on ->flush(), because
>>>> otherwise a library helper doing e.g. system("/bin/date") will tear
>>>> down all the io_uring in your process).  The socket is held by
>>>> the reference you've stashed into io_uring (whichever way you do
>>>> that).  io_uring is held by the reference you've stashed into
>>>> SCM_RIGHTS datagram in queue of the socket.
>>>>
>>>> No matter what, you need net/unix/garbage.c to be aware of that stuff.
>>>> And getting files_struct lifetime mixed into that would be beyond
>>>> any reason.
>>>>
>>>> The only reason for doing that as a descriptor table would be
>>>> avoiding the cost of fget() in whatever uses it, right?  Since
>>>
>>> Right, the only purpose of this patch is to avoid doing fget/fput for
>>> each IO.
>>>
>>>> those are *not* the normal syscalls (and fdget() really should not
>>>> be used anywhere other than the very top of syscall's call chain -
>>>> that's another reason why tossing file_struct around like that
>>>> is insane) and since the benefit is all due to the fact that it's
>>>> *NOT* shared, *NOT* modified in parallel, etc., allowing us to
>>>> treat file references as stable... why the hell use the descriptor
>>>> tables at all?
>>>
>>> This one is not a regular system call, since we don't do fget, then IO,
>>> then fput. We hang on to it. But for the non-registered case, it's very
>>> much just like a regular read/write system call, where we fget to do IO
>>> on it, then fput when we are done.
>>>
>>>> All you need is an array of struct file *, explicitly populated.
>>>> With net/unix/garbage.c aware of such beasts.  Guess what?  We
>>>> do have such an object already.  The one net/unix/garbage.c is
>>>> working with.  SCM_RIGHTS datagrams, that is.
>>>>
>>>> IOW, can't we give those io_uring descriptors associated struct
>>>> unix_sock?  No socket descriptors, no struct socket (probably),
>>>> just the AF_UNIX-specific part thereof.  Then teach
>>>> unix_inflight()/unix_notinflight() about getting unix_sock out
>>>> of these guys (incidentally, both would seem to benefit from
>>>> _not_ touching unix_gc_lock in case when there's no unix_sock
>>>> attached to file we are dealing with - I might be missing
>>>> something very subtle about barriers there, but it doesn't
>>>> look likely).
>>>
>>> That might be workable, though I'm not sure we currently have helpers to
>>> just explicitly create a unix_sock by itself. Not familiar with the
>>> networking bits at all, I'll take a look.
>>>
>>>> And make that (i.e. registering the descriptors) mandatory.
>>>
>>> I don't want to make it mandatory, that's very inflexible for managing
>>> tons of files. The registration is useful for specific cases where we
>>> have high frequency of operations on a set of files. Besides, it'd make
>>> the use of the API cumbersome as well for the basic case of just wanting
>>> to do async IO.
>>>
>>>> Hell, combine that with creating io_uring fd, if we really
>>>> care about the syscall count.  Benefits:
>>>
>>> We don't care about syscall count for setup as much. If you're doing
>>> registration of a file set, you're expected to do a LOT of IO to those
>>> files. Hence having an extra one for setup is not a concern. My concern
>>> is just making it mandatory to do registration, I don't think that's a
>>> workable alternative.
>>>
>>>> 	* no file_struct refcount wanking
>>>> 	* no fget()/fput() (conditional, at that) from kernel
>>>> threads
>>>> 	* no CLOEXEC-dependent anything; just the teardown
>>>> on the final fput(), whichever way it comes.
>>>> 	* no fun with duelling garbage collectors.
>>>
>>> The fget/fput from a kernel thread can be solved by just hanging on to
>>> the struct file * when we punt the IO. Right now we don't, which is a
>>> little silly, that should be changed.
>>>
>>> Getting rid of the files_struct{} is doable.
>>
>> OK, I've reworked the initial parts to wire up the io_uring fd to the
>> AF_UNIX garbage collection. As I made it to the file registration part,
>> I wanted to wire up that too. But I don't think there's a need for that
>> - if we have the io_uring fd appropriately protected, we'll be dropping
>> our struct file ** array index when the io_uring fd is released. That
>> should be adequate, we don't need the garbage collection to be aware of
>> those individually.
>>
>> The only part I had to drop for now is the sq thread polling, as that
>> depends on us carrying the files_struct. I'm going to fold that in
>> shortly, but just make it be dependent on having registered files. That
>> avoids needing to fget/fput for that case, and needing registered files
>> for the sq side submission/polling is not a usability issue like it
>> would be for the "normal" use cases.
> 
> Proof is in the pudding, here's the main commit introducing io_uring
> and now wiring it up to the AF_UNIX garbage collection:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
> 
> How does that look? Outside of the inflight hookup, we simply retain
> the file * for punting to the workqueue. This means that buffered
> retry does NOT need to do fget/fput, so we don't need a files_struct
> for that anymore.
> 
> In terms of the SQPOLL patch that's further down the series, it doesn't
> allow that mode of operation without having fixed files enabled. That
> eliminates the need for fget/fput from a kernel thread, and hence the
> need to carry a files_struct around for that as well.

This should be better, passes some basic testing, too:

http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=01a93aa784319a02ccfa6523371b93401c9e0073

Verified that we're grabbing the right refs, and don't hold any
ourselves. For the file registration, forbid registration of the
io_uring fd, as that is pointless and will introduce a loop regardless
of fd passing.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-05 19:08           ` Jens Axboe
  2019-02-06  0:27             ` Jens Axboe
@ 2019-02-06  0:56             ` Al Viro
  2019-02-06 13:41               ` Jens Axboe
  1 sibling, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-06  0:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Tue, Feb 05, 2019 at 12:08:25PM -0700, Jens Axboe wrote:
> Proof is in the pudding, here's the main commit introducing io_uring
> and now wiring it up to the AF_UNIX garbage collection:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
> 
> How does that look?

In a word - wrong.  Some theory: garbage collector assumes that there is
a subset of file references such that
	* for all files with such references there's an associated unix_sock.
	* all such references are stored in SCM_RIGHTS datagrams that can be
found by the garbage collector (currently: for data-bearing AF_UNIX sockets -
queued SCM_RIGHTS datagrams, for listeners - SCM_RIGHTS datagrams sent via
yet-to-be-accepted connections).
	* there is an efficient way to count those references for given file
(->inflight of the corresponding unix_sock).
	* removal of those references would render the graph acyclic.
	* file can _NOT_ be subject to syscalls unless there are references
to it outside of that subset.

unix_inflight() moves a reference into the subset
unix_notinflight() moves a reference out of the subset
activity that might add such references ought to call wait_for_unix_gc() first
(basically, to stall the massive insertions when gc is running).

Note that unix_gc() does *NOT* work in terms of dropping file references -
the primary effect is locating the SCM_RIGHTS datagrams that can be disposed
of and taking them out.  It simply won't do anything to your file references,
no matter what.  Add a printk into your ->release() and try to register io_uring
descriptor into itself, then close it.  And observe ->release() not being
called for that object.  Ever.

PS: The algorithm used by unix_gc() is basically this -

	grab unix_gc_lock (giving exclusion with unix_inflight/unix_notinflight
			   and stabilizing ->inflight counters)

	Candidates = {}
	for all unix_sock u such that u->inflight > 0
		if file corresponding to u has no other references
			Candidates += u

	/* everything else already is reachable; due to unix_gc_lock these
	   can't die or get syscall-visible references under us */
	Might_Die = Candidates

	/* invariant to maintain: for u in Candidates u->inflight will be equal
	   to the number of references from SCM_RIGHTS datagrams *except*
	   those immediately reachable from elements of Might_Die */

	for all u in Candidates
		for each file reference v in SCM_RIGHTS datagrams
					immediately reachable from u
			if v in Candidates
				v->inflight--

	To_Scan = ()	// stuff reachable from those must live
	for all u in Might_Die
		if u->inflight > 0
			queue u into To_Scan

	while To_Scan is non-empty
		u = dequeue(To_Scan)
		Might_Die -= u
		for each file reference v in SCM_RIGHTS datagrams
					immediately reachable from u
			if v in Candidates
				v->inflight++	// maintain the invariant
				if v in Might_Die
					queue v into To_Scan

	/* at that point nothing in Might_Die is reachable from the outside */

	/* restore the original values of ->inflight */
	for all u in Might_Die
		for each file reference v in SCM_RIGHTS datagrams
					immediately reachable from u
			if v in Candidates
				v->inflight++

	hitlist = ()
	for all u in Might_Die
		for each SCM_RIGHTS datagram D immediately reachable from u
			if D contains references to something in Candidates
				move D to hitlist
	/* all those datagrams would've never become reachable */

	drop unix_gc_lock

	discard all datagrams in hitlist.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-06  0:27             ` Jens Axboe
@ 2019-02-06  1:01               ` Al Viro
  2019-02-06 17:56                 ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-06  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Tue, Feb 05, 2019 at 05:27:29PM -0700, Jens Axboe wrote:

> This should be better, passes some basic testing, too:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=01a93aa784319a02ccfa6523371b93401c9e0073
> 
> Verified that we're grabbing the right refs, and don't hold any
> ourselves. For the file registration, forbid registration of the
> io_uring fd, as that is pointless and will introduce a loop regardless
> of fd passing.

*shrug*

So pass it to AF_UNIX socket and register _that_ - doesn't change the
underlying problem.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-06  0:56             ` Al Viro
@ 2019-02-06 13:41               ` Jens Axboe
  2019-02-07  4:00                 ` Al Viro
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-02-06 13:41 UTC (permalink / raw)
  To: Al Viro
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On 2/5/19 5:56 PM, Al Viro wrote:
> On Tue, Feb 05, 2019 at 12:08:25PM -0700, Jens Axboe wrote:
>> Proof is in the pudding, here's the main commit introducing io_uring
>> and now wiring it up to the AF_UNIX garbage collection:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
>>
>> How does that look?
> 
> In a word - wrong.  Some theory: garbage collector assumes that there is
> a subset of file references such that
> 	* for all files with such references there's an associated unix_sock.
> 	* all such references are stored in SCM_RIGHTS datagrams that can be
> found by the garbage collector (currently: for data-bearing AF_UNIX sockets -
> queued SCM_RIGHTS datagrams, for listeners - SCM_RIGHTS datagrams sent via
> yet-to-be-accepted connections).
> 	* there is an efficient way to count those references for given file
> (->inflight of the corresponding unix_sock).
> 	* removal of those references would render the graph acyclic.
> 	* file can _NOT_ be subject to syscalls unless there are references
> to it outside of that subset.

IOW, we cannot use fget() for registering files, and we still need fget/fput
in the fast path to retain safe use of the file. If I'm understanding you
correctly?

Just trying to ensure that I understand what you're saying here, as it seems
to refer to the file registration part, not the main patch (which did get
reworked, though).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-06  1:01               ` Al Viro
@ 2019-02-06 17:56                 ` Jens Axboe
  2019-02-07  4:05                   ` Al Viro
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-02-06 17:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On 2/5/19 6:01 PM, Al Viro wrote:
> On Tue, Feb 05, 2019 at 05:27:29PM -0700, Jens Axboe wrote:
> 
>> This should be better, passes some basic testing, too:
>>
>> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=01a93aa784319a02ccfa6523371b93401c9e0073
>>
>> Verified that we're grabbing the right refs, and don't hold any
>> ourselves. For the file registration, forbid registration of the
>> io_uring fd, as that is pointless and will introduce a loop regardless
>> of fd passing.
> 
> *shrug*
> 
> So pass it to AF_UNIX socket and register _that_ - doesn't change the
> underlying problem.

Maybe I'm being dense here, but it's an f_op match. Should catch a
passed fd as well, correct?

With that, how can there be a loop?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-06 13:41               ` Jens Axboe
@ 2019-02-07  4:00                 ` Al Viro
  2019-02-07  9:22                   ` Miklos Szeredi
  2019-02-07 18:45                   ` Jens Axboe
  0 siblings, 2 replies; 76+ messages in thread
From: Al Viro @ 2019-02-07  4:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Wed, Feb 06, 2019 at 06:41:00AM -0700, Jens Axboe wrote:
> On 2/5/19 5:56 PM, Al Viro wrote:
> > On Tue, Feb 05, 2019 at 12:08:25PM -0700, Jens Axboe wrote:
> >> Proof is in the pudding, here's the main commit introducing io_uring
> >> and now wiring it up to the AF_UNIX garbage collection:
> >>
> >> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
> >>
> >> How does that look?
> > 
> > In a word - wrong.  Some theory: garbage collector assumes that there is
> > a subset of file references such that
> > 	* for all files with such references there's an associated unix_sock.
> > 	* all such references are stored in SCM_RIGHTS datagrams that can be
> > found by the garbage collector (currently: for data-bearing AF_UNIX sockets -
> > queued SCM_RIGHTS datagrams, for listeners - SCM_RIGHTS datagrams sent via
> > yet-to-be-accepted connections).
> > 	* there is an efficient way to count those references for given file
> > (->inflight of the corresponding unix_sock).
> > 	* removal of those references would render the graph acyclic.
> > 	* file can _NOT_ be subject to syscalls unless there are references
> > to it outside of that subset.
> 
> IOW, we cannot use fget() for registering files, and we still need fget/fput
> in the fast path to retain safe use of the file. If I'm understanding you
> correctly?

No.  *ALL* references (inflight and not) are the same for file->f_count.
unix_inflight() does not grab a new reference to file; it only says that
reference passed to it by the caller is now an in-flight one.

OK, braindump time:

Lifetime for struct file is controlled by a simple refcount.  Destructor
(__fput() and ->release() called from it) is called once the counter hits
zero.  Which is fine, except for the situations when some struct file
references are pinned down in structures reachable only via our struct file
instance.

Each file descriptor counts as a reference.  IOW, dup() will increment
the refcount by 1, close() will decrement it, fork() will increment it
by the number of descriptors in your descriptor table referring to this
struct file, destruction of descriptor table on exit() will decrement
by the same amount, etc.

Syscalls like read() and friends turn descriptor(s) into struct file
references.  If descriptor table is shared, that counts as a new reference
that must be dropped in the end of syscall.  If it's not shared, we are
guaranteed that the reference in descriptor table will stay around until
the end of syscall, so we may use it without bumping the file refcount.
That's the difference between fget() and fdget() - the former will
bump the refcount, the latter will try to avoid that.  Of course, if
we do not intend to drop the reference we'd acquired by the end of
syscall, we want fget() - fdget() is for transient references only.
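
To make the distinction concrete, a minimal sketch of the two patterns
(example_read()/example_stash() are made-up names for illustration;
fdget()/fdput() and fget()/fput() are the real helpers):

#include <linux/file.h>
#include <linux/fs.h>

/* transient use: the reference must not outlive the syscall */
static ssize_t example_read(int fd, char __user *buf, size_t len)
{
	struct fd f = fdget(fd);	/* may avoid the refcount bump */
	ssize_t ret;

	if (!f.file)
		return -EBADF;
	ret = vfs_read(f.file, buf, len, &f.file->f_pos);
	fdput(f);			/* reference is gone by syscall return */
	return ret;
}

/* persistent use: caller stashes the file and must fput() it later */
static struct file *example_stash(int fd)
{
	return fget(fd);		/* always bumps the file refcount */
}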

Descriptor tables (struct files) *can* be shared; several processes
(usually - threads that share VM as well, but that's not necessary)
may be working with the same instance of struct files, so e.g. open()
in one of them is seen by the others.  The same goes for close(),
dup(), dup2(), etc.

That makes for an interesting corner case - what if two threads happen
to share a descriptor table and we have close(fd) in one of them
in the middle of read(fd, ..., ...) in another?  That's one area where
Unices differ - one variant is to abort read(), another - to have close()
wait for read() to finish, etc.  What we do is
	* close() succeeds immediately; the reference is removed from
descriptor table and dropped.
	* if close(fd) has happened before read(fd, ...) has converted
fd to a struct file reference, read() will get -EBADF.
	* otherwise, read() proceeds unmolested; the reference it has
acquired is dropped in the end of syscall.  If that's the last reference
to struct file, struct file will get shut down at that point.

clone(2) will have the child sharing descriptor table of parent if
CLONE_FILES is in the flags.  Note that in this case struct file
refcounts are not modified at all - no new references to files are
created.  Without CLONE_FILES it's the same as fork() - an independent
copy of descriptor table is created and populated by copies of references
to files, each bumping file's refcount.

unshare(2) with CLONE_FILES in flags will get a copy of descriptor table
(same as done on fork(), etc.) and switch to using it; the old reference
is dropped (note: it'll only bother with that if descriptor table used
to be shared in the first place - if we hold the only reference to
descriptor table, we'll just keep using it).
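
A quick userspace illustration of those sharing semantics (a sketch; the
stack size and the file opened are arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>

static int child_fn(void *arg)
{
	/* with CLONE_FILES this shows up in the parent's table as well */
	int fd = open("/dev/null", O_RDONLY);

	printf("child opened fd %d\n", fd);
	return 0;
}

int main(void)
{
	char *stack = malloc(128 * 1024);
	pid_t pid;

	/* shared descriptor table: no struct file refcounts are touched */
	pid = clone(child_fn, stack + 128 * 1024, CLONE_FILES | SIGCHLD, NULL);
	waitpid(pid, NULL, 0);
	free(stack);

	/* switch to a private copy; each reference copied into it bumps
	   the corresponding file's refcount */
	if (unshare(CLONE_FILES) != 0)
		perror("unshare");
	return 0;
}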

execve(2) does almost the same - if descriptor table used to be shared,
it will switch to a new copy first; in case of success the reference
to original is dropped, in case of failure we revert to original and
drop the copy.  Note that handling of close-on-exec is done in the _copy_
- the original is unaffected, so failing execve() does not disrupt the
descriptor table.

exit(2) will drop the reference to descriptor table.  When the last
reference is dropped, all file references are removed from it (and dropped).

The thread's pointer to descriptor table (current->files) is never
modified by other thread; something like ls /proc/<pid>/fd will fetch
it, so stores need to be protected (by task_lock(current)), but
only the thread itself can do them.

Note that while extra references to descriptor table can appear at any
time (/proc/<pid>/fd accesses, for example), such references may not
be used for modifications.  In particular, you can't switch to another
thread's descriptor table, unless it had been yours at some earlier
point _and_ you've kept a reference to it.

That's about it for descriptor tables; that, by far, is the main case
of persistently held struct file references.  Transient references
are grabbed by syscalls when they resolve descriptor to struct file *,
which ought to be done once per syscall _and_ reasonably early in
it.  Unfortunately, that's not all - there are other persistent struct
file references.

The things like "LOOP_SET_FD grabs a reference to struct file and
stashes it in ->lo_backing_file" are reasonably simple - the reference
will be dropped later, either directly by LOOP_CLR_FD (if nothing else
held the damn thing open at the time) or later in lo_release().
Note that in the latter case it's possible to get "close() of
/dev/loop descriptor drops the last reference to it, triggering
bdput(), which happens to by the last thing that held block device
opened, which triggers lo_release(), which drops the reference to
underlying struct file (almost certainly the last one by that point)".
It's still not a problem - while we have the underlying struct file
pinned by something held by another struct file, the dependencies'
graph is acyclic, so plain refcounts we are using work fine.

The same goes for the things like e.g. ecryptfs opening an underlying
(encrypted) file on open() and dropping it when the last reference
to ecryptfs file is dropped - the only difference here is that the
underlying struct file is never appearing in _anyone's_ descriptor
tables.

However, in a couple of cases we do have something trickier.

Case 1: SCM_RIGHTS datagram can be sent to an AF_UNIX socket.  That
converts the caller-supplied array of descriptors into an array of
struct file references, which gets attached to the packet we queue.
When the datagram is received, the struct file references are
moved into the descriptor table of the recipient or, in case of error,
dropped.  Note that sending some descriptors in an SCM_RIGHTS datagram
and closing them is perfectly legitimate - as soon as sendmsg(2)
returns you can go ahead and close the descriptors you've sent;
the references are already acquired and you don't need to wait for
the packet to be received.

That would still be simple, if not for the fact that there's nothing
to stop you from passing AF_UNIX sockets around the same way.  In fact,
that has legitimate uses and, most of the time, doesn't cause any
complications at all.  However, it is possible to get the situation
when
	* struct file instances A and B are both AF_UNIX sockets.
	* the only reference to A is in the SCM_RIGHTS packet that
sits in the receiving queue of B.
	* the only reference to B is in the SCM_RIGHTS packet that
sits in the receiving queue of A.
That, of course, is where the pure refcounting of any kind will break.

SCM_RIGHTS datagram that contains the sole reference to A can't be
received without the recipient getting hold of a reference to B.
Which cannot happen until somebody manages to receive the SCM_RIGHTS
datagram containing the sole reference to B.  Which cannot happen
until that somebody manages to get hold of a reference to A,
which cannot happen until the first SCM_RIGHTS datagram is
received.

Dropping the last reference to A would've discarded everything in
its receiving queue, including the SCM_RIGHTS that contains the
reference to B; however, that can't happen either - the other
SCM_RIGHTS datagram would have to be either received or discarded
first, etc.
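
For reference, the whole cycle can be built from userspace with a single
socketpair (a sketch; send_one_fd() is just the usual SCM_RIGHTS
boilerplate, and error checking is mostly omitted):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>

static void send_one_fd(int sock, int fd)
{
	char buf[CMSG_SPACE(sizeof(int))];
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(buf, 0, sizeof(buf));
	memset(&msg, 0, sizeof(msg));
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	if (sendmsg(sock, &msg, 0) < 0)
		perror("sendmsg");
}

int main(void)
{
	int s[2];	/* s[0] is "A", s[1] is "B" */

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, s) != 0)
		return 1;

	send_one_fd(s[0], s[0]);	/* ref to A now queued on B */
	send_one_fd(s[1], s[1]);	/* ref to B now queued on A */

	/* drop the only syscall-visible references; without unix_gc()
	   both struct files would now be leaked */
	close(s[0]);
	close(s[1]);
	return 0;
}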

Case 2: similar, with a bit of a twist.  AF_UNIX socket used for
descriptor passing is normally set up by socket(), followed by
connect().  As soon as connect() returns, one can start sending.
Note that connect() does *NOT* wait for the recipient to call
accept() - it creates the object that will serve as low-level
part of the other end of connection (complete with received
packet queue) and stashes that object into the queue of *listener*
socket.  Subsequent accept() fetches it from there and attaches
it to a new socket, completing the setup; in the meanwhile,
sending packets works fine.  Once accept() is done, it'll see
the stuff you'd sent already in the queue of the new socket and
everything works fine.

If the listening socket gets closed without accept() having
been called, its queue is flushed, discarding all pending
connection attempts, complete with _their_ queues.  Which is
the same effect as accept() + close(), so again, normally
everything just works.

However, consider the case when we have
	* struct file instances A and B being AF_UNIX sockets.
	* A is a listener
	* B is an established connection, with the other end
yet to be accepted on A
	* the only references to A and B are in an SCM_RIGHTS
datagram sent over by A.

That SCM_RIGHTS datagram could've been received, if somebody
had managed to call accept(2) on A and recvmsg(2) on the
socket created by that accept(2).  But that can't happen
without that somebody getting hold of a reference to A in
the first place, which can't happen without having received
that SCM_RIGHTS datagram.  It can't be discarded either,
since that can't happen without dropping the last reference
to A, which sits right in it.
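
Concretely, that situation too can be set up from userspace (a sketch;
the abstract socket name is arbitrary and error checking is omitted):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>

static void send_two_fds(int sock, int fd1, int fd2)
{
	char buf[CMSG_SPACE(2 * sizeof(int))];
	int fds[2] = { fd1, fd2 };
	/* a stream socket won't queue an skb for a zero-byte send,
	   so carry one byte of payload */
	struct iovec iov = { .iov_base = "x", .iov_len = 1 };
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(buf, 0, sizeof(buf));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(fds));
	memcpy(CMSG_DATA(cmsg), fds, sizeof(fds));

	if (sendmsg(sock, &msg, 0) < 0)
		perror("sendmsg");
}

int main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int a, b;

	strcpy(addr.sun_path + 1, "case2-demo");	/* abstract namespace */

	a = socket(AF_UNIX, SOCK_STREAM, 0);		/* A: listener */
	bind(a, (struct sockaddr *)&addr, sizeof(addr));
	listen(a, 1);

	b = socket(AF_UNIX, SOCK_STREAM, 0);		/* B: never accept()ed */
	connect(b, (struct sockaddr *)&addr, sizeof(addr));

	/* lands in the queue of the embryonic peer of B, which itself
	   sits in A's (never drained) accept queue */
	send_two_fds(b, a, b);

	close(a);
	close(b);	/* only refs to A and B now live in that SCM_RIGHTS */
	return 0;
}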

The difference from the previous case is that there we had
	A holds unix_sock of A
	unix_sock of A holds SCM_RIGHTS with reference to B
	B holds unix_sock of B
	unix_sock of B holds SCM_RIGHTS with reference to A
and here we have
	A holds unix_sock of A
	unix_sock of A holds the packet with reference to embryonic
unix_sock created by connect()
	that embryonic unix_sock holds SCM_RIGHTS with references
to A and B.

Dependency graph is different, but the problem is the same -
unreachable loops in it.  Note that neither class of situations
would occur normally - in the best case it's "somebody had been
doing rather convoluted descriptor passing, but everyone involved
got hit with kill -9 at the wrong time; please, make sure nothing
leaks".  That can happen, but a userland race (e.g. botched protocol
handling of some sort) or a deliberate abuse are much more likely.

Catching the loop creation is hard and paying for that every time
we do descriptor-passing would be a bad idea.  Besides, the loop
per se is not fatal - e.g if in the second case the descriptor for
A had been kept around, close(accept()) would've cleaned everything
up.  Which means that we need a garbage collector to deal with
the (rare) leaks.

Note that in both cases the leaks are caused by loops passing through
some SCM_RIGHTS datagrams that could never be received.  So locating
those, removing them from the queues they sit in and then discarding
the suckers is enough to resolve the situation.

Furthermore, in both cases the loop passes through unix_sock of
something that got sent over in an SCM_RIGHTS datagram.  So we
can do the following:
	1) keep the count of references to struct file of AF_UNIX
socket held by SCM_RIGHTS (kept in unix_sock->inflight).  Any
struct unix_sock instance without such references is not a part of
unreachable loop.  Maintain the set of unix_sock that are not excluded
by that (i.e. the ones that have some references from SCM_RIGHTS
instances).  Note that we don't need to maintain those counts in
struct file - we care only about unix_sock here.
	2) struct file of AF_UNIX socket with some references
*NOT* from SCM_RIGHTS is also not a part of unreachable loop.
	3) for each unix_sock consider the following set of
SCM_RIGHTS: everything in queue of that unix_sock if it's
a non-listener and everything in queues of all embryonic unix_sock
in queue of a listener.  Let's call those SCM_RIGHTS associated
with our unix_sock.
	4) all SCM_RIGHTS associated with a reachable unix_sock
are reachable.
	5) if some references to struct file of a unix_sock
are in reachable SCM_RIGHTS, it is reachable.

Garbage collector starts with calculating the set of potentially
unreachable unix_sock - the ones not excluded by (1,2).
No unix_sock instances outside of that set need to be considered.

If some unix_sock in that set has counter _not_ entirely covered
by SCM_RIGHTS associated with the elements of the set, we can
conclude that there are references to it in SCM_RIGHTS associated
with something outside of our set and therefore it is reachable
and can be removed from the set.

If that process converges to a non-empty set, we know that
everything left in that set is unreachable - all references
to their struct file come from _some_ SCM_RIGHTS datagrams
and all those SCM_RIGHTS datagrams are among those that can't
be received or discarded without getting hold of a reference
to struct file of something in our set.

Everything outside of that set is reachable, so taking the
SCM_RIGHTS with references to stuff in our set (all of
them to be found among those associated with elements of
our set) out of the queues they are in will break all
unreachable loops.  Discarding the collected datagrams
will do the rest - file references in those will be
dropped, etc.

One thing to keep in mind here is the locking.  What the garbage
collector relies upon is
	* changes of ->inflight are serialized with respect to
it (on unix_gc_lock; increment done by unix_inflight(),
decrement - by unix_notinflight()).
	* references cannot be extracted from SCM_RIGHTS datagrams
while the garbage collector is running (achieved by having
unix_notinflight() done before references are taken out of SCM_RIGHTS)
	* removal of SCM_RIGHTS associated with a socket can't
be done without a reference to that socket _outside_ of any
SCM_RIGHTS (automatically true).
	* adding SCM_RIGHTS in the middle of garbage collection
is possible, but in that case it will contain no references to
anything in the initial candidate set.

The last one is delicate.  SCM_RIGHTS creation has unix_inflight()
called for each reference we put there, so it's serialized wrt
unix_gc(); however, insertion into queue is *NOT* covered by that -
queue rescans are, but each queue has a lock of its own and they
are definitely not going to be held throughout the whole thing.

So in theory it would be possible to have
	* thread A: sendmsg() has SCM_RIGHTS created and populated,
complete with file refcount and ->inflight increments implied,
at which point it gets preempted and loses the timeslice.
	* thread B: gets to run and removes all references
from descriptor table it shares with thread A.
	* on another CPU we have garbage collector triggered;
it determines the set of potentially unreachable unix_sock and
everything in our SCM_RIGHTS _is_ in that set, now that no
other references remain.
	* on the first CPU, thread A regains the timeslice
and inserts its SCM_RIGHTS into queue.  And it does contain
references to sockets from the candidate set of running
garbage collector, confusing the hell out of it.


That is avoided by a convoluted dance around the SCM_RIGHTS creation
and insertion - we use fget() to obtain struct file references,
then _duplicate_ them in SCM_RIGHTS (bumping a refcount for each, so
we are holding *two* references), do unix_inflight() on them, then
queue the damn thing, then drop each reference we got from fget().

That way everything referred to in that SCM_RIGHTS is going to have
extra struct file references (and thus be excluded from the initial
candidate set) until after it gets inserted into queue.  In other
words, if it does appear in a queue between two passes, it's
guaranteed to contain no references to anything in the initial
candidate set.
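
Condensed into (approximate) code, the ordering on the send side is the
following - a sketch only; example_send_rights() is a made-up stand-in
for what __scm_send(), unix_attach_fds() and the sendmsg path do between
them, with locking and error handling dropped:

#include <linux/file.h>
#include <linux/skbuff.h>
#include <net/af_unix.h>
#include <net/scm.h>
#include <net/sock.h>

static void example_send_rights(struct scm_cookie *scm, struct sk_buff *skb,
				struct sock *other, int *fds, int num)
{
	int i;

	/* 1. turn descriptors into struct file references (first ref);
	 *    assumes scm->fp was allocated by the caller, as __scm_send()
	 *    would have done
	 */
	for (i = 0; i < num; i++)
		scm->fp->fp[i] = fget(fds[i]);
	scm->fp->count = num;

	/* 2. duplicate them into the skb (second ref each) ... */
	UNIXCB(skb).fp = scm_fp_dup(scm->fp);
	/*    ... and only then mark them in-flight */
	for (i = 0; i < num; i++)
		unix_inflight(scm->fp->user, scm->fp->fp[i]);

	/* 3. make the datagram visible to the gc's queue scans */
	skb_queue_tail(&other->sk_receive_queue, skb);

	/* 4. drop the step-1 references (what scm_destroy() does); until
	 *    this point every file in the datagram held a reference outside
	 *    SCM_RIGHTS, so none of them could have been in the initial
	 *    candidate set
	 */
	for (i = 0; i < num; i++)
		fput(scm->fp->fp[i]);
}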

End of braindump.
===================================================================

What you are about to add is *ANOTHER* kind of loops - references
to files in the "registered" set are pinned down by owning io_uring.

That would invalidate just about every assumption made by the garbage
collector - even if you forbid registering io_uring itself, you
still can register both ends of AF_UNIX socket pair, then pass
io_uring in SCM_RIGHTS over that, then close all descriptors involved.
From the garbage collector point of view all sockets have external
references, so there's nothing to collect.  In fact those external
references are only reachable if you have a reachable reference
to io_uring, so we get a leak.

To make it work:
	* have unix_sock created for each io_uring (as your code does)
	* do *NOT* have unix_inflight() done at that point - it's
completely wrong there.
	* file set registration becomes
		* create and populate SCM_RIGHTS, with the same
fget()+grab an extra reference + unix_inflight() sequence.
Don't forget to have skb->destructor set to unix_destruct_scm
or equivalent thereof.
		* remember UNIXCB(skb).fp - that'll give you your
array of struct file *, to use in lookups.
		* queue it into your unix_sock
		* do _one_ fput() for everything you've grabbed,
dropping one of two references you've taken.
	* unregistering is simply skb_dequeue() + kfree_skb().
	* in ->release() you do sock_release(); it'll do
everything you need (including unregistering the set, etc.)
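
A very rough sketch of what that registration step could look like - the
names on the io_uring side (example_register_files(), ctx->ring_sock) are
assumed from the patch under discussion, unix_destruct_scm() and friends
are currently static to net/unix/af_unix.c so some exporting or a helper
would be needed, and error handling plus the SCM_MAX_FD limit are ignored:

#include <linux/cred.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <net/af_unix.h>
#include <net/scm.h>
#include <net/sock.h>

static int example_register_files(struct io_ring_ctx *ctx, int __user *ufds,
				  unsigned int nr)
{
	struct scm_fp_list *fpl;
	struct sk_buff *skb;
	int i, fd;

	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
	skb = alloc_skb(0, GFP_KERNEL);
	fpl->user = get_current_user();

	for (i = 0; i < nr; i++) {
		get_user(fd, &ufds[i]);
		fpl->fp[i] = fget(fd);			/* ref #1 */
		get_file(fpl->fp[i]);			/* ref #2, owned by the skb */
		unix_inflight(fpl->user, fpl->fp[i]);
	}
	fpl->count = nr;

	UNIXCB(skb).fp = fpl;
	skb->destructor = unix_destruct_scm;
	skb_queue_tail(&ctx->ring_sock->sk->sk_receive_queue, skb);

	/* drop ref #1 again; the skb now owns the only "extra" references
	   and they are accounted as in-flight */
	for (i = 0; i < nr; i++)
		fput(fpl->fp[i]);
	return 0;
}

/* unregistering (and ->release() via sock_release()) then boils down to: */
static void example_unregister_files(struct io_ring_ctx *ctx)
{
	struct sk_buff *skb;

	skb = skb_dequeue(&ctx->ring_sock->sk->sk_receive_queue);
	kfree_skb(skb);	/* destructor drops the refs and in-flight counts */
}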

The hairiest part is the file set registration, of course -
there's almost certainly a helper or two buried in that thing;
simply exposing all the required net/unix/af_unix.c bits is
ucking fugly.

I'm not sure what you propose for non-registered descriptors -
_any_ struct file reference that outlives the return from the syscall,
stored in some io_uring-attached data structure, has exactly the same
loop (and leak) problem.  And if you mean to have it dropped before
return from syscall, I'm afraid I don't follow you.  How would
that be done?

Again, "io_uring descriptor can't be used in those requests" does
not help at all - use a socket instead, pass the io_uring fd over
it in SCM_RIGHTS and you are back to square 1.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-06 17:56                 ` Jens Axboe
@ 2019-02-07  4:05                   ` Al Viro
  2019-02-07 16:14                     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-07  4:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Wed, Feb 06, 2019 at 10:56:41AM -0700, Jens Axboe wrote:
> On 2/5/19 6:01 PM, Al Viro wrote:
> > On Tue, Feb 05, 2019 at 05:27:29PM -0700, Jens Axboe wrote:
> > 
> >> This should be better, passes some basic testing, too:
> >>
> >> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=01a93aa784319a02ccfa6523371b93401c9e0073
> >>
> >> Verified that we're grabbing the right refs, and don't hold any
> >> ourselves. For the file registration, forbid registration of the
> >> io_uring fd, as that is pointless and will introduce a loop regardless
> >> of fd passing.
> > 
> > *shrug*
> > 
> > So pass it to AF_UNIX socket and register _that_ - doesn't change the
> > underlying problem.
> 
> Maybe I'm being dense here, but it's an f_op match. Should catch a
> passed fd as well, correct?

f_op match on _what_?

> With that, how can there be a loop?

	io_uring_fd = ....
	socketpair(PF_UNIX, SOCK_STREAM, 0, sock_fds);
	register sock_fds[0] and sock_fds[1] to io_uring_fd
	send SCM_RIGHTS datagram with io_uring_fd to sock_fds[0]
	close sock_fds[0], sock_fds[1] and io_uring_fd

And there's your unreachable loop.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07  4:00                 ` Al Viro
@ 2019-02-07  9:22                   ` Miklos Szeredi
  2019-02-07 13:31                     ` Al Viro
  2019-02-07 18:45                   ` Jens Axboe
  1 sibling, 1 reply; 76+ messages in thread
From: Miklos Szeredi @ 2019-02-07  9:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API, hch,
	jmoyer, avi, linux-fsdevel

On Thu, Feb 07, 2019 at 04:00:59AM +0000, Al Viro wrote:

> So in theory it would be possible to have
> 	* thread A: sendmsg() has SCM_RIGHTS created and populated,
> complete with file refcount and ->inflight increments implied,
> at which point it gets preempted and loses the timeslice.
> 	* thread B: gets to run and removes all references
> from descriptor table it shares with thread A.
> 	* on another CPU we have garbage collector triggered;
> it determines the set of potentially unreachable unix_sock and
> everything in our SCM_RIGHTS _is_ in that set, now that no
> other references remain.
> 	* on the first CPU, thread A regains the timeslice
> and inserts its SCM_RIGHTS into queue.  And it does contain
> references to sockets from the candidate set of running
> garbage collector, confusing the hell out of it.

Reminds me: long time ago there was a bug report, and based on that I found a
bug in MSG_PEEK handling (not confirmed to have fixed the reported bug).  This
fix, although pretty simple, got lost somehow.  While unix gc code is in your
head, can you please review and I'll resend through davem?

Thanks,
Miklos
---

From: Miklos Szeredi <mszeredi@redhat.com>
Subject: af_unix: fix garbage collect vs. MSG_PEEK

Gc assumes that in-flight sockets that don't have an external ref can't
gain one while unix_gc_lock is held.  That is true because
unix_notinflight() will be called before detaching fds, which takes
unix_gc_lock.

Only MSG_PEEK was somehow overlooked.  That one also clones the fds, also
keeping them in the skb.  But through MSG_PEEK an external reference can
definitely be gained without ever touching unix_gc_lock.

This patch adds unix_gc_barrier() that waits for a garbage collect run to
finish (if there is one), before actually installing the peeked in-flight
files to file descriptors.  This prevents problems from a pure in-flight
socket having its buffers modified while the garbage collect is taking
place.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
---
 include/net/af_unix.h |    1 +
 net/unix/af_unix.c    |   15 +++++++++++++--
 net/unix/garbage.c    |    6 ++++++
 3 files changed, 20 insertions(+), 2 deletions(-)

--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -12,6 +12,7 @@ void unix_inflight(struct user_struct *u
 void unix_notinflight(struct user_struct *user, struct file *fp);
 void unix_gc(void);
 void wait_for_unix_gc(void);
+void unix_gc_barrier(void);
 struct sock *unix_get_socket(struct file *filp);
 struct sock *unix_peer_get(struct sock *sk);
 
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1547,6 +1547,17 @@ static int unix_attach_fds(struct scm_co
 	return 0;
 }
 
+static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+	scm->fp = scm_fp_dup(UNIXCB(skb).fp);
+	/*
+	 * During garbage collection it is assumed that in-flight sockets don't
+	 * get a new external reference.  So we need to wait until current run
+	 * finishes.
+	 */
+	unix_gc_barrier();
+}
+
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
 {
 	int err = 0;
@@ -2171,7 +2182,7 @@ static int unix_dgram_recvmsg(struct soc
 		sk_peek_offset_fwd(sk, size);
 
 		if (UNIXCB(skb).fp)
-			scm.fp = scm_fp_dup(UNIXCB(skb).fp);
+			unix_peek_fds(&scm, skb);
 	}
 	err = (flags & MSG_TRUNC) ? skb->len - skip : size;
 
@@ -2412,7 +2423,7 @@ static int unix_stream_read_generic(stru
 			/* It is questionable, see note in unix_dgram_recvmsg.
 			 */
 			if (UNIXCB(skb).fp)
-				scm.fp = scm_fp_dup(UNIXCB(skb).fp);
+				unix_peek_fds(&scm, skb);
 
 			sk_peek_offset_fwd(sk, chunk);
 
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -267,6 +267,12 @@ void wait_for_unix_gc(void)
 	wait_event(unix_gc_wait, gc_in_progress == false);
 }
 
+void unix_gc_barrier(void)
+{
+	spin_lock(&unix_gc_lock);
+	spin_unlock(&unix_gc_lock);
+}
+
 /* The external entry point: unix_gc() */
 void unix_gc(void)
 {

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07  9:22                   ` Miklos Szeredi
@ 2019-02-07 13:31                     ` Al Viro
  2019-02-07 14:20                       ` Miklos Szeredi
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-07 13:31 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API, hch,
	jmoyer, avi, linux-fsdevel

On Thu, Feb 07, 2019 at 10:22:53AM +0100, Miklos Szeredi wrote:
> On Thu, Feb 07, 2019 at 04:00:59AM +0000, Al Viro wrote:
> 
> > So in theory it would be possible to have
> > 	* thread A: sendmsg() has SCM_RIGHTS created and populated,
> > complete with file refcount and ->inflight increments implied,
> > at which point it gets preempted and loses the timeslice.
> > 	* thread B: gets to run and removes all references
> > from descriptor table it shares with thread A.
> > 	* on another CPU we have garbage collector triggered;
> > it determines the set of potentially unreachable unix_sock and
> > everything in our SCM_RIGHTS _is_ in that set, now that no
> > other references remain.
> > 	* on the first CPU, thread A regains the timeslice
> > and inserts its SCM_RIGHTS into queue.  And it does contain
> > references to sockets from the candidate set of running
> > garbage collector, confusing the hell out of it.
> 
> Reminds me: long time ago there was a bug report, and based on that I found a
> bug in MSG_PEEK handling (not confirmed to have fixed the reported bug).  This
> fix, although pretty simple, got lost somehow.  While unix gc code is in your
> head, can you please review and I'll resend through davem?

Umm...  I think the bug is real (something that looks like an eviction
candidate, but actually is referenced from the reachable queue might
get peeked via that queue, then have _its_ queue modified via new
external reference, all between two passes over that queue, confusing the
fuck out of unix_gc()), but I think the fix is an overkill...

Am I right assuming that this queue-modifying operation is accept(), removing
an embryo unix_sock from the queue of listener and thus hiding SCM_RIGHTS in
_its_ queue from scan_children()?

Let me think of it a bit, OK?  While we are at it, some questions from digging
through the current net/unix/garbage.c:
	1) is there any need for ->inflight to be atomic?  All accesses are under
unix_gc_lock, after all...
	2) pumping unix_gc_lock on each sodding reference in SCM_RIGHTS (within
unix_notinflight()/unix_inflight()) looks atrocious... wouldn't it be better to
hold it over that loop?
	3) unix_get_socket() probably ought to be static nowadays...
	4) I wonder if in scan_inflight()/scan_children() we would be better
off with explicit switch (by enum argument) instead of an indirect call.
	5) do we really need UNIX_GC_MAYBE_CYCLE?  Looks like the only
real use is in inc_inflight_move_tail(), and AFAICS it could bloody well
have been
	u->inflight++;
	if (u->inflight == 1)	// just got from zero to non-zero
		list_move_tail(&u->link, &gc_candidates);
The logics there is "we'd found a reference to something that still was
a candidate for eviction in a reachable SCM_RIGHTS, so it's actually
reachable and needs to be scanned (and removed from the set of candidates);
move to the end of list, so that the main loop gets around to it".
If it *was* past the cursor in the list, there's no need to move it; if
we got past it, it must've had zero ->inflight (or we would've removed
it from the set back when we got past it).  Note that it's only called if
UNIX_GC_CANDIDATE is set (i.e. if it's in the initial candidate set),
so for this one ->inflight is guaranteed to mean the number of SCM_RIGHTS
refs from outside of the current candidate set...
	6) unix_get_socket() looks like it might benefit from another
FMODE bit; not sure where it's best set, though - the obvious way would
be SOCK_GC_CARES in sock->flags, set by e.g. unix_create1(), with
sock_alloc_file() propagating it into file->f_mode.  Then unix_get_socket()
would be able to bugger off with NULL for most of the references in
SCM_RIGHTS, without looking into inode...

Comments?  IIRC, you'd done the last serious round of rewriting unix_gc()
and friends...
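
For (4), one possible shape, folding in (5) as well - a sketch; the enum
and helper names here are invented, the current code passes small helpers
like dec_inflight()/inc_inflight_move_tail() via a function pointer:

/* sketch only: names invented; assumes this lives in net/unix/garbage.c
   next to gc_candidates */
enum unix_scan_op {
	UNIX_SCAN_DEC_INFLIGHT,		/* first pass: discount internal refs */
	UNIX_SCAN_INC_MOVE_TAIL,	/* rescue pass: restore and rescan */
};

static void unix_scan_apply(struct unix_sock *u, enum unix_scan_op op)
{
	switch (op) {
	case UNIX_SCAN_DEC_INFLIGHT:
		atomic_long_dec(&u->inflight);
		break;
	case UNIX_SCAN_INC_MOVE_TAIL:
		/* per (5): only the 0 -> 1 transition needs the move */
		if (atomic_long_inc_return(&u->inflight) == 1)
			list_move_tail(&u->link, &gc_candidates);
		break;
	}
}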

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 13:31                     ` Al Viro
@ 2019-02-07 14:20                       ` Miklos Szeredi
  2019-02-07 15:20                         ` Al Viro
  0 siblings, 1 reply; 76+ messages in thread
From: Miklos Szeredi @ 2019-02-07 14:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, linux-fsdevel

On Thu, Feb 7, 2019 at 2:31 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Feb 07, 2019 at 10:22:53AM +0100, Miklos Szeredi wrote:
> > On Thu, Feb 07, 2019 at 04:00:59AM +0000, Al Viro wrote:
> >
> > > So in theory it would be possible to have
> > >     * thread A: sendmsg() has SCM_RIGHTS created and populated,
> > > complete with file refcount and ->inflight increments implied,
> > > at which point it gets preempted and loses the timeslice.
> > >     * thread B: gets to run and removes all references
> > > from descriptor table it shares with thread A.
> > >     * on another CPU we have garbage collector triggered;
> > > it determines the set of potentially unreachable unix_sock and
> > > everything in our SCM_RIGHTS _is_ in that set, now that no
> > > other references remain.
> > >     * on the first CPU, thread A regains the timeslice
> > > and inserts its SCM_RIGHTS into queue.  And it does contain
> > > references to sockets from the candidate set of running
> > > garbage collector, confusing the hell out of it.
> >
> > Reminds me: long time ago there was a bug report, and based on that I found a
> > bug in MSG_PEEK handling (not confirmed to have fixed the reported bug).  This
> > fix, although pretty simple, got lost somehow.  While unix gc code is in your
> > head, can you please review and I'll resend through davem?
>
> Umm...  I think the bug is real (something that looks like an eviction
> candidate, but actually is referenced from the reachable queue might
> get peeked via that queue, then have _its_ queue modified via new
> external reference, all between two passes over that queue, confusing the
> fuck out of unix_gc()), but I think the fix is an overkill...
>
> Am I right assuming that this queue-modifying operation is accept(), removing
> an embryo unix_sock from the queue of listener and thus hiding SCM_RIGHTS in
> _its_ queue from scan_children()?

Hmm... How about just receiving an SCM_RIGHTS socket (which was a
candidate) from the queue of the peeked socket?

> Let me think of it a bit, OK?  While we are at it, some questions from digging
> through the current net/unix/garbage.c:
>         1) is there any need for ->inflight to be atomic?  All accesses are under
> unix_gc_lock, after all...

Seems so.  Probably historic.

>         2) pumping unix_gc on each sodding reference in SCM_RIGHTS (within
> unix_notinflight()/unix_inflight()) looks atrocious... wouldn't it be better to
> hold it over that loop?

Sure.  I guess SCM_RIGHTS are not too performance sensitive, but
that's a trivial cleanup...


>         3) unix_get_socket() probably ought to be static nowadays...
>         4) I wonder if in scan_inflight()/scan_children() we would be better
> off with explicit switch (by enum argument) instead of an indirect call.

Right.

>         5) do we really need UNIX_GC_MAYBE_CYCLE?  Looks like the only
> real use is in inc_inflight_move_tail(), and AFAICS it could bloody well
> have been
>         u->inflight++;
>         if (u->inflight == 1)   // just got from zero to non-zero
>                 list_move_tail(&u->link, &gc_candidates);
> The logics there is "we'd found a reference to something that still was
> a candidate for eviction in a reachable SCM_RIGHTS, so it's actually
> reachable and needs to be scanned (and removed from the set of candidates);
> move to the end of list, so that the main loop gets around to it".
> If it *was* past the cursor in the list, there's no need to move it; if
> we got past it, it must've had zero ->inflight (or we would've removed
> it from the set back when we got past it).  Note that it's only called if
> UNIX_GC_CANDIDATE is set (i.e. if it's in the initial candidate set),
> so for this one ->inflight is guaranteed to mean the number of SCM_RIGHTS
> refs from outside of the current candidate set...

Right, makes sense.

>         6) unix_get_socket() looks like it might benefit from another
> FMODE bit; not sure where it's best set, though - the obvious way would
> be SOCK_GC_CARES in sock->flags, set by e.g. unix_create1(), with
> sock_alloc_file() propagating it into file->f_mode.  Then unix_get_socket()
> would be able to bugger off with NULL for most of the references in
> SCM_RIGHTS, without looking into inode...

Yep, sounds good.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 14:20                       ` Miklos Szeredi
@ 2019-02-07 15:20                         ` Al Viro
  2019-02-07 15:27                           ` Miklos Szeredi
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-07 15:20 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, linux-fsdevel

On Thu, Feb 07, 2019 at 03:20:06PM +0100, Miklos Szeredi wrote:

> > Am I right assuming that this queue-modifying operation is accept(), removing
> > an embryo unix_sock from the queue of listener and thus hiding SCM_RIGHTS in
> > _its_ queue from scan_children()?
> 
> Hmm... How about just receiving an SCM_RIGHTS socket (which was a
> candidate) from the queue of the peeked socket?

Right, skb unlinked before unix_detach_fds().  I was actually thinking of a stream
case, where unlink is done after that...

*grumble*

The entire thing is far too brittle for my taste ;-/

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 15:20                         ` Al Viro
@ 2019-02-07 15:27                           ` Miklos Szeredi
  2019-02-07 16:26                             ` Al Viro
  0 siblings, 1 reply; 76+ messages in thread
From: Miklos Szeredi @ 2019-02-07 15:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, linux-fsdevel

On Thu, Feb 7, 2019 at 4:20 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Feb 07, 2019 at 03:20:06PM +0100, Miklos Szeredi wrote:
>
> > > Am I right assuming that this queue-modifying operation is accept(), removing
> > > an embryo unix_sock from the queue of listener and thus hiding SCM_RIGHTS in
> > > _its_ queue from scan_children()?
> >
> > Hmm... How about just receiving an SCM_RIGHTS socket (which was a
> > candidate) from the queue of the peeked socket?
>
> Right, skb unlinked before unix_detach_fds().  I was actually thinking of a stream
> case, where unlink is done after that...
>
> *grumble*
>
> The entire thing is far too brittle for my taste ;-/

If it gets used as part of io_uring, I guess it's worth a fresh look.
I wrote it without basically any experience with either networking or
garbage collecting, so no wonder it has rough edges.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07  4:05                   ` Al Viro
@ 2019-02-07 16:14                     ` Jens Axboe
  2019-02-07 16:30                       ` Al Viro
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 16:14 UTC (permalink / raw)
  To: Al Viro
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1524 bytes --]

On 2/6/19 9:05 PM, Al Viro wrote:
> On Wed, Feb 06, 2019 at 10:56:41AM -0700, Jens Axboe wrote:
>> On 2/5/19 6:01 PM, Al Viro wrote:
>>> On Tue, Feb 05, 2019 at 05:27:29PM -0700, Jens Axboe wrote:
>>>
>>>> This should be better, passes some basic testing, too:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=01a93aa784319a02ccfa6523371b93401c9e0073
>>>>
>>>> Verified that we're grabbing the right refs, and don't hold any
>>>> ourselves. For the file registration, forbid registration of the
>>>> io_uring fd, as that is pointless and will introduce a loop regardless
>>>> of fd passing.
>>>
>>> *shrug*
>>>
>>> So pass it to AF_UNIX socket and register _that_ - doesn't change the
>>> underlying problem.
>>
>> Maybe I'm being dense here, but it's an f_op match. Should catch a
>> passed fd as well, correct?
> 
> f_op match on _what_?
> 
>> With that, how can there be a loop?
> 
> 	io_uring_fd = ....
> 	socketpair(PF_UNIX, SOCK_STREAM, 0, sock_fds);
> 	register sock_fds[0] and sock_fds[1] to io_uring_fd
> 	send SCM_RIGHTS datagram with io_uring_fd to sock_fds[0]
> 	close sock_fds[0], sock_fds[1] and io_uring_fd
> 
> And there's your unreachable loop.

I created a small app to do just that, and ran it and verified that
->release() is called and the io_uring is released as expected. This
is run on the current -git branch, which has a socket backing for
the io_uring fd itself, but not for the registered files.

What am I missing here? Attaching the program as a reference.

-- 
Jens Axboe


[-- Attachment #2: viro.c --]
[-- Type: text/x-csrc, Size: 2969 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <linux/fs.h>

struct io_sqring_offsets {
	__u32 head;
	__u32 tail;
	__u32 ring_mask;
	__u32 ring_entries;
	__u32 flags;
	__u32 dropped;
	__u32 array;
	__u32 resv[3];
};

struct io_cqring_offsets {
	__u32 head;
	__u32 tail;
	__u32 ring_mask;
	__u32 ring_entries;
	__u32 overflow;
	__u32 cqes;
	__u32 resv[4];
};

struct io_uring_params {
	__u32 sq_entries;
	__u32 cq_entries;
	__u32 flags;
	__u32 sq_thread_cpu;
	__u32 sq_thread_idle;
	__u32 resv[5];
	struct io_sqring_offsets sq_off;
	struct io_cqring_offsets cq_off;
};

#define IORING_REGISTER_FILES		2

#define __NR_sys_io_uring_setup		425
#define __NR_sys_io_uring_register	427

static int io_uring_register_files(int ring_fd, int fd1, int fd2)
{
	__s32 *fds;
	int ret;

	fds = calloc(2, sizeof(__s32));
	fds[0] = fd1;
	fds[1] = fd2;

	ret = syscall(__NR_sys_io_uring_register, ring_fd,
			IORING_REGISTER_FILES, fds, 2);
	free(fds);		/* don't leak the temporary array */
	return ret;
}

static int io_uring_setup(unsigned entries, struct io_uring_params *p)
{
	return syscall(__NR_sys_io_uring_setup, entries, p);
}

static int get_ring_fd(void)
{
	struct io_uring_params p;
	int fd;

	memset(&p, 0, sizeof(p));

	fd = io_uring_setup(2, &p);
	if (fd < 0) {
		perror("io_uring_setup");
		return -1;
	}

	return fd;
}

static void send_fd(int socket, int fd)
{
	char buf[CMSG_SPACE(sizeof(fd))];
	struct cmsghdr *cmsg;
	struct msghdr msg;

	memset(buf, 0, sizeof(buf));
	memset(&msg, 0, sizeof(msg));

	msg.msg_control = buf;
	msg.msg_controllen = sizeof(buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(fd));

	memmove(CMSG_DATA(cmsg), &fd, sizeof(fd));

	msg.msg_controllen = CMSG_SPACE(sizeof(fd));

	if (sendmsg(socket, &msg, 0) < 0)
		perror("sendmsg");
}

static int recv_fd(int socket)
{
	struct msghdr msg;
	char c_buffer[256];
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));

	msg.msg_control = c_buffer;
	msg.msg_controllen = sizeof(c_buffer);

	if (recvmsg(socket, &msg, 0) < 0)
		perror("recvmsg\n");

	cmsg = CMSG_FIRSTHDR(&msg);
	memmove(&fd, CMSG_DATA(cmsg), sizeof(fd));
	return fd;
}

int main(int argc, char *argv[])
{
	int sp[2], pid, ring_fd, ret;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sp) != 0) {
		perror("Failed to create Unix-domain socket pair\n");
		return 1;
	}

	ring_fd = get_ring_fd();
	if (ring_fd < 0)
		return 1;

	ret = io_uring_register_files(ring_fd, sp[0], sp[1]);
	if (ret < 0) {
		perror("register files");
		return 1;
	}

	pid = fork();
	if (pid) {
		printf("Sending fd %d\n", ring_fd);

		send_fd(sp[0], ring_fd);
	} else {
		int fd;

		fd = recv_fd(sp[1]);
		printf("Got fd %d\n", fd);
		close(fd);
	}

	usleep(500000);
	close(ring_fd);
	close(sp[0]);
	close(sp[1]);
	return 0;
}

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 15:27                           ` Miklos Szeredi
@ 2019-02-07 16:26                             ` Al Viro
  2019-02-07 19:08                               ` Miklos Szeredi
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-07 16:26 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, linux-fsdevel

On Thu, Feb 07, 2019 at 04:27:13PM +0100, Miklos Szeredi wrote:
> On Thu, Feb 7, 2019 at 4:20 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Feb 07, 2019 at 03:20:06PM +0100, Miklos Szeredi wrote:
> >
> > > > Am I right assuming that this queue-modifying operation is accept(), removing
> > > > an embryo unix_sock from the queue of listener and thus hiding SCM_RIGHTS in
> > > > _its_ queue from scan_children()?
> > >
> > > Hmm... How about just receiving an SCM_RIGHTS socket (which was a
> > > candidate) from the queue of the peeked socket?
> >
> > Right, skb unlinked before unix_detach_fds().  I was actually thinking of a stream
> > case, where unlink is done after that...
> >
> > *grumble*
> >
> > The entire thing is far too brittle for my taste ;-/
> 
> If it gets used as part of io_uring, I guess it's worth a fresh look.
> I wrote it without basically any experience with either networking or
> garbage collecting, so no wonder it has rough edges.

It had plenty of those edges before your changes as well - I'm not blaming you
for that mess, in case that's not obvious from what I'd written.

I'm trying to put together some formal description of what's going on in there.
Another question, BTW: updates of user->unix_inflight would seem to be movable
into the callers of unix_{not,}inflight().  Any objections against lifting
it into unix_{attach,detach}_fds()?  We do, after all, have fp->count right
there, so what's the point incrementing/decrementing the sucker one-by-one?
_And_ we are checking it right there (in too_many_unix_fds() called from
unix_attach_fds())...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 16:14                     ` Jens Axboe
@ 2019-02-07 16:30                       ` Al Viro
  2019-02-07 16:35                         ` Jens Axboe
  2019-02-07 16:51                         ` Al Viro
  0 siblings, 2 replies; 76+ messages in thread
From: Al Viro @ 2019-02-07 16:30 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Thu, Feb 07, 2019 at 09:14:41AM -0700, Jens Axboe wrote:

> I created a small app to do just that, and ran it and verified that
> ->release() is called and the io_uring is released as expected. This
> is run on the current -git branch, which has a socket backing for
> the io_uring fd itself, but not for the registered files.
> 
> What am I missing here? Attaching the program as a reference.

> int main(int argc, char *argv[])
> {
> 	int sp[2], pid, ring_fd, ret;
> 
> 	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sp) != 0) {
> 		perror("Failed to create Unix-domain socket pair\n");
> 		return 1;
> 	}
> 
> 	ring_fd = get_ring_fd();
> 	if (ring_fd < 0)
> 		return 1;
> 
> 	ret = io_uring_register_files(ring_fd, sp[0], sp[1]);
> 	if (ret < 0) {
> 		perror("register files");
> 		return 1;
> 	}
> 
> 	pid = fork();
> 	if (pid) {
> 		printf("Sending fd %d\n", ring_fd);
> 
> 		send_fd(sp[0], ring_fd);
> 	} else {
> 		int fd;
> 
> 		fd = recv_fd(sp[1]);

Well, yes - once you receive it, you obviously have no references
sitting in SCM_RIGHTS anymore.

Get rid of recv_fd() there (along with fork(), while we are at it - what's
it for?) and just do send_fd + these 3 close (or just exit, for that matter).

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 16:30                       ` Al Viro
@ 2019-02-07 16:35                         ` Jens Axboe
  2019-02-07 16:51                         ` Al Viro
  1 sibling, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 16:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On 2/7/19 9:30 AM, Al Viro wrote:
> On Thu, Feb 07, 2019 at 09:14:41AM -0700, Jens Axboe wrote:
> 
>> I created a small app to do just that, and ran it and verified that
>> ->release() is called and the io_uring is released as expected. This
>> is run on the current -git branch, which has a socket backing for
>> the io_uring fd itself, but not for the registered files.
>>
>> What am I missing here? Attaching the program as a reference.
> 
>> int main(int argc, char *argv[])
>> {
>> 	int sp[2], pid, ring_fd, ret;
>>
>> 	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sp) != 0) {
>> 		perror("Failed to create Unix-domain socket pair\n");
>> 		return 1;
>> 	}
>>
>> 	ring_fd = get_ring_fd();
>> 	if (ring_fd < 0)
>> 		return 1;
>>
>> 	ret = io_uring_register_files(ring_fd, sp[0], sp[1]);
>> 	if (ret < 0) {
>> 		perror("register files");
>> 		return 1;
>> 	}
>>
>> 	pid = fork();
>> 	if (pid) {
>> 		printf("Sending fd %d\n", ring_fd);
>>
>> 		send_fd(sp[0], ring_fd);
>> 	} else {
>> 		int fd;
>>
>> 		fd = recv_fd(sp[1]);
> 
> Well, yes - once you receive it, you obviously have no references
> sitting in SCM_RIGHTS anymore.
> 
> Get rid of recv_fd() there (along with fork(), while we are at it - what's
> it for?) and just do send_fd + these 3 close (or just exit, for that matter).

Ah got it, yes you are right, that does leak.

Thanks for the other (very) detailed note, I'll add a socket backing for the
registered files. I'll respond to the other details in there a bit later.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 16:30                       ` Al Viro
  2019-02-07 16:35                         ` Jens Axboe
@ 2019-02-07 16:51                         ` Al Viro
  1 sibling, 0 replies; 76+ messages in thread
From: Al Viro @ 2019-02-07 16:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On Thu, Feb 07, 2019 at 04:30:47PM +0000, Al Viro wrote:
> Well, yes - once you receive it, you obviously have no references
> sitting in SCM_RIGHTS anymore.
> 
> Get rid of recv_fd() there (along with fork(), while we are at it - what's
> it for?) and just do send_fd + these 3 close (or just exit, for that matter).

If you pardon a bad ASCII graphics,

ring_fd            sv[0]             sv[1]                 (descriptors)
   |                |                  |
   V                V                  V
[io_uring]        [socket1]          [socket2]             (struct file)
 ^ |               ^  |                ^  |
 | |               |  |                |  |
 | \_______________/__|________________/  |                (after registering)
 |                    |                   |
 |                    V                   V
 |                 [unix_sock1]<---->[unix_sock2]          (struct unix_sock)
 |                                        |
 |                                        V
 \----------------------------------[SCM_RIGHTS]           (queue contents)

References from io_uring to other two struct file are added when you
register these suckers.  Reference from SCM_RIGHTS appears when you
do send_fd().  Now, each file has two references to it.  And
if you close all 3 descriptors (either explicitly, or by exiting)
you will be left with this graph:

[io_uring]------------\-------------------\
 ^                    |                   |
 |                    V                   V
 |                [socket1]          [socket2]
 |                    |                   |
 |                    V                   V
 |                 [unix_sock1]<---->[unix_sock2]
 |                                        |
 |                                        V
 \----------------------------------[SCM_RIGHTS]

All struct file still have references, so they are all still alive,
->release() isn't called on any of them.  And the entire thing
is obviously unreachable from the rest of data structures.

Of course recvmsg() would've removed the loop.  The point is, with
that situation you *can't* get it called - you'd need to reach
socket2 to do that and you can't do that anymore.
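
IOW, with your helpers the reproducer is simply this (sketch - no fork(),
no recvmsg()):

int main(int argc, char *argv[])
{
        int sp[2], ring_fd;

        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sp) != 0) {
                perror("socketpair");
                return 1;
        }

        ring_fd = get_ring_fd();
        if (ring_fd < 0)
                return 1;

        /* io_uring now holds a reference to each socket */
        if (io_uring_register_files(ring_fd, sp[0], sp[1]) < 0) {
                perror("register files");
                return 1;
        }

        /* park a reference to the ring in socket2's queue... */
        send_fd(sp[0], ring_fd);

        /* ...and never receive it; closing everything leaves the cycle */
        close(ring_fd);
        close(sp[0]);
        close(sp[1]);
        return 0;
}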

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07  4:00                 ` Al Viro
  2019-02-07  9:22                   ` Miklos Szeredi
@ 2019-02-07 18:45                   ` Jens Axboe
  2019-02-07 18:58                     ` Jens Axboe
  2019-02-11 15:55                     ` Jonathan Corbet
  1 sibling, 2 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 18:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On 2/6/19 9:00 PM, Al Viro wrote:
> On Wed, Feb 06, 2019 at 06:41:00AM -0700, Jens Axboe wrote:
>> On 2/5/19 5:56 PM, Al Viro wrote:
>>> On Tue, Feb 05, 2019 at 12:08:25PM -0700, Jens Axboe wrote:
>>>> Proof is in the pudding, here's the main commit introducing io_uring
>>>> and now wiring it up to the AF_UNIX garbage collection:
>>>>
>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
>>>>
>>>> How does that look?
>>>
>>> In a word - wrong.  Some theory: garbage collector assumes that there is
>>> a subset of file references such that
>>> 	* for all files with such references there's an associated unix_sock.
>>> 	* all such references are stored in SCM_RIGHTS datagrams that can be
>>> found by the garbage collector (currently: for data-bearing AF_UNIX sockets -
>>> queued SCM_RIGHTS datagrams, for listeners - SCM_RIGHTS datagrams sent via
>>> yet-to-be-accepted connections).
>>> 	* there is an efficient way to count those references for given file
>>> (->inflight of the corresponding unix_sock).
>>> 	* removal of those references would render the graph acyclic.
>>> 	* file can _NOT_ be subject to syscalls unless there are references
>>> to it outside of that subset.
>>
>> IOW, we cannot use fget() for registering files, and we still need fget/fput
>> in the fast path to retain safe use of the file. If I'm understanding you
>> correctly?
> 
> No.  *ALL* references (inflight and not) are the same for file->f_count.
> unix_inflight() does not grab a new reference to file; it only says that
> reference passed to it by the caller is now an in-flight one.
> 
> OK, braindump time:

[snip]

This is great info, and I think it belongs in Documentation/ somewhere.
Not sure I've ever seen such a good and detailed dump of this before.

> What you are about to add is *ANOTHER* kind of loops - references
> to files in the "registered" set are pinned down by owning io_uring.
> 
> That would invalidate just about every assumption made the garbage
> collector - even if you forbid to register io_uring itself, you
> still can register both ends of AF_UNIX socket pair, then pass
> io_uring in SCM_RIGHTS over that, then close all descriptors involved.
> From the garbage collector point of view all sockets have external
> references, so there's nothing to collect.  In fact those external
> references are only reachable if you have a reachable reference
> to io_uring, so we get a leak.
> 
> To make it work:
> 	* have unix_sock created for each io_uring (as your code does)
> 	* do *NOT* have unix_inflight() done at that point - it's
> completely wrong there.
> 	* file set registration becomes
> 		* create and populate SCM_RIGHTS, with the same
> fget()+grab an extra reference + unix_inflight() sequence.
> Don't forget to have skb->destructor set to unix_destruct_scm
> or equivalent thereof.
> 		* remember UNIXCB(skb).fp - that'll give you your
> array of struct file *, to use in lookups.
> 		* queue it into your unix_sock
> 		* do _one_ fput() for everything you've grabbed,
> dropping one of two references you've taken.
> 	* unregistering is simply skb_dequeue() + kfree_skb().
> 	* in ->release() you do sock_release(); it'll do
> everything you need (including unregistering the set, etc.)

This is genius! I implemented this and it works. I've verified that the
previous test app failed to release due to the loop, and with this in
place, once the GC kicks in, the io_uring is released appropriately.
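
For anyone following along, your recipe translates to roughly this (sketch
only, not the actual diff - the helper name is made up, the ctx field names
are assumed, and the whole set is crammed into one SCM_MAX_FD-sized
scm_fp_list):

static int io_sqe_files_scm(struct io_ring_ctx *ctx)
{
        struct sock *sk = ctx->ring_sock->sk;
        struct scm_fp_list *fpl;
        struct sk_buff *skb;
        int i;

        fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
        if (!fpl)
                return -ENOMEM;
        skb = alloc_skb(0, GFP_KERNEL);
        if (!skb) {
                kfree(fpl);
                return -ENOMEM;
        }

        skb->sk = sk;
        skb->destructor = unix_destruct_scm;

        fpl->user = get_uid(ctx->user);
        for (i = 0; i < ctx->nr_user_files; i++) {
                /* second reference per file goes into the SCM_RIGHTS payload */
                fpl->fp[i] = get_file(ctx->user_files[i]);
                unix_inflight(fpl->user, fpl->fp[i]);
        }
        fpl->max = fpl->count = ctx->nr_user_files;

        UNIXCB(skb).fp = fpl;
        refcount_add(skb->truesize, &sk->sk_wmem_alloc);
        skb_queue_head(&sk->sk_receive_queue, skb);

        /* drop one of the two references taken; the skb owns the survivor */
        for (i = 0; i < ctx->nr_user_files; i++)
                fput(fpl->fp[i]);

        return 0;
}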

> The hairiest part is the file set registration, of course -
> there's almost certainly a helper or two buried in that thing;
> simply exposing all the required net/unix/af_unix.c bits is
> ucking fugly.

Outside of the modification to unix_get_socket(), the only change I had
to make was to ensure that unix_destruct_scm() is available to io_uring.
No other changes needed.

> I'm not sure what you propose for non-registered descriptors -
> _any_ struct file reference that outlives the return from syscall
> stored in some io_uring-attached data structure is has exact same
> loop (and leak) problem.  And if you mean to have it dropped before
> return from syscall, I'm afraid I don't follow you.  How would
> that be done?
> 
> Again, "io_uring descriptor can't be used in those requests" does
> not help at all - use a socket instead, pass the io_uring fd over
> it in SCM_RIGHTS and you are back to square 1.

I wasn't proposing to fput() before return, otherwise I can't hang on to
that file *.

Right now for async punt, we don't release the reference, and then we
fput() when IO completes. According to what you're saying here, that's
not good enough. Correct me if I'm wrong, but what if we:

1) For non-sock/io_uring fds, the current approach is sufficient
2) Disallow io_uring fd, doesn't make sense anyway

That leaves the socket fd, which is problematic. Should be solvable by
allocating an skb and marking that file inflight?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 18:45                   ` Jens Axboe
@ 2019-02-07 18:58                     ` Jens Axboe
  2019-02-11 15:55                     ` Jonathan Corbet
  1 sibling, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 18:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Jann Horn, linux-aio, linux-block, Linux API, hch, jmoyer, avi,
	linux-fsdevel

On 2/7/19 11:45 AM, Jens Axboe wrote:
> On 2/6/19 9:00 PM, Al Viro wrote:
>> On Wed, Feb 06, 2019 at 06:41:00AM -0700, Jens Axboe wrote:
>>> On 2/5/19 5:56 PM, Al Viro wrote:
>>>> On Tue, Feb 05, 2019 at 12:08:25PM -0700, Jens Axboe wrote:
>>>>> Proof is in the pudding, here's the main commit introducing io_uring
>>>>> and now wiring it up to the AF_UNIX garbage collection:
>>>>>
>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=158e6f42b67d0abe9ee84886b96ca8c4b3d3dfd5
>>>>>
>>>>> How does that look?
>>>>
>>>> In a word - wrong.  Some theory: garbage collector assumes that there is
>>>> a subset of file references such that
>>>> 	* for all files with such references there's an associated unix_sock.
>>>> 	* all such references are stored in SCM_RIGHTS datagrams that can be
>>>> found by the garbage collector (currently: for data-bearing AF_UNIX sockets -
>>>> queued SCM_RIGHTS datagrams, for listeners - SCM_RIGHTS datagrams sent via
>>>> yet-to-be-accepted connections).
>>>> 	* there is an efficient way to count those references for given file
>>>> (->inflight of the corresponding unix_sock).
>>>> 	* removal of those references would render the graph acyclic.
>>>> 	* file can _NOT_ be subject to syscalls unless there are references
>>>> to it outside of that subset.
>>>
>>> IOW, we cannot use fget() for registering files, and we still need fget/fput
>>> in the fast path to retain safe use of the file. If I'm understanding you
>>> correctly?
>>
>> No.  *ALL* references (inflight and not) are the same for file->f_count.
>> unix_inflight() does not grab a new reference to file; it only says that
>> reference passed to it by the caller is now an in-flight one.
>>
>> OK, braindump time:
> 
> [snip]
> 
> This is great info, and I think it belongs in Documentation/ somewhere.
> Not sure I've ever seen such a good and detailed dump of this before.
> 
>> What you are about to add is *ANOTHER* kind of loops - references
>> to files in the "registered" set are pinned down by owning io_uring.
>>
>> That would invalidate just about every assumption made the garbage
>> collector - even if you forbid to register io_uring itself, you
>> still can register both ends of AF_UNIX socket pair, then pass
>> io_uring in SCM_RIGHTS over that, then close all descriptors involved.
>> From the garbage collector point of view all sockets have external
>> references, so there's nothing to collect.  In fact those external
>> references are only reachable if you have a reachable reference
>> to io_uring, so we get a leak.
>>
>> To make it work:
>> 	* have unix_sock created for each io_uring (as your code does)
>> 	* do *NOT* have unix_inflight() done at that point - it's
>> completely wrong there.
>> 	* file set registration becomes
>> 		* create and populate SCM_RIGHTS, with the same
>> fget()+grab an extra reference + unix_inflight() sequence.
>> Don't forget to have skb->destructor set to unix_destruct_scm
>> or equivalent thereof.
>> 		* remember UNIXCB(skb).fp - that'll give you your
>> array of struct file *, to use in lookups.
>> 		* queue it into your unix_sock
>> 		* do _one_ fput() for everything you've grabbed,
>> dropping one of two references you've taken.
>> 	* unregistering is simply skb_dequeue() + kfree_skb().
>> 	* in ->release() you do sock_release(); it'll do
>> everything you need (including unregistering the set, etc.)
> 
> This is genius! I implemented this and it works. I've verified that the
> previous test app failed to release due to the loop, and with this in
> place, once the GC kicks in, the io_uring is released appropriately.
> 
>> The hairiest part is the file set registration, of course -
>> there's almost certainly a helper or two buried in that thing;
>> simply exposing all the required net/unix/af_unix.c bits is
>> ucking fugly.
> 
> Outside of the modification to unix_get_socket(), the only change I had
> to make was to ensure that unix_destruct_scm() is available to io_uring.
> No other changes needed.
> 
>> I'm not sure what you propose for non-registered descriptors -
>> _any_ struct file reference that outlives the return from syscall
>> stored in some io_uring-attached data structure is has exact same
>> loop (and leak) problem.  And if you mean to have it dropped before
>> return from syscall, I'm afraid I don't follow you.  How would
>> that be done?
>>
>> Again, "io_uring descriptor can't be used in those requests" does
>> not help at all - use a socket instead, pass the io_uring fd over
>> it in SCM_RIGHTS and you are back to square 1.
> 
> I wasn't proposing to fput() before return, otherwise I can't hang on to
> that file *.
> 
> Right now for async punt, we don't release the reference, and then we
> fput() when IO completes. According to what you're saying here, that's
> not good enough. Correct me if I'm wrong, but what if we:
> 
> 1) For non-sock/io_uring fds, the current approach is sufficient
> 2) Disallow io_uring fd, doesn't make sense anyway
> 
> That leaves the socket fd, which is problematic. Should be solvable by
> allocating an skb and marking that file inflight?

Actually, we can just NOT set NOWAIT for types we don't support. That
means we'll never punt to async context for those.
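
Roughly along these lines on the prep side (sketch, helper name made up):

static bool io_file_supports_nowait(struct file *file)
{
        umode_t mode = file_inode(file)->i_mode;

        /*
         * Only these get IOCB_NOWAIT and hence the -EAGAIN async punt;
         * everything else (sockets included) is executed inline, so no
         * struct file reference outlives the syscall.
         */
        return S_ISBLK(mode) || S_ISREG(mode);
}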

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 16:26                             ` Al Viro
@ 2019-02-07 19:08                               ` Miklos Szeredi
  0 siblings, 0 replies; 76+ messages in thread
From: Miklos Szeredi @ 2019-02-07 19:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, linux-fsdevel

On Thu, Feb 7, 2019 at 5:26 PM Al Viro <viro@zeniv.linux.org.uk> wrote:

> I'm trying to put together some formal description of what's going on in there.
> Another question, BTW: updates of user->unix_inflight would seem to be movable
> into the callers of unix_{not,}inflight().  Any objections against lifting
> it into unix_{attach,detach}_fds()?  We do, after all, have fp->count right
> there, so what's the point incrementing/decrementing the sucker one-by-one?
> _And_ we are checking it right there (in too_many_unix_fds() called from
> unix_attach_fds())...

I see no issues with that.

Also shouldn't the rlimit check be made against user->unix_inflight +
fp->count?  Although I'm not quite following if fp->user can end up
different from current_user() and what should happen in that case...
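
I.e. something along these lines, assuming too_many_unix_fds() grows a
count argument (sketch, untested):

static inline bool too_many_unix_fds(struct task_struct *p, unsigned int count)
{
        struct user_struct *user = current_user();

        /* check what the total would become with this whole batch attached */
        if (unlikely(user->unix_inflight + count > task_rlimit(p, RLIMIT_NOFILE)))
                return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
        return false;
}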

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-07 18:45                   ` Jens Axboe
  2019-02-07 18:58                     ` Jens Axboe
@ 2019-02-11 15:55                     ` Jonathan Corbet
  2019-02-11 17:35                       ` Al Viro
  1 sibling, 1 reply; 76+ messages in thread
From: Jonathan Corbet @ 2019-02-11 15:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Al Viro, Jann Horn, linux-aio, linux-block, Linux API, hch,
	jmoyer, avi, linux-fsdevel

On Thu, 7 Feb 2019 11:45:40 -0700
Jens Axboe <axboe@kernel.dk> wrote:

> > OK, braindump time:  
> 
> [snip]
> 
> This is great info, and I think it belongs in Documentation/ somewhere.
> Not sure I've ever seen such a good and detailed dump of this before.

I suspect I might be able to make something like that happen :)  Stay
tuned.

Thanks,

jon

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-11 15:55                     ` Jonathan Corbet
@ 2019-02-11 17:35                       ` Al Viro
  2019-02-11 20:33                         ` Jonathan Corbet
  0 siblings, 1 reply; 76+ messages in thread
From: Al Viro @ 2019-02-11 17:35 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API, hch,
	jmoyer, avi, linux-fsdevel

On Mon, Feb 11, 2019 at 08:55:33AM -0700, Jonathan Corbet wrote:
> On Thu, 7 Feb 2019 11:45:40 -0700
> Jens Axboe <axboe@kernel.dk> wrote:
> 
> > > OK, braindump time:  
> > 
> > [snip]
> > 
> > This is great info, and I think it belongs in Documentation/ somewhere.
> > Not sure I've ever seen such a good and detailed dump of this before.
> 
> I suspect I might be able to make something like that happen :)  Stay
> tuned.

There are several typos I see in there (e.g. in

        * struct file instances A and B being AF_UNIX sockets.
        * A is a listener
        * B is an established connection, with the other end
yet to be accepted on A
        * the only references to A and B are in an SCM_RIGHTS
datagram sent over by A.

the last line should be "datagram sent over by B", of course - you can't
send anything over a listener socket, to start with).

Another thing is this:

        * references cannot be extracted from SCM_RIGHTS datagrams
while the garbage collector is running (achieved by having
unix_notinflight() done before references out of SCM_RIGHTS)
        * removal of SCM_RIGHTS associated with a socket can't
be done without a reference to that socket _outside_ of any
SCM_RIGHTS (automatically true).

That's worse than a typo - that's an actual bug (see the subthread
with Miklos).  Correct version would be
	* any references extracted from SCM_RIGHTS during the
garbage collector run will not be actually used until the end
of garbage collection.  For normal recvmsg() it is guaranteed
by having unix_notinflight() called between the extraction of
scm_fp_list from the packet and doing anything else with the
references extracted.  For MSG_PEEK recvmsg() it's actually
broken and lacks synchronization; Miklos has proposed to grab
and release unix_gc_lock in those, between scm_fp_dup() and
doing anything else with the references copied.
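
In the PEEK paths that would be something like this (sketch, assuming
unix_gc_lock is made visible there one way or another):

        scm.fp = scm_fp_dup(UNIXCB(skb).fp);
        if (scm.fp) {
                /*
                 * Don't let the copied references be used until a garbage
                 * collector run that might be under way has finished.
                 */
                spin_lock(&unix_gc_lock);
                spin_unlock(&unix_gc_lock);
        }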

If you turn that thing into a coherent text, I'd appreciate a chance
to take a look at the result and see if anything else needs to be
corrected...

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 13/18] io_uring: add file set registration
  2019-02-11 17:35                       ` Al Viro
@ 2019-02-11 20:33                         ` Jonathan Corbet
  0 siblings, 0 replies; 76+ messages in thread
From: Jonathan Corbet @ 2019-02-11 20:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jann Horn, linux-aio, linux-block, Linux API, hch,
	jmoyer, avi, linux-fsdevel

On Mon, 11 Feb 2019 17:35:21 +0000
Al Viro <viro@zeniv.linux.org.uk> wrote:

> If you turn that thing into a coherent text, I'd appreciate a chance
> to take a look at the result and see if anything else needs to be
> corrected...

Thanks for the updated info!

I'm likely to do a watered-down version for some disreputable web site
first; after that, I'll try to do it properly for Documentation/.  You'll
certainly see the result in your inbox once it's ready to be looked at.

Thanks,

jon

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-02-07 22:38   ` Jeff Moyer
@ 2019-02-07 22:47     ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 22:47 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-aio, linux-block, linux-api, hch, avi, jannh, viro

On 2/7/19 3:38 PM, Jeff Moyer wrote:
> Hi, Jens,
> 
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> For now, buffers must not be file backed. If file backed buffers are
>> passed in, the registration will fail with -1/EOPNOTSUPP. This
>> restriction may be relaxed in the future.
> 
> [...]
> 
>> +		down_write(&current->mm->mmap_sem);
>> +		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
>> +						pages, vmas);
>> +		if (pret == nr_pages) {
>> +			/* don't support file backed memory */
>> +			for (j = 0; j < nr_pages; j++) {
>> +				struct vm_area_struct *vma = vmas[j];
>> +
>> +				if (vma->vm_file) {
>> +					ret = -EOPNOTSUPP;
>> +					break;
>> +				}
>> +			}
> 
> Unfortunately, this suffers the same problem as FOLL_ANON.  Huge pages
> are backed by hugetlbfs, and vma->vm_file will be filled in.
> 
> I guess you could check is_file_hugepages(vma->vm_file):
> 
>         if (vma->vm_file &&
>             !is_file_hugepages(vma->vm_file)) {
>                 ret = -EOPNOTSUPP;
>                 break;
>        }
> 
> That works for me.

Thanks, that looks better. Fixed!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-02-07 19:55 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
  2019-02-07 20:57   ` Jeff Moyer
@ 2019-02-07 22:38   ` Jeff Moyer
  2019-02-07 22:47     ` Jens Axboe
  1 sibling, 1 reply; 76+ messages in thread
From: Jeff Moyer @ 2019-02-07 22:38 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, linux-api, hch, avi, jannh, viro

Hi, Jens,

Jens Axboe <axboe@kernel.dk> writes:

> For now, buffers must not be file backed. If file backed buffers are
> passed in, the registration will fail with -1/EOPNOTSUPP. This
> restriction may be relaxed in the future.

[...]

> +		down_write(&current->mm->mmap_sem);
> +		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
> +						pages, vmas);
> +		if (pret == nr_pages) {
> +			/* don't support file backed memory */
> +			for (j = 0; j < nr_pages; j++) {
> +				struct vm_area_struct *vma = vmas[j];
> +
> +				if (vma->vm_file) {
> +					ret = -EOPNOTSUPP;
> +					break;
> +				}
> +			}

Unfortunately, this suffers the same problem as FOLL_ANON.  Huge pages
are backed by hugetlbfs, and vma->vm_file will be filled in.

I guess you could check is_file_hugepages(vma->vm_file):

        if (vma->vm_file &&
            !is_file_hugepages(vma->vm_file)) {
                ret = -EOPNOTSUPP;
                break;
       }

That works for me.

-Jeff

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-02-07 20:57   ` Jeff Moyer
@ 2019-02-07 21:02     ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 21:02 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-aio, linux-block, linux-api, hch, avi, jannh, viro

On 2/7/19 1:57 PM, Jeff Moyer wrote:
> Hi, Jens,
> 
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
>> +{
>> +	int i, j;
>> +
>> +	if (!ctx->user_bufs)
>> +		return -ENXIO;
>> +
>> +	for (i = 0; i < ctx->sq_entries; i++) {
>> +		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +
>> +		for (j = 0; j < imu->nr_bvecs; j++)
>> +			put_page(imu->bvec[j].bv_page);
>> +
>> +		io_unaccount_mem(ctx->user, imu->nr_bvecs);
>> +		kfree(imu->bvec);
>> +		imu->nr_bvecs = 0;
>> +	}
>> +
>> +	kfree(ctx->user_bufs);
>> +	ctx->user_bufs = NULL;
>> +	free_uid(ctx->user);
>         ^^^^^^^^^^^^^^^^^^^
>> +	ctx->user = NULL;
>         ^^^^^^^^^^^^^^^^^
> 
> I don't think you want to do that here.  If you do an
> IORING_REGISTER_BUFFERS, followed by IORING_UNREGISTER_BUFFERS, and then
> follow that up with IORING_REGISTER_FILES, you'll get a null pointer
> dereference trying to bump the reference count of the (now NULL)
> ctx->user (io_uring.c:1944):
> 
> [  216.927990] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> [  216.935825] #PF error: [WRITE]
> [  216.938883] PGD 5f39244067 P4D 5f39244067 PUD 5f043ca067 PMD 0 
> [  216.944803] Oops: 0002 [#1] SMP
> [  216.947949] CPU: 79 PID: 3371 Comm: io_uring_regist Not tainted 5.0.0-rc5.io_uring.4+ #26
> [  216.956119] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0108.091420182119 09/14/2018
> [  216.966553] RIP: 0010:__io_uring_register+0x1c2/0x7c0
> [  216.971606] Code: 49 89 c6 48 85 c0 0f 84 9b 05 00 00 48 8b 83 20 02 00 00 48 8b 40 20 49 c7 46 60 60 89 1d 96 49 89 46 18 48 8b 83 18 01 00 00 <f0> ff 00 0f 88 1a a0 52 00 45 31 e4 66 83 7d 00 00 48 89 45 08 7e
> [  216.990355] RSP: 0018:ffffb296087e3e70 EFLAGS: 00010286
> [  216.995578] RAX: 0000000000000000 RBX: ffff9aacbbff3800 RCX: 0000000000000000
> [  217.002711] RDX: ffff9aacbbaf1ac0 RSI: 00000000ffffffff RDI: ffff9aacb9a8f6b0
> [  217.009842] RBP: ffff9aacbb45e800 R08: 00000000000000c0 R09: ffff9a4e87c07000
> [  217.016977] R10: 0000000000000006 R11: ffff9aac97da9b00 R12: 00007efdc3dbd1fc
> [  217.024107] R13: ffff9aacbb45ec08 R14: ffff9aacb9a8f600 R15: ffff9aac97da9a00
> [  217.031241] FS:  00007f01c439e500(0000) GS:ffff9aacbf7c0000(0000) knlGS:0000000000000000
> [  217.039326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  217.045075] CR2: 0000000000000000 CR3: 0000005f08d85002 CR4: 00000000007606e0
> [  217.052207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  217.059340] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  217.066472] PKRU: 55555554
> [  217.069183] Call Trace:
> [  217.071638]  __x64_sys_io_uring_register+0x91/0xb0
> [  217.076433]  do_syscall_64+0x4f/0x190
> [  217.080110]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  217.085167] RIP: 0033:0x7f01c3eb42bd
> [  217.088743] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 6b 2c 00 f7 d8 64 89 01 48
> 
> I'd expect ctx->user to live as long as the io_uring context itself,
> right?

Yes, it used to be used just for the buffers, but now we use it generally. I've
fixed that up, thanks Jeff!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-02-07 19:55 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
@ 2019-02-07 20:57   ` Jeff Moyer
  2019-02-07 21:02     ` Jens Axboe
  2019-02-07 22:38   ` Jeff Moyer
  1 sibling, 1 reply; 76+ messages in thread
From: Jeff Moyer @ 2019-02-07 20:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-aio, linux-block, linux-api, hch, avi, jannh, viro

Hi, Jens,

Jens Axboe <axboe@kernel.dk> writes:

> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
> +{
> +	int i, j;
> +
> +	if (!ctx->user_bufs)
> +		return -ENXIO;
> +
> +	for (i = 0; i < ctx->sq_entries; i++) {
> +		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> +
> +		for (j = 0; j < imu->nr_bvecs; j++)
> +			put_page(imu->bvec[j].bv_page);
> +
> +		io_unaccount_mem(ctx->user, imu->nr_bvecs);
> +		kfree(imu->bvec);
> +		imu->nr_bvecs = 0;
> +	}
> +
> +	kfree(ctx->user_bufs);
> +	ctx->user_bufs = NULL;
> +	free_uid(ctx->user);
        ^^^^^^^^^^^^^^^^^^^
> +	ctx->user = NULL;
        ^^^^^^^^^^^^^^^^^

I don't think you want to do that here.  If you do an
IORING_REGISTER_BUFFERS, followed by IORING_UNREGISTER_BUFFERS, and then
follow that up with IORING_REGISTER_FILES, you'll get a null pointer
dereference trying to bump the reference count of the (now NULL)
ctx->user (io_uring.c:1944):

[  216.927990] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  216.935825] #PF error: [WRITE]
[  216.938883] PGD 5f39244067 P4D 5f39244067 PUD 5f043ca067 PMD 0 
[  216.944803] Oops: 0002 [#1] SMP
[  216.947949] CPU: 79 PID: 3371 Comm: io_uring_regist Not tainted 5.0.0-rc5.io_uring.4+ #26
[  216.956119] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0108.091420182119 09/14/2018
[  216.966553] RIP: 0010:__io_uring_register+0x1c2/0x7c0
[  216.971606] Code: 49 89 c6 48 85 c0 0f 84 9b 05 00 00 48 8b 83 20 02 00 00 48 8b 40 20 49 c7 46 60 60 89 1d 96 49 89 46 18 48 8b 83 18 01 00 00 <f0> ff 00 0f 88 1a a0 52 00 45 31 e4 66 83 7d 00 00 48 89 45 08 7e
[  216.990355] RSP: 0018:ffffb296087e3e70 EFLAGS: 00010286
[  216.995578] RAX: 0000000000000000 RBX: ffff9aacbbff3800 RCX: 0000000000000000
[  217.002711] RDX: ffff9aacbbaf1ac0 RSI: 00000000ffffffff RDI: ffff9aacb9a8f6b0
[  217.009842] RBP: ffff9aacbb45e800 R08: 00000000000000c0 R09: ffff9a4e87c07000
[  217.016977] R10: 0000000000000006 R11: ffff9aac97da9b00 R12: 00007efdc3dbd1fc
[  217.024107] R13: ffff9aacbb45ec08 R14: ffff9aacb9a8f600 R15: ffff9aac97da9a00
[  217.031241] FS:  00007f01c439e500(0000) GS:ffff9aacbf7c0000(0000) knlGS:0000000000000000
[  217.039326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  217.045075] CR2: 0000000000000000 CR3: 0000005f08d85002 CR4: 00000000007606e0
[  217.052207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  217.059340] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  217.066472] PKRU: 55555554
[  217.069183] Call Trace:
[  217.071638]  __x64_sys_io_uring_register+0x91/0xb0
[  217.076433]  do_syscall_64+0x4f/0x190
[  217.080110]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  217.085167] RIP: 0033:0x7f01c3eb42bd
[  217.088743] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 6b 2c 00 f7 d8 64 89 01 48

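The sequence boils down to just this (sketch, using the raw syscall number
from the test programs earlier in the thread and the register opcodes from
the uapi header; iov points at an anonymous buffer, fds at two valid
descriptors):

	syscall(__NR_sys_io_uring_register, ring_fd,
		IORING_REGISTER_BUFFERS, &iov, 1);
	syscall(__NR_sys_io_uring_register, ring_fd,
		IORING_UNREGISTER_BUFFERS, NULL, 0);	/* frees ctx->user */
	syscall(__NR_sys_io_uring_register, ring_fd,
		IORING_REGISTER_FILES, fds, 2);		/* NULL deref here */
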
I'd expect ctx->user to live as long as the io_uring context itself,
right?

-Jeff

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-02-07 19:55 [PATCHSET v12] io_uring IO interface Jens Axboe
@ 2019-02-07 19:55 ` Jens Axboe
  2019-02-07 20:57   ` Jeff Moyer
  2019-02-07 22:38   ` Jeff Moyer
  0 siblings, 2 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-07 19:55 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having setup an io_uring context, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
to an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point to somewhere inside the indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring context. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring context.

It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
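
In application terms that amounts to something like the following sketch
(ring setup, submission plumbing and error handling elided; buf, buf_len,
file_fd, ring_fd and sqe are placeholders):

	struct iovec iov = {
		.iov_base = buf,	/* anonymous memory, not file backed */
		.iov_len  = buf_len,	/* at most 1G */
	};

	/* pin the buffer once, up front */
	io_uring_register(ring_fd, IORING_REGISTER_BUFFERS, &iov, 1);

	/* per IO: read into (part of) the registered buffer */
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = file_fd;
	sqe->addr = (unsigned long) buf;	/* anywhere inside iov */
	sqe->len = 4096;
	sqe->buf_index = 0;			/* which registered buffer */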

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 356 ++++++++++++++++++++++++-
 include/linux/sched/user.h             |   2 +-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 8 files changed, 363 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1369cb95e1b5..9d6233dc35ca 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -25,6 +25,7 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
@@ -32,6 +33,7 @@
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
+#include <linux/sizes.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -61,6 +63,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -92,6 +101,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -703,6 +716,44 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	size_t len = READ_ONCE(sqe->len);
+	struct io_mapped_ubuf *imu;
+	unsigned index, buf_index;
+	size_t offset;
+	u64 buf_addr;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	buf_addr = READ_ONCE(sqe->addr);
+
+	if (buf_addr + len < buf_addr)
+		return -EFAULT;
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = buf_addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct sqe_submit *s, struct iovec **iovec,
 			   struct iov_iter *iter)
@@ -710,6 +761,15 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	const struct io_uring_sqe *sqe = s->sqe;
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
+	u8 opcode;
+
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
 
 	if (!s->has_user)
 		return EFAULT;
@@ -853,7 +913,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fsync_flags = READ_ONCE(sqe->fsync_flags);
@@ -891,9 +951,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -922,28 +992,47 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	u8 opcode = READ_ONCE(sqe->opcode);
+
+	return !(opcode == IORING_OP_READ_FIXED ||
+		 opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
-	u64 user_data = READ_ONCE(s->sqe->user_data);
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
+	mm_segment_t old_fs;
+	bool needs_user;
+	u64 user_data;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
 	req->flags &= ~REQ_F_FORCE_NONBLOCK;
 	req->rw.ki_flags &= ~IOCB_NOWAIT;
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
-	}
-
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-	s->has_user = true;
+	user_data = READ_ONCE(s->sqe->user_data);
 	s->needs_lock = true;
+	s->has_user = false;
+
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
+		s->has_user = true;
+	}
 
 	do {
 		ret = __io_submit_sqe(ctx, req, s, false, NULL);
@@ -957,9 +1046,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		cond_resched();
 	} while (1);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, user_data, ret, 0);
@@ -1241,6 +1332,188 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->sq_entries; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		io_unaccount_mem(ctx->user, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	free_uid(ctx->user);
+	ctx->user = NULL;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct vm_area_struct **vmas = NULL;
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base || !iov.iov_len)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		ret = io_account_mem(ctx->user, nr_pages);
+		if (ret)
+			goto err;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(vmas);
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			vmas = kmalloc_array(nr_pages,
+					sizeof(struct vma_area_struct *),
+					GFP_KERNEL);
+			if (!pages || !vmas) {
+				ret = -ENOMEM;
+				io_unaccount_mem(ctx->user, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		if (!imu->bvec) {
+			io_unaccount_mem(ctx->user, nr_pages);
+			goto err;
+		}
+
+		down_write(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
+						pages, vmas);
+		if (pret == nr_pages) {
+			/* don't support file backed memory */
+			for (j = 0; j < nr_pages; j++) {
+				struct vm_area_struct *vma = vmas[j];
+
+				if (vma->vm_file) {
+					ret = -EOPNOTSUPP;
+					break;
+				}
+			}
+		} else {
+			ret = pret < 0 ? pret : -EFAULT;
+		}
+		up_write(&current->mm->mmap_sem);
+		if (ret) {
+			/*
+			 * if we did partial map, or found file backed vmas,
+			 * release any pages we did get
+			 */
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			io_unaccount_mem(ctx->user, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	kfree(vmas);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	kfree(vmas);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	if (ctx->sqo_wq)
@@ -1253,6 +1526,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
@@ -1593,6 +1867,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_reinit(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ctx = f.file->private_data;
+
+	mutex_lock(&ctx->uring_lock);
+	ret = __io_uring_register(ctx, opcode, arg, nr_args);
+	mutex_unlock(&ctx->uring_lock);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5c457ea396e6..cf28f7a11f12 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,7 +27,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -39,6 +42,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -103,4 +108,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-02-01 15:23 [PATCHSET v11] io_uring IO interface Jens Axboe
@ 2019-02-01 15:24 ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-02-01 15:24 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having setup an io_uring context, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
to an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point to somewhere inside the indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring context. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring context.

It's perfectly valid to setup a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 345 ++++++++++++++++++++++++-
 include/linux/sched/user.h             |   2 +-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 8 files changed, 352 insertions(+), 17 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c30b416fc5ea..80c788d8c22a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -25,10 +25,12 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
+#include <linux/sizes.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -58,6 +60,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -91,6 +100,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -663,6 +676,44 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	size_t len = READ_ONCE(sqe->len);
+	struct io_mapped_ubuf *imu;
+	unsigned index, buf_index;
+	size_t offset;
+	u64 buf_addr;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	buf_addr = READ_ONCE(sqe->addr);
+
+	if (buf_addr + len < buf_addr)
+		return -EFAULT;
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = buf_addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct sqe_submit *s, struct iovec **iovec,
 			   struct iov_iter *iter)
@@ -670,6 +721,15 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	const struct io_uring_sqe *sqe = s->sqe;
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
+	u8 opcode;
+
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
 
 	if (!s->has_user)
 		return EFAULT;
@@ -813,7 +873,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fsync_flags = READ_ONCE(sqe->fsync_flags);
@@ -851,9 +911,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -882,14 +952,23 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	u8 opcode = READ_ONCE(sqe->opcode);
+
+	return !(opcode == IORING_OP_READ_FIXED ||
+		 opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
-	u64 user_data = READ_ONCE(s->sqe->user_data);
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
 	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	bool needs_user;
+	u64 user_data;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
@@ -900,15 +979,25 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	current->files = ctx->sqo_files;
 	task_unlock(current);
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
-	}
-
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-	s->has_user = true;
+	user_data = READ_ONCE(s->sqe->user_data);
 	s->needs_lock = true;
+	s->has_user = false;
+
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
+		s->has_user = true;
+	}
 
 	do {
 		ret = __io_submit_sqe(ctx, req, s, false, NULL);
@@ -922,9 +1011,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		cond_resched();
 	} while (1);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, user_data, ret, 0);
@@ -1205,6 +1296,14 @@ static void *io_mem_alloc(size_t size)
 	return (void *) __get_free_pages(gfp_flags, get_order(size));
 }
 
+static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		return __io_account_mem(ctx->user, nr_pages);
+
+	return 0;
+}
+
 static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 {
 	struct io_sq_ring *sq_ring;
@@ -1218,6 +1317,169 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->sq_entries; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		io_unaccount_mem(ctx, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	free_uid(ctx->user);
+	ctx->user = NULL;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	if (!capable(CAP_IPC_LOCK))
+		ctx->user = get_uid(current_user());
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base || !iov.iov_len)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		ret = io_account_mem(ctx, nr_pages);
+		if (ret)
+			goto err;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			if (!pages) {
+				io_unaccount_mem(ctx, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		if (!imu->bvec) {
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		down_read(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages,
+						FOLL_WRITE | FOLL_ANON, pages,
+						NULL);
+		up_read(&current->mm->mmap_sem);
+
+		if (pret != nr_pages) {
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			ret = pret < 0 ? pret : -EFAULT;
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	destroy_workqueue(ctx->sqo_wq);
@@ -1225,6 +1487,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	put_files_struct(ctx->sqo_files);
 
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
@@ -1530,6 +1793,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_reinit(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ctx = f.file->private_data;
+
+	mutex_lock(&ctx->uring_lock);
+	ret = __io_uring_register(ctx, opcode, arg, nr_args);
+	mutex_unlock(&ctx->uring_lock);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4952fc921866..16c423d74f2e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,7 +27,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -39,6 +42,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -102,4 +107,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-30 21:55 [PATCHSET v10] io_uring IO interface Jens Axboe
@ 2019-01-30 21:55 ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-30 21:55 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having set up an io_uring context, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
to an iovec array, and nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. The range
sqe->addr..sqe->addr+sqe->len must fall inside the indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring context. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring context.

It's perfectly valid to set up a larger buffer and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per-buffer size limit is also imposed.
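
For clarity, the range check that io_import_fixed() applies to each
fixed-buffer request can be restated standalone as below; only the
arithmetic mirrors the patch, the struct and function names are
illustrative.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fixed_buf {
	uint64_t ubuf;	/* registered iov_base */
	size_t	 len;	/* registered iov_len */
};

/* a fixed IO is accepted iff [addr, addr + len) lies inside the buffer */
static bool fixed_range_ok(const struct fixed_buf *imu,
			   uint64_t addr, size_t len)
{
	if (addr + len < addr)				/* wraparound */
		return false;
	if (addr < imu->ubuf || addr + len > imu->ubuf + imu->len)
		return false;
	return true;
}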

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 343 ++++++++++++++++++++++++-
 include/linux/sched/user.h             |   2 +-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 8 files changed, 352 insertions(+), 15 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3cd3d0720961..30a1e0999c80 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -25,10 +25,12 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
+#include <linux/sizes.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -58,6 +60,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -91,6 +100,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -662,6 +675,44 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	size_t len = READ_ONCE(sqe->len);
+	struct io_mapped_ubuf *imu;
+	unsigned index, buf_index;
+	size_t offset;
+	u64 buf_addr;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	buf_addr = READ_ONCE(sqe->addr);
+
+	if (buf_addr + len < buf_addr)
+		return -EFAULT;
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = buf_addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct sqe_submit *s, struct iovec **iovec,
 			   struct iov_iter *iter)
@@ -669,6 +720,15 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	const struct io_uring_sqe *sqe = s->sqe;
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
+	u8 opcode;
+
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
 
 	if (!s->has_user)
 		return EFAULT;
@@ -812,7 +872,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fsync_flags = READ_ONCE(sqe->fsync_flags);
@@ -850,9 +910,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -875,14 +945,23 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	u8 opcode = READ_ONCE(sqe->opcode);
+
+	return !(opcode == IORING_OP_READ_FIXED ||
+		 opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
-	u64 user_data = READ_ONCE(s->sqe->user_data);
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
 	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	bool needs_user;
+	u64 user_data;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
@@ -893,13 +972,23 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	current->files = ctx->sqo_files;
 	task_unlock(current);
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
-	}
+	user_data = READ_ONCE(s->sqe->user_data);
 
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
+		s->has_user = true;
+	}
 
 	do {
 		ret = __io_submit_sqe(ctx, req, s, false, NULL);
@@ -913,9 +1002,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		cond_resched();
 	} while (1);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, user_data, ret, 0);
@@ -1229,6 +1320,14 @@ static void *io_mem_alloc(size_t size)
 	return (void *) __get_free_pages(gfp_flags, get_order(size));
 }
 
+static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		return __io_account_mem(ctx->user, nr_pages);
+
+	return 0;
+}
+
 static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 {
 	struct io_sq_ring *sq_ring;
@@ -1242,6 +1341,169 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->sq_entries; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		io_unaccount_mem(ctx, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	free_uid(ctx->user);
+	ctx->user = NULL;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	if (!capable(CAP_IPC_LOCK))
+		ctx->user = get_uid(current_user());
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base || !iov.iov_len)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		ret = io_account_mem(ctx, nr_pages);
+		if (ret)
+			goto err;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			if (!pages) {
+				io_unaccount_mem(ctx, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		if (!imu->bvec) {
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		down_read(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages,
+						FOLL_WRITE | FOLL_ANON, pages,
+						NULL);
+		up_read(&current->mm->mmap_sem);
+
+		if (pret != nr_pages) {
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			ret = pret < 0 ? pret : -EFAULT;
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	destroy_workqueue(ctx->sqo_wq);
@@ -1249,6 +1511,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	put_files_struct(ctx->sqo_files);
 
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
@@ -1528,6 +1791,62 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_reinit(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ctx = f.file->private_data;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&ctx->uring_lock)) {
+		ret = __io_uring_register(ctx, opcode, arg, nr_args);
+		mutex_unlock(&ctx->uring_lock);
+	}
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4952fc921866..16c423d74f2e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,7 +27,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -39,6 +42,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -102,4 +107,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-29  0:36       ` Jann Horn
@ 2019-01-29  1:25         ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-29  1:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 5:36 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 12:50 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 4:35 PM, Jann Horn wrote:
>>> On Mon, Jan 28, 2019 at 10:36 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> If we have fixed user buffers, we can map them into the kernel when we
>>>> setup the io_context. That avoids the need to do get_user_pages() for
>>>> each and every IO.
>>> [...]
>>>> +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>>>> +                              void __user *arg, unsigned nr_args)
>>>> +{
>>>> +       int ret;
>>>> +
>>>> +       /* Drop our initial ref and wait for the ctx to be fully idle */
>>>> +       percpu_ref_put(&ctx->refs);
>>>
>>> The line above drops a reference that you just got in the caller...
>>
>> Right
>>
>>>> +       percpu_ref_kill(&ctx->refs);
>>>> +       wait_for_completion(&ctx->ctx_done);
>>>> +
>>>> +       switch (opcode) {
>>>> +       case IORING_REGISTER_BUFFERS:
>>>> +               ret = io_sqe_buffer_register(ctx, arg, nr_args);
>>>> +               break;
>>>> +       case IORING_UNREGISTER_BUFFERS:
>>>> +               ret = -EINVAL;
>>>> +               if (arg || nr_args)
>>>> +                       break;
>>>> +               ret = io_sqe_buffer_unregister(ctx);
>>>> +               break;
>>>> +       default:
>>>> +               ret = -EINVAL;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       /* bring the ctx back to life */
>>>> +       reinit_completion(&ctx->ctx_done);
>>>> +       percpu_ref_resurrect(&ctx->refs);
>>>> +       percpu_ref_get(&ctx->refs);
>>>
>>> And then this line takes a reference that the caller will immediately
>>> drop again? Why?
>>
>> Just want to keep it symmetric and avoid having weird "this function drops
>> a reference" use cases.
>>
>>>
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
>>>> +               void __user *, arg, unsigned int, nr_args)
>>>> +{
>>>> +       struct io_ring_ctx *ctx;
>>>> +       long ret = -EBADF;
>>>> +       struct fd f;
>>>> +
>>>> +       f = fdget(fd);
>>>> +       if (!f.file)
>>>> +               return -EBADF;
>>>> +
>>>> +       ret = -EOPNOTSUPP;
>>>> +       if (f.file->f_op != &io_uring_fops)
>>>> +               goto out_fput;
>>>> +
>>>> +       ret = -ENXIO;
>>>> +       ctx = f.file->private_data;
>>>> +       if (!percpu_ref_tryget(&ctx->refs))
>>>> +               goto out_fput;
>>>
>>> If you are holding the uring_lock of a ctx that can be accessed
>>> through a file descriptor (which you do just after this point), you
>>> know that the percpu_ref isn't zero, right? Why are you doing the
>>> tryget here?
>>
>> Not sure I follow... We don't hold the lock at this point. I guess your
>> point is that since the descriptor is open (or we'd fail the above
>> check), then there's no point doing the tryget variant here? That's
>> strictly true, that could just be a get().
> 
> As far as I can tell, you could do the following without breaking anything:
> 
> ========================
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 6916dc3222cf..c2d82765eefe 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2485,7 +2485,6 @@ static int __io_uring_register(struct
> io_ring_ctx *ctx, unsigned opcode,
>         int ret;
> 
>         /* Drop our initial ref and wait for the ctx to be fully idle */
> -       percpu_ref_put(&ctx->refs);
>         percpu_ref_kill(&ctx->refs);
>         wait_for_completion(&ctx->ctx_done);
> 
> @@ -2516,7 +2515,6 @@ static int __io_uring_register(struct
> io_ring_ctx *ctx, unsigned opcode,
>         /* bring the ctx back to life */
>         reinit_completion(&ctx->ctx_done);
>         percpu_ref_resurrect(&ctx->refs);
> -       percpu_ref_get(&ctx->refs);
>         return ret;
>  }
> 
> @@ -2535,17 +2533,13 @@ SYSCALL_DEFINE4(io_uring_register, unsigned
> int, fd, unsigned int, opcode,
>         if (f.file->f_op != &io_uring_fops)
>                 goto out_fput;
> 
> -       ret = -ENXIO;
>         ctx = f.file->private_data;
> -       if (!percpu_ref_tryget(&ctx->refs))
> -               goto out_fput;
> 
>         ret = -EBUSY;
>         if (mutex_trylock(&ctx->uring_lock)) {
>                 ret = __io_uring_register(ctx, opcode, arg, nr_args);
>                 mutex_unlock(&ctx->uring_lock);
>         }
> -       io_ring_drop_ctx_refs(ctx, 1);
>  out_fput:
>         fdput(f);
>         return ret;
> ========================
> 
> The two functions that can drop the initial ref of the percpu refcount are:
> 
> 1. io_ring_ctx_wait_and_kill(); this is only used on ->release() or on
> setup failure, meaning that as long as you have a reference to the
> file from fget()/fdget(), io_ring_ctx_wait_and_kill() can't have been
> called on your context
> 2. __io_uring_register(); this temporarily kills the percpu refcount
> and resurrects it, all under ctx->uring_lock, meaning that as long as
> you're holding ctx->uring_lock, __io_uring_register() can't have
> killed the percpu refcount
> 
> Therefore, I think that as long as you're in sys_io_uring_register and
> holding the ctx->uring_lock, you know that the percpu refcount is
> alive, and bumping and dropping non-initial references has no effect.
> 
> Perhaps this makes more sense when you view the percpu refcount as a
> read/write lock - percpu_ref_tryget() takes a read lock, the
> percpu_ref_kill() dance takes a write lock.

This looks good, I'll fold it in. Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-28 23:50     ` Jens Axboe
@ 2019-01-29  0:36       ` Jann Horn
  2019-01-29  1:25         ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-29  0:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Tue, Jan 29, 2019 at 12:50 AM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/28/19 4:35 PM, Jann Horn wrote:
> > On Mon, Jan 28, 2019 at 10:36 PM Jens Axboe <axboe@kernel.dk> wrote:
> >> If we have fixed user buffers, we can map them into the kernel when we
> >> setup the io_context. That avoids the need to do get_user_pages() for
> >> each and every IO.
> > [...]
> >> +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> >> +                              void __user *arg, unsigned nr_args)
> >> +{
> >> +       int ret;
> >> +
> >> +       /* Drop our initial ref and wait for the ctx to be fully idle */
> >> +       percpu_ref_put(&ctx->refs);
> >
> > The line above drops a reference that you just got in the caller...
>
> Right
>
> >> +       percpu_ref_kill(&ctx->refs);
> >> +       wait_for_completion(&ctx->ctx_done);
> >> +
> >> +       switch (opcode) {
> >> +       case IORING_REGISTER_BUFFERS:
> >> +               ret = io_sqe_buffer_register(ctx, arg, nr_args);
> >> +               break;
> >> +       case IORING_UNREGISTER_BUFFERS:
> >> +               ret = -EINVAL;
> >> +               if (arg || nr_args)
> >> +                       break;
> >> +               ret = io_sqe_buffer_unregister(ctx);
> >> +               break;
> >> +       default:
> >> +               ret = -EINVAL;
> >> +               break;
> >> +       }
> >> +
> >> +       /* bring the ctx back to life */
> >> +       reinit_completion(&ctx->ctx_done);
> >> +       percpu_ref_resurrect(&ctx->refs);
> >> +       percpu_ref_get(&ctx->refs);
> >
> > And then this line takes a reference that the caller will immediately
> > drop again? Why?
>
> Just want to keep it symmetric and avoid having weird "this function drops
> a reference" use cases.
>
> >
> >> +       return ret;
> >> +}
> >> +
> >> +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> >> +               void __user *, arg, unsigned int, nr_args)
> >> +{
> >> +       struct io_ring_ctx *ctx;
> >> +       long ret = -EBADF;
> >> +       struct fd f;
> >> +
> >> +       f = fdget(fd);
> >> +       if (!f.file)
> >> +               return -EBADF;
> >> +
> >> +       ret = -EOPNOTSUPP;
> >> +       if (f.file->f_op != &io_uring_fops)
> >> +               goto out_fput;
> >> +
> >> +       ret = -ENXIO;
> >> +       ctx = f.file->private_data;
> >> +       if (!percpu_ref_tryget(&ctx->refs))
> >> +               goto out_fput;
> >
> > If you are holding the uring_lock of a ctx that can be accessed
> > through a file descriptor (which you do just after this point), you
> > know that the percpu_ref isn't zero, right? Why are you doing the
> > tryget here?
>
> Not sure I follow... We don't hold the lock at this point. I guess your
> point is that since the descriptor is open (or we'd fail the above
> check), then there's no point doing the tryget variant here? That's
> strictly true, that could just be a get().

As far as I can tell, you could do the following without breaking anything:

========================
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6916dc3222cf..c2d82765eefe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2485,7 +2485,6 @@ static int __io_uring_register(struct
io_ring_ctx *ctx, unsigned opcode,
        int ret;

        /* Drop our initial ref and wait for the ctx to be fully idle */
-       percpu_ref_put(&ctx->refs);
        percpu_ref_kill(&ctx->refs);
        wait_for_completion(&ctx->ctx_done);

@@ -2516,7 +2515,6 @@ static int __io_uring_register(struct
io_ring_ctx *ctx, unsigned opcode,
        /* bring the ctx back to life */
        reinit_completion(&ctx->ctx_done);
        percpu_ref_resurrect(&ctx->refs);
-       percpu_ref_get(&ctx->refs);
        return ret;
 }

@@ -2535,17 +2533,13 @@ SYSCALL_DEFINE4(io_uring_register, unsigned
int, fd, unsigned int, opcode,
        if (f.file->f_op != &io_uring_fops)
                goto out_fput;

-       ret = -ENXIO;
        ctx = f.file->private_data;
-       if (!percpu_ref_tryget(&ctx->refs))
-               goto out_fput;

        ret = -EBUSY;
        if (mutex_trylock(&ctx->uring_lock)) {
                ret = __io_uring_register(ctx, opcode, arg, nr_args);
                mutex_unlock(&ctx->uring_lock);
        }
-       io_ring_drop_ctx_refs(ctx, 1);
 out_fput:
        fdput(f);
        return ret;
========================

The two functions that can drop the initial ref of the percpu refcount are:

1. io_ring_ctx_wait_and_kill(); this is only used on ->release() or on
setup failure, meaning that as long as you have a reference to the
file from fget()/fdget(), io_ring_ctx_wait_and_kill() can't have been
called on your context
2. __io_uring_register(); this temporarily kills the percpu refcount
and resurrects it, all under ctx->uring_lock, meaning that as long as
you're holding ctx->uring_lock, __io_uring_register() can't have
killed the percpu refcount

Therefore, I think that as long as you're in sys_io_uring_register and
holding the ctx->uring_lock, you know that the percpu refcount is
alive, and bumping and dropping non-initial references has no effect.

Perhaps this makes more sense when you view the percpu refcount as a
read/write lock - percpu_ref_tryget() takes a read lock, the
percpu_ref_kill() dance takes a write lock.
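
Spelling that analogy out against the ctx fields from this patch (an
illustrative fragment only, with made-up function names; it relies on the
ref release callback completing ctx_done, as the patch does):

/* "read side": every request pins the ctx around one IO */
static int ctx_enter(struct io_ring_ctx *ctx)
{
	return percpu_ref_tryget(&ctx->refs) ? 0 : -ENXIO;
}

static void ctx_exit(struct io_ring_ctx *ctx)
{
	percpu_ref_put(&ctx->refs);
}

/* "write side": __io_uring_register() quiesces all readers, then reopens */
static void ctx_quiesce_and_update(struct io_ring_ctx *ctx)
{
	percpu_ref_kill(&ctx->refs);		/* new tryget()s now fail */
	wait_for_completion(&ctx->ctx_done);	/* existing readers drain */
	/* ... registration work runs with the ctx idle ... */
	reinit_completion(&ctx->ctx_done);
	percpu_ref_resurrect(&ctx->refs);	/* readers may enter again */
}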

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-28 23:35   ` Jann Horn
@ 2019-01-28 23:50     ` Jens Axboe
  2019-01-29  0:36       ` Jann Horn
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-28 23:50 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On 1/28/19 4:35 PM, Jann Horn wrote:
> On Mon, Jan 28, 2019 at 10:36 PM Jens Axboe <axboe@kernel.dk> wrote:
>> If we have fixed user buffers, we can map them into the kernel when we
>> setup the io_context. That avoids the need to do get_user_pages() for
>> each and every IO.
> [...]
>> +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>> +                              void __user *arg, unsigned nr_args)
>> +{
>> +       int ret;
>> +
>> +       /* Drop our initial ref and wait for the ctx to be fully idle */
>> +       percpu_ref_put(&ctx->refs);
> 
> The line above drops a reference that you just got in the caller...

Right

>> +       percpu_ref_kill(&ctx->refs);
>> +       wait_for_completion(&ctx->ctx_done);
>> +
>> +       switch (opcode) {
>> +       case IORING_REGISTER_BUFFERS:
>> +               ret = io_sqe_buffer_register(ctx, arg, nr_args);
>> +               break;
>> +       case IORING_UNREGISTER_BUFFERS:
>> +               ret = -EINVAL;
>> +               if (arg || nr_args)
>> +                       break;
>> +               ret = io_sqe_buffer_unregister(ctx);
>> +               break;
>> +       default:
>> +               ret = -EINVAL;
>> +               break;
>> +       }
>> +
>> +       /* bring the ctx back to life */
>> +       reinit_completion(&ctx->ctx_done);
>> +       percpu_ref_resurrect(&ctx->refs);
>> +       percpu_ref_get(&ctx->refs);
> 
> And then this line takes a reference that the caller will immediately
> drop again? Why?

Just want to keep it symmetric and avoid having weird "this function drops
a reference" use cases.

> 
>> +       return ret;
>> +}
>> +
>> +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
>> +               void __user *, arg, unsigned int, nr_args)
>> +{
>> +       struct io_ring_ctx *ctx;
>> +       long ret = -EBADF;
>> +       struct fd f;
>> +
>> +       f = fdget(fd);
>> +       if (!f.file)
>> +               return -EBADF;
>> +
>> +       ret = -EOPNOTSUPP;
>> +       if (f.file->f_op != &io_uring_fops)
>> +               goto out_fput;
>> +
>> +       ret = -ENXIO;
>> +       ctx = f.file->private_data;
>> +       if (!percpu_ref_tryget(&ctx->refs))
>> +               goto out_fput;
> 
> If you are holding the uring_lock of a ctx that can be accessed
> through a file descriptor (which you do just after this point), you
> know that the percpu_ref isn't zero, right? Why are you doing the
> tryget here?

Not sure I follow... We don't hold the lock at this point. I guess your
point is that since the descriptor is open (or we'd fail the above
check), then there's no point doing the tryget variant here? That's
strictly true, that could just be a get().

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-28 21:35 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
@ 2019-01-28 23:35   ` Jann Horn
  2019-01-28 23:50     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jann Horn @ 2019-01-28 23:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-man, Linux API, hch, jmoyer, Avi Kivity

On Mon, Jan 28, 2019 at 10:36 PM Jens Axboe <axboe@kernel.dk> wrote:
> If we have fixed user buffers, we can map them into the kernel when we
> setup the io_context. That avoids the need to do get_user_pages() for
> each and every IO.
[...]
> +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> +                              void __user *arg, unsigned nr_args)
> +{
> +       int ret;
> +
> +       /* Drop our initial ref and wait for the ctx to be fully idle */
> +       percpu_ref_put(&ctx->refs);

The line above drops a reference that you just got in the caller...

> +       percpu_ref_kill(&ctx->refs);
> +       wait_for_completion(&ctx->ctx_done);
> +
> +       switch (opcode) {
> +       case IORING_REGISTER_BUFFERS:
> +               ret = io_sqe_buffer_register(ctx, arg, nr_args);
> +               break;
> +       case IORING_UNREGISTER_BUFFERS:
> +               ret = -EINVAL;
> +               if (arg || nr_args)
> +                       break;
> +               ret = io_sqe_buffer_unregister(ctx);
> +               break;
> +       default:
> +               ret = -EINVAL;
> +               break;
> +       }
> +
> +       /* bring the ctx back to life */
> +       reinit_completion(&ctx->ctx_done);
> +       percpu_ref_resurrect(&ctx->refs);
> +       percpu_ref_get(&ctx->refs);

And then this line takes a reference that the caller will immediately
drop again? Why?

> +       return ret;
> +}
> +
> +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> +               void __user *, arg, unsigned int, nr_args)
> +{
> +       struct io_ring_ctx *ctx;
> +       long ret = -EBADF;
> +       struct fd f;
> +
> +       f = fdget(fd);
> +       if (!f.file)
> +               return -EBADF;
> +
> +       ret = -EOPNOTSUPP;
> +       if (f.file->f_op != &io_uring_fops)
> +               goto out_fput;
> +
> +       ret = -ENXIO;
> +       ctx = f.file->private_data;
> +       if (!percpu_ref_tryget(&ctx->refs))
> +               goto out_fput;

If you are holding the uring_lock of a ctx that can be accessed
through a file descriptor (which you do just after this point), you
know that the percpu_ref isn't zero, right? Why are you doing the
tryget here?

> +       ret = -EBUSY;
> +       if (mutex_trylock(&ctx->uring_lock)) {
> +               ret = __io_uring_register(ctx, opcode, arg, nr_args);
> +               mutex_unlock(&ctx->uring_lock);
> +       }
> +       io_ring_drop_ctx_refs(ctx, 1);
> +out_fput:
> +       fdput(f);
> +       return ret;
> +}

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-28 21:35 [PATCHSET v8] io_uring IO interface Jens Axboe
@ 2019-01-28 21:35 ` Jens Axboe
  2019-01-28 23:35   ` Jann Horn
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2019-01-28 21:35 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-man, linux-api; +Cc: hch, jmoyer, avi, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_context. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having set up an io_uring context, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
to an iovec array, and nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. The range
sqe->addr..sqe->addr+sqe->len must fall inside the indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring context. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring context.

It's perfectly valid to set up a larger buffer and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per-buffer size limit is also imposed.
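
To make the accounting above concrete, the number of pinned pages charged
against RLIMIT_MEMLOCK for a registered iovec that need not be page
aligned follows the usual round-to-page arithmetic; a small standalone
sketch, with PAGE_SHIFT assumed to be 12:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* pages spanned by [ubuf, ubuf + len), mirroring io_sqe_buffer_register() */
static size_t pages_to_pin(uint64_t ubuf, size_t len)
{
	uint64_t start = ubuf >> PAGE_SHIFT;
	uint64_t end = (ubuf + len + PAGE_SIZE - 1) >> PAGE_SHIFT;

	return end - start;	/* e.g. 4KB at page offset 2048 pins 2 pages */
}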

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 356 ++++++++++++++++++++++++-
 include/linux/sched/user.h             |   2 +-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 8 files changed, 366 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a33f1b1709d0..682714d6f217 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -25,8 +25,10 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
+#include <linux/sizes.h>
 
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
@@ -57,6 +59,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -89,6 +98,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -663,12 +676,51 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	struct io_mapped_ubuf *imu;
+	size_t len = sqe->len;
+	size_t offset;
+	int index;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+	if (unlikely(sqe->buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(sqe->buf_index, ctx->sq_entries);
+	imu = &ctx->user_bufs[index];
+	if ((unsigned long) sqe->addr < imu->ubuf ||
+	    (unsigned long) sqe->addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = (unsigned long) sqe->addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct io_uring_sqe *sqe,
 			   struct iovec **iovec, struct iov_iter *iter)
 {
 	void __user *buf = u64_to_user_ptr(sqe->addr);
 
+	if (sqe->opcode == IORING_OP_READ_FIXED ||
+	    sqe->opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
+
 #ifdef CONFIG_COMPAT
 	if (in_compat_syscall())
 		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
@@ -805,7 +857,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 	if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC))
 		return -EINVAL;
@@ -840,9 +892,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, sqe);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, sqe, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, sqe, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, sqe, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, sqe, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -865,14 +927,21 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	return !(sqe->opcode == IORING_OP_READ_FIXED ||
+		 sqe->opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
 	u64 user_data = s->sqe->user_data;
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
 	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	bool needs_user;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
@@ -881,19 +950,28 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	old_files = current->files;
 	current->files = ctx->sqo_files;
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
 	}
 
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-
 	ret = __io_submit_sqe(ctx, req, s, false, NULL);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, user_data, ret, 0);
@@ -1194,6 +1272,14 @@ static void *io_mem_alloc(size_t size)
 	return (void *) __get_free_pages(gfp_flags, get_order(size));
 }
 
+static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		return __io_account_mem(ctx->user, nr_pages);
+
+	return 0;
+}
+
 static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 {
 	struct io_sq_ring *sq_ring;
@@ -1207,10 +1293,195 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->sq_entries; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		io_unaccount_mem(ctx, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	free_uid(ctx->user);
+	ctx->user = NULL;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (in_compat_syscall()) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct vm_area_struct **vmas = NULL;
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	if (!capable(CAP_IPC_LOCK))
+		ctx->user = get_uid(current_user());
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		ret = io_account_mem(ctx, nr_pages);
+		if (ret)
+			goto err;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(vmas);
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			vmas = kmalloc_array(nr_pages,
+					sizeof(struct vm_area_struct *),
+					GFP_KERNEL);
+			if (!pages || !vmas) {
+				io_unaccount_mem(ctx, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		if (!imu->bvec) {
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		down_write(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
+						pages, vmas);
+		if (pret == nr_pages) {
+			/* don't support file backed memory */
+			for (j = 0; j < nr_pages; j++) {
+				struct vm_area_struct *vma = vmas[j];
+
+				if (vma->vm_file) {
+					ret = -EOPNOTSUPP;
+					break;
+				}
+			}
+		} else {
+			ret = pret < 0 ? pret : -EFAULT;
+		}
+		up_write(&current->mm->mmap_sem);
+		if (ret) {
+			/*
+			 * if we did partial map, or found file backed vmas,
+			 * release any pages we did get
+			 */
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	kfree(vmas);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	kfree(vmas);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	destroy_workqueue(ctx->sqo_wq);
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
@@ -1486,6 +1757,69 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	/* Drop our initial ref and wait for the ctx to be fully idle */
+	percpu_ref_put(&ctx->refs);
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_resurrect(&ctx->refs);
+	percpu_ref_get(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&ctx->uring_lock)) {
+		ret = __io_uring_register(ctx, opcode, arg, nr_args);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4fc5fbd07688..03ce7133c3b2 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -29,7 +29,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -41,6 +44,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -104,4 +109,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
  2019-01-23 15:35 [PATCHSET v7] io_uring IO interface Jens Axboe
@ 2019-01-23 15:35 ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2019-01-23 15:35 UTC (permalink / raw)
  To: linux-fsdevel, linux-aio, linux-block; +Cc: hch, jmoyer, avi, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
set up the io_uring context. That avoids the need to do
get_user_pages() for each and every IO.

To utilize this feature, the application must call io_uring_register()
after having set up an io_uring context, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer
to an iovec array, and nr_args must contain the number of iovecs the
application wishes to map.
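
As a rough userspace sketch (not part of this patch), registration could
look like the below. It assumes the uapi header from this series is
installed as <linux/io_uring.h>, that the syscall numbers match the
tables in this patch (425 for io_uring_setup, 427 for io_uring_register),
and it leaves out the SQ/CQ ring mmaps and most error cleanup:

#include <linux/io_uring.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup	425
#endif
#ifndef __NR_io_uring_register
#define __NR_io_uring_register	427
#endif

int main(void)
{
	struct io_uring_params p;
	struct iovec iovs[2];
	int ring_fd, i;

	/* create a small ring; the SQ/CQ mmaps are not shown here */
	memset(&p, 0, sizeof(p));
	ring_fd = syscall(__NR_io_uring_setup, 4, &p);
	if (ring_fd < 0) {
		perror("io_uring_setup");
		return 1;
	}

	/* two 64KB buffers, pinned until unregistered or ring teardown */
	for (i = 0; i < 2; i++) {
		if (posix_memalign(&iovs[i].iov_base, 4096, 65536))
			return 1;
		iovs[i].iov_len = 65536;
	}

	if (syscall(__NR_io_uring_register, ring_fd,
		    IORING_REGISTER_BUFFERS, iovs, 2) < 0) {
		perror("io_uring_register");
		return 1;
	}
	return 0;
}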

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. The range
sqe->addr..sqe->addr+sqe->len must fall entirely inside the indexed
buffer.
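
Building on the sketch above (again illustrative only, field names as in
the struct io_uring_sqe layout added earlier in this series), preparing
a fixed-buffer read into registered buffer 0 might look like this; the
sqe is assumed to point into the mmap'ed SQ entry array, and submission
via io_uring_enter(2) is not shown:

/* read 'nbytes' from 'fd' at file offset 'offset' into buffer 0 */
static void prep_read_fixed(struct io_uring_sqe *sqe, int fd,
			    void *buf, unsigned nbytes, off_t offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = fd;
	sqe->off = offset;
	/* addr/len must stay inside the iovec registered at buf_index */
	sqe->addr = (unsigned long) buf;
	sqe->len = nbytes;
	sqe->buf_index = 0;
	sqe->user_data = (unsigned long) buf;
}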

The application may register buffers throughout the lifetime of the
io_uring context. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring context.
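
Unregistering is a single call, sketched here with the same raw syscall
assumptions as above; arg and nr_args must both be zero for
IORING_UNREGISTER_BUFFERS:

	/* release the pinned pages; a new set may be registered later */
	if (syscall(__NR_io_uring_register, ring_fd,
		    IORING_UNREGISTER_BUFFERS, NULL, 0) < 0)
		perror("io_uring_register(UNREGISTER_BUFFERS)");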

It's perfectly valid to set up a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per buffer size is also imposed.
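
An unprivileged task (no CAP_IPC_LOCK) therefore needs enough
RLIMIT_MEMLOCK headroom to cover the registered buffers, on top of what
the rings themselves are accounted. A quick sanity check before
registering, sized for the two 64KB buffers in the sketch above, could
be (illustrative only, needs <sys/resource.h>):

	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0 && rl.rlim_cur < 2 * 65536)
		fprintf(stderr, "RLIMIT_MEMLOCK too small (%llu bytes)\n",
			(unsigned long long) rl.rlim_cur);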

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 357 ++++++++++++++++++++++++-
 include/linux/sched/user.h             |   2 +-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 8 files changed, 367 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index a6076d1e2154..7cdbd0712df5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_compat_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 497bea0f29c5..63ad09e7cdc7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -25,8 +25,11 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
+#include <linux/sizes.h>
+#include <linux/nospec.h>
 
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
@@ -57,6 +60,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -90,6 +100,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -664,12 +678,51 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	struct io_mapped_ubuf *imu;
+	size_t len = sqe->len;
+	size_t offset;
+	int index;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+	if (unlikely(sqe->buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(sqe->buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	if ((unsigned long) sqe->addr < imu->ubuf ||
+	    (unsigned long) sqe->addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = (unsigned long) sqe->addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct io_uring_sqe *sqe,
 			   struct iovec **iovec, struct iov_iter *iter)
 {
 	void __user *buf = u64_to_user_ptr(sqe->addr);
 
+	if (sqe->opcode == IORING_OP_READ_FIXED ||
+	    sqe->opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
+
 #ifdef CONFIG_COMPAT
 	if (ctx->compat)
 		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
@@ -805,7 +858,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 	if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC))
 		return -EINVAL;
@@ -840,9 +893,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, sqe);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, sqe, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, sqe, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, sqe, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, sqe, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -865,14 +928,21 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	return !(sqe->opcode == IORING_OP_READ_FIXED ||
+		 sqe->opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
 	u64 user_data = s->sqe->user_data;
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
 	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	bool needs_user;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
@@ -881,19 +951,28 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	old_files = current->files;
 	current->files = ctx->sqo_files;
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
 	}
 
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-
 	ret = __io_submit_sqe(ctx, req, s, false, NULL);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, user_data, ret, 0);
@@ -1163,6 +1242,14 @@ static int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
 	return 0;
 }
 
+static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		return __io_account_mem(ctx->user, nr_pages);
+
+	return 0;
+}
+
 static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 {
 	struct io_sq_ring *sq_ring;
@@ -1176,6 +1263,190 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->sq_entries; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		io_unaccount_mem(ctx, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	free_uid(ctx->user);
+	ctx->user = NULL;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct vm_area_struct **vmas = NULL;
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	if (!capable(CAP_IPC_LOCK))
+		ctx->user = get_uid(current_user());
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		ret = io_account_mem(ctx, nr_pages);
+		if (ret)
+			goto err;
+
+		if (!pages || nr_pages > got_pages) {
+			kfree(vmas);
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			vmas = kmalloc_array(nr_pages,
+					sizeof(struct vm_area_struct *),
+					GFP_KERNEL);
+			if (!pages || !vmas) {
+				io_unaccount_mem(ctx, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		if (!imu->bvec) {
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		down_write(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
+						pages, vmas);
+		if (pret == nr_pages) {
+			/* don't support file backed memory */
+			for (j = 0; j < nr_pages; j++) {
+				struct vm_area_struct *vma = vmas[j];
+
+				if (vma->vm_file) {
+					ret = -EOPNOTSUPP;
+					break;
+				}
+			}
+		} else {
+			ret = pret < 0 ? pret : -EFAULT;
+		}
+		up_write(&current->mm->mmap_sem);
+		if (ret) {
+			/*
+			 * if we did partial map, or found file backed vmas,
+			 * release any pages we did get
+			 */
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			io_unaccount_mem(ctx, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	kfree(vmas);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	kfree(vmas);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_free_scq_urings(struct io_ring_ctx *ctx)
 {
 	if (ctx->sq_ring) {
@@ -1197,6 +1468,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	io_sq_offload_stop(ctx);
 	io_iopoll_reap_events(ctx);
 	io_free_scq_urings(ctx);
+	io_sqe_buffer_unregister(ctx);
 	percpu_ref_exit(&ctx->refs);
 	io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries));
 	kfree(ctx);
@@ -1488,6 +1760,69 @@ COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 }
 #endif
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	/* Drop our initial ref and wait for the ctx to be fully idle */
+	percpu_ref_put(&ctx->refs);
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_resurrect(&ctx->refs);
+	percpu_ref_get(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&ctx->uring_lock)) {
+		ret = __io_uring_register(ctx, opcode, arg, nr_args);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 542757a4c898..101f7024d154 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -314,6 +314,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 				struct io_uring_params __user *p);
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4fc5fbd07688..03ce7133c3b2 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -29,7 +29,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -41,6 +44,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -104,4 +109,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index d754811ec780..38567718c397 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -49,6 +49,7 @@ COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL_COMPAT(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2019-02-11 20:33 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-29 19:26 [PATCHSET v9] io_uring IO interface Jens Axboe
2019-01-29 19:26 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
2019-01-29 19:26 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
2019-01-29 19:26 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
2019-01-29 19:26 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
2019-01-29 19:26 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
2019-01-29 19:26 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
2019-01-29 19:26 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
2019-01-29 20:47   ` Jann Horn
2019-01-29 20:56     ` Jens Axboe
2019-01-29 21:10       ` Jann Horn
2019-01-29 21:33         ` Jens Axboe
2019-01-29 19:26 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
2019-01-29 19:26 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
2019-01-29 23:31   ` Jann Horn
2019-01-29 23:44     ` Jens Axboe
2019-01-30 15:33       ` Jens Axboe
2019-01-29 19:26 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
2019-01-29 19:26 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-01-29 19:26 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-29 22:44   ` Jann Horn
2019-01-29 22:56     ` Jens Axboe
2019-01-29 23:03       ` Jann Horn
2019-01-29 23:06         ` Jens Axboe
2019-01-29 23:08           ` Jann Horn
2019-01-29 23:14             ` Jens Axboe
2019-01-29 23:42               ` Jann Horn
2019-01-29 23:51                 ` Jens Axboe
2019-01-29 19:26 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-30  1:29   ` Jann Horn
2019-01-30 15:35     ` Jens Axboe
2019-02-04  2:56     ` Al Viro
2019-02-05  2:19       ` Jens Axboe
2019-02-05 17:57         ` Jens Axboe
2019-02-05 19:08           ` Jens Axboe
2019-02-06  0:27             ` Jens Axboe
2019-02-06  1:01               ` Al Viro
2019-02-06 17:56                 ` Jens Axboe
2019-02-07  4:05                   ` Al Viro
2019-02-07 16:14                     ` Jens Axboe
2019-02-07 16:30                       ` Al Viro
2019-02-07 16:35                         ` Jens Axboe
2019-02-07 16:51                         ` Al Viro
2019-02-06  0:56             ` Al Viro
2019-02-06 13:41               ` Jens Axboe
2019-02-07  4:00                 ` Al Viro
2019-02-07  9:22                   ` Miklos Szeredi
2019-02-07 13:31                     ` Al Viro
2019-02-07 14:20                       ` Miklos Szeredi
2019-02-07 15:20                         ` Al Viro
2019-02-07 15:27                           ` Miklos Szeredi
2019-02-07 16:26                             ` Al Viro
2019-02-07 19:08                               ` Miklos Szeredi
2019-02-07 18:45                   ` Jens Axboe
2019-02-07 18:58                     ` Jens Axboe
2019-02-11 15:55                     ` Jonathan Corbet
2019-02-11 17:35                       ` Al Viro
2019-02-11 20:33                         ` Jonathan Corbet
2019-01-29 19:26 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
2019-01-29 19:26 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
2019-01-29 19:27 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-01-29 19:27 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
2019-01-29 19:27 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2019-02-07 19:55 [PATCHSET v12] io_uring IO interface Jens Axboe
2019-02-07 19:55 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-02-07 20:57   ` Jeff Moyer
2019-02-07 21:02     ` Jens Axboe
2019-02-07 22:38   ` Jeff Moyer
2019-02-07 22:47     ` Jens Axboe
2019-02-01 15:23 [PATCHSET v11] io_uring IO interface Jens Axboe
2019-02-01 15:24 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-30 21:55 [PATCHSET v10] io_uring IO interface Jens Axboe
2019-01-30 21:55 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-28 21:35 [PATCHSET v8] io_uring IO interface Jens Axboe
2019-01-28 21:35 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-28 23:35   ` Jann Horn
2019-01-28 23:50     ` Jens Axboe
2019-01-29  0:36       ` Jann Horn
2019-01-29  1:25         ` Jens Axboe
2019-01-23 15:35 [PATCHSET v7] io_uring IO interface Jens Axboe
2019-01-23 15:35 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
