* [PATCHSET v15] io_uring IO interface
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api; +Cc: hch, jmoyer, avi, jannh, viro

Some final tweaks, mostly cosmetic, but also two important fixes:

1) Ensure that we account the skb appropriately against the socket.
   Some network config options apparently return an skb with
   ->truesize != 0 even when allocated with a size of 0, so ensure we
   add those as references against sock->sk_wmem_alloc. Reported by
   Matt Mullins.

2) Ensure that ANY async punt of an sqe operates on a copy. We've
   already committed the SQ ring change at that point, so we don't
   want the application inadvertently reusing an SQE and causing
   issues for one that got punted with -EAGAIN. This change ensures
   that once io_uring_enter(2) has reported SQEs as submitted, it's
   totally safe for the application to reuse those SQE entries in
   the ring.

Outside of that, just cosmetic changes and additions of comments. I've
run this version through various torture tests overnight, using the
various modes (polled, buffered, sq thread, any combination thereof),
and it held up perfectly. As far as I'm concerned, this is ready to get
staged for 5.1.

The liburing git repo has a full set of man pages for this, though they
could probably still use a bit of polish. I'd also like to see an
io_uring(7) man page to describe the overall design of the project;
expect that in the not-so-distant future. You can clone the repo here:

git://git.kernel.dk/liburing

Patches are against 5.0-rc6, and can also be found in my io_uring branch
here:

git://git.kernel.dk/linux-block io_uring

Changes since v14:
- Fix skb/sock referencing if skb->truesize != 0
- Add comments on memory ordering
- Add comments on READ/WRITE_ONCE()
- Various function comments
- Align struct members of io_poll_iocb and io_submit_state
- Use io_fput() in two places where it was open-coded
- Make async context always use a copy of the sqe
- Don't reset s->needs_fixed_file for async context
- Rebase on v5.0-rc6

 Documentation/filesystems/vfs.txt      |    3 +
 arch/x86/entry/syscalls/syscall_32.tbl |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   59 +-
 fs/Makefile                            |    1 +
 fs/block_dev.c                         |   19 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/gfs2/file.c                         |    2 +
 fs/io_uring.c                          | 2920 ++++++++++++++++++++++++
 fs/iomap.c                             |   48 +-
 fs/xfs/xfs_file.c                      |    1 +
 include/linux/bio.h                    |   14 +
 include/linux/blk_types.h              |    1 +
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |   15 +-
 include/linux/iomap.h                  |    1 +
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    8 +
 include/net/af_unix.h                  |    1 +
 include/uapi/asm-generic/unistd.h      |    8 +-
 include/uapi/linux/io_uring.h          |  142 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 net/Makefile                           |    2 +-
 net/unix/Kconfig                       |    5 +
 net/unix/Makefile                      |    2 +
 net/unix/af_unix.c                     |   63 +-
 net/unix/garbage.c                     |   68 +-
 net/unix/scm.c                         |  151 ++
 net/unix/scm.h                         |   10 +
 31 files changed, 3421 insertions(+), 169 deletions(-)

-- 
Jens Axboe



* [PATCH 01/19] fs: add an iopoll method to struct file_operations
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

This new method is used to explicitly poll for I/O completion for an
iocb.  It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct kiocb to store
the polling cookie.
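
As a rough caller-side sketch (not part of this patch; the
wait_for_hipri_kiocb() helper and the 'done' flag are invented for
illustration, the real consumers being the polling loops added later in
this series), driving the new hook could look like this:

static void wait_for_hipri_kiocb(struct kiocb *kiocb, bool *done)
{
	const struct file_operations *fops = kiocb->ki_filp->f_op;

	/* Busy-poll until ->ki_complete() has run and set *done */
	while (!READ_ONCE(*done)) {
		if (!fops->iopoll || fops->iopoll(kiocb, true) < 0)
			break;
		cond_resched();
	}
}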

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
 
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29d8e2cfed0e..dedcc2e9265c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1787,6 +1788,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-- 
2.17.1


* [PATCH] io_uring: add io_uring_event cache hit information
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Add a hint on whether a read was served out of the page cache, or if it
hit media. This is useful for buffered async IO; O_DIRECT reads would
never have this set (for obvious reasons).

If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.
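
On the application side, consuming the hint after reaping a CQE is just
a flag test. A minimal sketch, assuming the uapi header from this series
is installed as <linux/io_uring.h>:

#include <linux/io_uring.h>

/* 'cqe' points at a reaped entry in the mmap'ed CQ ring */
static int read_was_cache_hit(const struct io_uring_cqe *cqe)
{
	if (cqe->res < 0)
		return -1;	/* the read itself failed */
	return (cqe->flags & IOCQE_FLAG_CACHEHIT) != 0;
}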

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 7 ++++++-
 include/uapi/linux/io_uring.h | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 00c7c11ca699..8114723517d6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -611,11 +611,16 @@ static void io_fput(struct io_kiocb *req)
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+	unsigned ev_flags = 0;
 
 	kiocb_end_write(kiocb);
 
 	io_fput(req);
-	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+
+	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
+		ev_flags = IOCQE_FLAG_CACHEHIT;
+
+	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
 	io_free_req(req);
 }
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e23408692118..24906e99fdc7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -69,6 +69,11 @@ struct io_uring_cqe {
 	__u32	flags;
 };
 
+/*
+ * io_uring_event->flags
+ */
+#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* IO did not hit media */
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */
-- 
2.17.1


* [PATCH 02/19] block: wire up block device iopoll method
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 58a4c1217fa8..f18d076a2596 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -293,6 +293,14 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -410,6 +418,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio->bi_opf |= REQ_HIPRI;
 
 			qc = submit_bio(bio);
+			WRITE_ONCE(iocb->ki_cookie, qc);
 			break;
 		}
 
@@ -2076,6 +2085,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
-- 
2.17.1


* [PATCH 03/19] block: add bio_set_polled() helper
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be actively found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.
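
The submitter-side contract this implies is sketched below with invented
names (struct my_req, my_async_retry() and async_wq are not in this
patch); it mirrors what the io_uring submission path later in this
series does: if the polled/nowait attempt comes back with -EAGAIN, punt
the retry to a context that is allowed to block instead of waiting
inline.

static ssize_t my_submit_polled(struct my_req *req, struct iov_iter *iter)
{
	struct kiocb *kiocb = &req->rw;
	ssize_t ret;

	/* kiocb was prepped with IOCB_HIPRI, so bios are marked nowait */
	ret = call_read_iter(kiocb->ki_filp, kiocb, iter);
	if (ret == -EAGAIN) {
		/*
		 * A request (or other resource) allocation would have
		 * blocked. Sleeping here could deadlock against polled IO
		 * we still need to reap, so hand the retry to a workqueue.
		 */
		INIT_WORK(&req->work, my_async_retry);
		queue_work(async_wq, &req->work);
		return 0;
	}
	return ret;
}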

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f18d076a2596..392e2bfb636f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -247,7 +247,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);
 
 	qc = submit_bio(&bio);
 	for (;;) {
@@ -415,7 +415,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);
 
 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
-- 
2.17.1


* [PATCH 04/19] iomap: wire up the iopoll method
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device.  Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Modified to use bio_set_polled().

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            | 43 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_file.c     |  1 +
 include/linux/iomap.h |  1 +
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a2dea5bc0427..58a768e59712 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
@@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
diff --git a/fs/iomap.c b/fs/iomap.c
index 897c60215dd1..2ac9eb746d44 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1463,6 +1463,28 @@ struct iomap_dio {
 	};
 };
 
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
+{
+	struct request_queue *q = READ_ONCE(kiocb->private);
+
+	if (!q)
+		return 0;
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
+}
+EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
+
+static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
+		struct bio *bio)
+{
+	atomic_inc(&dio->ref);
+
+	if (dio->iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, dio->iocb);
+
+	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+	dio->submit.cookie = submit_bio(bio);
+}
+
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
@@ -1575,7 +1597,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }
 
-static blk_qc_t
+static void
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
@@ -1589,15 +1611,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	if (dio->iocb->ki_flags & IOCB_HIPRI)
-		flags |= REQ_HIPRI;
-
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-
-	atomic_inc(&dio->ref);
-	return submit_bio(bio);
+	iomap_dio_submit_bio(dio, iomap, bio);
 }
 
 static loff_t
@@ -1700,9 +1717,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 				bio_set_pages_dirty(bio);
 		}
 
-		if (dio->iocb->ki_flags & IOCB_HIPRI)
-			bio->bi_opf |= REQ_HIPRI;
-
 		iov_iter_advance(dio->submit.iter, n);
 
 		dio->size += n;
@@ -1710,11 +1724,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied += n;
 
 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-		atomic_inc(&dio->ref);
-
-		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-		dio->submit.cookie = submit_bio(bio);
+		iomap_dio_submit_bio(dio, iomap, bio);
 	} while (nr_pages);
 
 	/*
@@ -1925,6 +1935,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
+	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+	WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
 	/*
 	 * We are about to drop our additional submission reference, which
 	 * might be the last reference to the dio.  There are three three
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..60c2da41f0fc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
 	.write_iter	= xfs_file_write_iter,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= xfs_file_compat_ioctl,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..0fefb5455bda 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
 		unsigned flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
 
 #ifdef CONFIG_SWAP
 struct file;
-- 
2.17.1


* [PATCH 05/19] Add io_uring IO interface
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.
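
As a rough application-side illustration of that layout (a sketch under
assumptions, not part of the patch: sqes, sq_array, sq_tail and sq_mask
stand for the mmap'ed SQE array, the SQ index array, and the ring's tail
and mask words, and the release store stands in for the write barrier
discussed in the fs/io_uring.c comments below):

#include <linux/io_uring.h>

/* Publish one sqe: fill a slot, point the SQ ring at it, bump the tail */
static void sq_push(struct io_uring_sqe *sqes, unsigned *sq_array,
		    unsigned *sq_tail, unsigned sq_mask,
		    const struct io_uring_sqe *sqe)
{
	unsigned tail = *sq_tail;	/* the app is the only tail writer */
	unsigned index = tail & sq_mask;

	sqes[index] = *sqe;		/* fill the sqe slot... */
	sq_array[index] = index;	/* ...and publish its index */

	/* order the sqe/array stores before the tail update */
	__atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);
}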

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up an io_uring instance for doing async IO. On success,
	returns a file descriptor that the application can mmap to
	gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time; this allows the
	kernel to return already completed events without waiting
	for them. This is useful only for polling, as for IRQ
	driven IO, the application can just check the CQ ring
	without entering the kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
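
For a concrete feel of the flow, here is a minimal liburing-based
sketch; the helper names follow the current liburing API and may differ
slightly from what is in the repo at the time of this posting:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <liburing.h>

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov = { .iov_base = malloc(4096), .iov_len = 4096 };
	int fd, ret;

	if (argc < 2 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);		/* grab a free sqe slot */
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);
	io_uring_sqe_set_data(sqe, (void *) 1);	/* becomes cqe->user_data */

	io_uring_submit(&ring);			/* one io_uring_enter(2) */
	ret = io_uring_wait_cqe(&ring, &cqe);	/* wait for the completion */
	if (!ret) {
		printf("res=%d data=%p\n", cqe->res, io_uring_cqe_get_data(cqe));
		io_uring_cqe_seen(&ring, cqe);	/* mark it consumed */
	}
	io_uring_queue_exit(&ring);
	return ret ? 1 : 0;
}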

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/Makefile                            |    1 +
 fs/io_uring.c                          | 1243 ++++++++++++++++++++++++
 include/linux/fs.h                     |    9 +
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    6 +
 include/uapi/asm-generic/unistd.h      |    6 +-
 include/uapi/linux/io_uring.h          |   95 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    2 +
 net/unix/garbage.c                     |    3 +
 12 files changed, 1378 insertions(+), 2 deletions(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..481c126259e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,5 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
+426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..6a32a430c8e0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,8 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+425	common	io_uring_setup		__x64_sys_io_uring_setup
+426	common	io_uring_enter		__x64_sys_io_uring_enter
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..8e15d6fc4340 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c
new file mode 100644
index 000000000000..1b28d38a9b76
--- /dev/null
+++ b/fs/io_uring.c
@@ -0,0 +1,1243 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared application/kernel submission and completion ring pairs, for
+ * supporting fast/efficient IO.
+ *
+ * A note on the read/write ordering memory barriers that are matched between
+ * the application and kernel side. When the application reads the CQ ring
+ * tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
+ * the kernel uses after writing the tail. Failure to do so could cause a
+ * delay in when the application notices that completion events are available.
+ * This isn't a fatal condition. Likewise, the application must use an
+ * appropriate smp_wmb() both before writing the SQ tail, and after writing
+ * the SQ tail. The first one orders the sqe writes with the tail write, and
+ * the latter is paired with the smp_rmb() the kernel will issue before
+ * reading the SQ tail on submission.
+ *
+ * Also see the examples in the liburing library:
+ *
+ *	git://git.kernel.dk/liburing
+ *
+ * io_uring also uses READ/WRITE_ONCE() for _any_ store or load that happens
+ * from data shared between the kernel and application. This is done both
+ * for ordering purposes, but also to ensure that once a value is loaded from
+ * data that the application could potentially modify, it remains stable.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/refcount.h>
+#include <linux/uio.h>
+
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_context.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/blkdev.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <net/af_unix.h>
+#include <linux/anon_inodes.h>
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+#include <linux/nospec.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "internal.h"
+
+#define IORING_MAX_ENTRIES	4096
+
+struct io_uring {
+	u32 head ____cacheline_aligned_in_smp;
+	u32 tail ____cacheline_aligned_in_smp;
+};
+
+struct io_sq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			dropped;
+	u32			flags;
+	u32			array[];
+};
+
+struct io_cq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			overflow;
+	struct io_uring_cqe	cqes[];
+};
+
+struct io_ring_ctx {
+	struct {
+		struct percpu_ref	refs;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned int		flags;
+		bool			compat;
+		bool			account_mem;
+
+		/* SQ ring */
+		struct io_sq_ring	*sq_ring;
+		unsigned		cached_sq_head;
+		unsigned		sq_entries;
+		unsigned		sq_mask;
+		struct io_uring_sqe	*sq_sqes;
+	} ____cacheline_aligned_in_smp;
+
+	/* IO offload */
+	struct workqueue_struct	*sqo_wq;
+	struct mm_struct	*sqo_mm;
+
+	struct {
+		/* CQ ring */
+		struct io_cq_ring	*cq_ring;
+		unsigned		cached_cq_tail;
+		unsigned		cq_entries;
+		unsigned		cq_mask;
+		struct wait_queue_head	cq_wait;
+		struct fasync_struct	*cq_fasync;
+	} ____cacheline_aligned_in_smp;
+
+	struct user_struct	*user;
+
+	struct completion	ctx_done;
+
+	struct {
+		struct mutex		uring_lock;
+		wait_queue_head_t	wait;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		spinlock_t		completion_lock;
+	} ____cacheline_aligned_in_smp;
+
+#if defined(CONFIG_UNIX)
+	struct socket		*ring_sock;
+#endif
+};
+
+struct sqe_submit {
+	const struct io_uring_sqe	*sqe;
+	unsigned short			index;
+	bool				has_user;
+};
+
+struct io_kiocb {
+	struct kiocb		rw;
+
+	struct sqe_submit	submit;
+
+	struct io_ring_ctx	*ctx;
+	struct list_head	list;
+	unsigned int		flags;
+#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+	u64			user_data;
+
+	struct work_struct	work;
+};
+
+#define IO_PLUG_THRESHOLD		2
+
+static struct kmem_cache *req_cachep;
+
+static const struct file_operations io_uring_fops;
+
+struct sock *io_uring_get_socket(struct file *file)
+{
+#if defined(CONFIG_UNIX)
+	if (file->f_op == &io_uring_fops) {
+		struct io_ring_ctx *ctx = file->private_data;
+
+		return ctx->ring_sock->sk;
+	}
+#endif
+	return NULL;
+}
+EXPORT_SYMBOL(io_uring_get_socket);
+
+static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+{
+	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
+
+	complete(&ctx->ctx_done);
+}
+
+static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	ctx->flags = p->flags;
+	init_waitqueue_head(&ctx->cq_wait);
+	init_completion(&ctx->ctx_done);
+	mutex_init(&ctx->uring_lock);
+	init_waitqueue_head(&ctx->wait);
+	spin_lock_init(&ctx->completion_lock);
+	return ctx;
+}
+
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+
+	if (ctx->cached_cq_tail != READ_ONCE(ring->r.tail)) {
+		/* order cqe stores with ring update */
+		smp_wmb();
+		WRITE_ONCE(ring->r.tail, ctx->cached_cq_tail);
+		/*
+		 * Write side barrier of tail update, app has read side. See
+		 * comment at the top of this file.
+		 */
+		smp_wmb();
+
+		if (wq_has_sleeper(&ctx->cq_wait)) {
+			wake_up_interruptible(&ctx->cq_wait);
+			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+		}
+	}
+}
+
+static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	unsigned tail;
+
+	tail = ctx->cached_cq_tail;
+	/* See comment at the top of the file */
+	smp_rmb();
+	if (tail + 1 == READ_ONCE(ring->r.head))
+		return NULL;
+
+	ctx->cached_cq_tail++;
+	return &ring->cqes[tail & ctx->cq_mask];
+}
+
+static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				 long res, unsigned ev_flags)
+{
+	struct io_uring_cqe *cqe;
+
+	/*
+	 * If we can't get a cq entry, userspace overflowed the
+	 * submission (by quite a lot). Increment the overflow count in
+	 * the ring.
+	 */
+	cqe = io_get_cqring(ctx);
+	if (cqe) {
+		WRITE_ONCE(cqe->user_data, ki_user_data);
+		WRITE_ONCE(cqe->res, res);
+		WRITE_ONCE(cqe->flags, ev_flags);
+	} else {
+		unsigned overflow = READ_ONCE(ctx->cq_ring->overflow);
+
+		WRITE_ONCE(ctx->cq_ring->overflow, overflow + 1);
+	}
+}
+
+static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				long res, unsigned ev_flags)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	io_cqring_fill_event(ctx, ki_user_data, res, ev_flags);
+	io_commit_cqring(ctx);
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
+{
+	percpu_ref_put_many(&ctx->refs, refs);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	if (!percpu_ref_tryget(&ctx->refs))
+		return NULL;
+
+	req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
+	if (req) {
+		req->ctx = ctx;
+		req->flags = 0;
+		return req;
+	}
+
+	io_ring_drop_ctx_refs(ctx, 1);
+	return NULL;
+}
+
+static void io_free_req(struct io_kiocb *req)
+{
+	io_ring_drop_ctx_refs(req->ctx, 1);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void kiocb_end_write(struct kiocb *kiocb)
+{
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct inode *inode = file_inode(kiocb->ki_filp);
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		if (S_ISREG(inode->i_mode))
+			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+		file_end_write(kiocb->ki_filp);
+	}
+}
+
+static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	fput(kiocb->ki_filp);
+	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	io_free_req(req);
+}
+
+/*
+ * If we tracked the file through the SCM inflight mechanism, we could support
+ * any file. For now, just ensure that anything potentially problematic is done
+ * inline.
+ */
+static bool io_file_supports_async(struct file *file)
+{
+	umode_t mode = file_inode(file)->i_mode;
+
+	if (S_ISBLK(mode) || S_ISCHR(mode))
+		return true;
+	if (S_ISREG(mode) && file->f_op != &io_uring_fops)
+		return true;
+
+	return false;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		      bool force_nonblock)
+{
+	struct kiocb *kiocb = &req->rw;
+	unsigned ioprio;
+	int fd, ret;
+
+	/* For -EAGAIN retry, everything is already prepped */
+	if (kiocb->ki_filp)
+		return 0;
+
+	fd = READ_ONCE(sqe->fd);
+	kiocb->ki_filp = fget(fd);
+	if (unlikely(!kiocb->ki_filp))
+		return -EBADF;
+	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+		force_nonblock = false;
+	kiocb->ki_pos = READ_ONCE(sqe->off);
+	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
+	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
+		if (ret)
+			goto out_fput;
+
+		kiocb->ki_ioprio = ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		goto out_fput;
+	if (force_nonblock) {
+		kiocb->ki_flags |= IOCB_NOWAIT;
+		req->flags |= REQ_F_FORCE_NONBLOCK;
+	}
+	if (kiocb->ki_flags & IOCB_HIPRI) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	kiocb->ki_complete = io_complete_rw;
+	return 0;
+out_fput:
+	fput(kiocb->ki_filp);
+	return ret;
+}
+
+static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
+{
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/*
+		 * We can't just restart the syscall, since previously
+		 * submitted sqes may already be in progress. Just fail this
+		 * IO with EINTR.
+		 */
+		ret = -EINTR;
+		/* fall through */
+	default:
+		kiocb->ki_complete(kiocb, ret, 0);
+	}
+}
+
+static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
+			   const struct sqe_submit *s, struct iovec **iovec,
+			   struct iov_iter *iter)
+{
+	const struct io_uring_sqe *sqe = s->sqe;
+	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	size_t sqe_len = READ_ONCE(sqe->len);
+
+	if (!s->has_user)
+		return -EFAULT;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat)
+		return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
+						iovec, iter);
+#endif
+
+	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
+}
+
+static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
+		       bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->read_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	if (!ret) {
+		ssize_t ret2;
+
+		/* Catch -EAGAIN return for forced non-blocking submission */
+		ret2 = call_read_iter(file, kiocb, &iter);
+		if (!force_nonblock || ret2 != -EAGAIN)
+			io_rw_done(kiocb, ret2);
+		else
+			ret = -EAGAIN;
+	}
+	kfree(iovec);
+out_fput:
+	/* Hold on to the file for -EAGAIN */
+	if (unlikely(ret && ret != -EAGAIN))
+		fput(file);
+	return ret;
+}
+
+static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
+			bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	if (ret)
+		return ret;
+	/* Hold on to the file for -EAGAIN */
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
+		return -EAGAIN;
+
+	ret = -EBADF;
+	file = kiocb->ki_filp;
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->write_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
+				iov_iter_count(&iter));
+	if (!ret) {
+		/*
+		 * Open-code file_start_write here to grab freeze protection,
+		 * which will be released by another thread in
+		 * io_complete_rw().  Fool lockdep by telling it the lock got
+		 * released so that it doesn't complain about the held lock when
+		 * we return to userspace.
+		 */
+		if (S_ISREG(file_inode(file)->i_mode)) {
+			__sb_start_write(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE, true);
+			__sb_writers_release(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE);
+		}
+		kiocb->ki_flags |= IOCB_WRITE;
+		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+/*
+ * IORING_OP_NOP just posts a completion event, nothing else.
+ */
+static int io_nop(struct io_kiocb *req, u64 user_data)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	long err = 0;
+
+	/*
+	 * Twilight zone - it's possible that someone issued an opcode that
+	 * has a file attached, then got -EAGAIN on submission, and changed
+	 * the sqe before we retried it from async context. Avoid dropping
+	 * a file reference for this malicious case, and flag the error.
+	 */
+	if (req->rw.ki_filp) {
+		err = -EBADF;
+		fput(req->rw.ki_filp);
+	}
+	io_cqring_add_event(ctx, user_data, err, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			   const struct sqe_submit *s, bool force_nonblock)
+{
+	ssize_t ret;
+	int opcode;
+
+	if (unlikely(s->index >= ctx->sq_entries))
+		return -EINVAL;
+	req->user_data = READ_ONCE(s->sqe->user_data);
+
+	opcode = READ_ONCE(s->sqe->opcode);
+	switch (opcode) {
+	case IORING_OP_NOP:
+		ret = io_nop(req, req->user_data);
+		break;
+	case IORING_OP_READV:
+		ret = io_read(req, s, force_nonblock);
+		break;
+	case IORING_OP_WRITEV:
+		ret = io_write(req, s, force_nonblock);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static void io_sq_wq_submit_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct sqe_submit *s = &req->submit;
+	const struct io_uring_sqe *sqe = s->sqe;
+	struct io_ring_ctx *ctx = req->ctx;
+	mm_segment_t old_fs = get_fs();
+	int ret;
+
+	 /* Ensure we clear previously set forced non-block flag */
+	req->flags &= ~REQ_F_FORCE_NONBLOCK;
+	req->rw.ki_flags &= ~IOCB_NOWAIT;
+
+	if (!mmget_not_zero(ctx->sqo_mm)) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	use_mm(ctx->sqo_mm);
+	set_fs(USER_DS);
+	s->has_user = true;
+
+	ret = __io_submit_sqe(ctx, req, s, false);
+
+	set_fs(old_fs);
+	unuse_mm(ctx->sqo_mm);
+	mmput(ctx->sqo_mm);
+err:
+	if (ret) {
+		io_cqring_add_event(ctx, sqe->user_data, ret, 0);
+		io_free_req(req);
+	}
+
+	/* async context always uses a copy of the sqe */
+	kfree(sqe);
+}
+
+static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_kiocb *req;
+	ssize_t ret;
+
+	/* enforce forwards compatibility on users */
+	if (unlikely(s->sqe->flags))
+		return -EINVAL;
+
+	req = io_get_req(ctx);
+	if (unlikely(!req))
+		return -EAGAIN;
+
+	req->rw.ki_filp = NULL;
+
+	ret = __io_submit_sqe(ctx, req, s, true);
+	if (ret == -EAGAIN) {
+		struct io_uring_sqe *sqe_copy;
+
+		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
+		if (sqe_copy) {
+			memcpy(sqe_copy, s->sqe, sizeof(*sqe_copy));
+			s->sqe = sqe_copy;
+
+			memcpy(&req->submit, s, sizeof(*s));
+			INIT_WORK(&req->work, io_sq_wq_submit_work);
+			queue_work(ctx->sqo_wq, &req->work);
+			ret = 0;
+		}
+	}
+	if (ret)
+		io_free_req(req);
+
+	return ret;
+}
+
+static void io_commit_sqring(struct io_ring_ctx *ctx)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+
+	if (ctx->cached_sq_head != READ_ONCE(ring->r.head)) {
+		WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
+		/*
+		 * write side barrier of head update, app has read side. See
+		 * comment at the top of this file
+		 */
+		smp_wmb();
+	}
+}
+
+/*
+ * Undo last io_get_sqring()
+ */
+static void io_drop_sqring(struct io_ring_ctx *ctx)
+{
+	ctx->cached_sq_head--;
+}
+
+/*
+ * Fetch an sqe, if one is available. Note that s->sqe will point to memory
+ * that is mapped by userspace. This means that care needs to be taken to
+ * ensure that reads are stable, as we cannot rely on userspace always
+ * being a good citizen. If members of the sqe are validated and then later
+ * used, it's important that those reads are done through READ_ONCE() to
+ * prevent a re-load down the line.
+ */
+static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+	unsigned head;
+
+	/*
+	 * The cached sq head (or cq tail) serves two purposes:
+	 *
+	 * 1) allows us to batch the cost of updating the user visible
+	 *    head.
+	 * 2) allows the kernel side to track the head on its own, even
+	 *    though the application is the one updating it.
+	 */
+	head = ctx->cached_sq_head;
+	/* See comment at the top of this file */
+	smp_rmb();
+	if (head == READ_ONCE(ring->r.tail))
+		return false;
+
+	head = READ_ONCE(ring->array[head & ctx->sq_mask]);
+	if (head < ctx->sq_entries) {
+		s->index = head;
+		s->sqe = &ctx->sq_sqes[head];
+		ctx->cached_sq_head++;
+		return true;
+	}
+
+	/* drop invalid entries */
+	ctx->cached_sq_head++;
+	ring->dropped++;
+	/* See comment at the top of this file */
+	smp_wmb();
+	return false;
+}
+
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+{
+	int i, ret = 0, submit = 0;
+	struct blk_plug plug;
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
+
+	for (i = 0; i < to_submit; i++) {
+		struct sqe_submit s;
+
+		if (!io_get_sqring(ctx, &s))
+			break;
+
+		s.has_user = true;
+		ret = io_submit_sqe(ctx, &s);
+		if (ret) {
+			io_drop_sqring(ctx);
+			break;
+		}
+
+		submit++;
+	}
+	io_commit_sqring(ctx);
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
+
+	return submit ? submit : ret;
+}
+
+/*
+ * Wait until events become available, if we don't already have some. The
+ * application must reap them itself, as they reside on the shared cq ring.
+ */
+static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
+			  const sigset_t __user *sig, size_t sigsz)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	sigset_t ksigmask, sigsaved;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	/* See comment at the top of this file */
+	smp_rmb();
+	if (READ_ONCE(ring->r.head) != READ_ONCE(ring->r.tail))
+		return 0;
+	if (!min_events)
+		return 0;
+
+	if (sig) {
+		ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz);
+		if (ret)
+			return ret;
+	}
+
+	do {
+		prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
+
+		ret = 0;
+		/* See comment at the top of this file */
+		smp_rmb();
+		if (READ_ONCE(ring->r.head) != READ_ONCE(ring->r.tail))
+			break;
+
+		schedule();
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+	} while (1);
+
+	finish_wait(&ctx->wait, &wait);
+
+	if (sig)
+		restore_user_sigmask(sig, &sigsaved);
+
+	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
+}
+
+static int io_sq_offload_start(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	mmgrab(current->mm);
+	ctx->sqo_mm = current->mm;
+
+	/* Do QD, or 2 * CPUS, whichever is smaller */
+	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
+	if (!ctx->sqo_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+err:
+	mmdrop(ctx->sqo_mm);
+	ctx->sqo_mm = NULL;
+	return ret;
+}
+
+static void io_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static int io_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+					new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void io_mem_free(void *ptr)
+{
+	struct page *page = virt_to_head_page(ptr);
+
+	if (put_page_testzero(page))
+		free_compound_page(page);
+}
+
+static void *io_mem_alloc(size_t size)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
+				__GFP_NORETRY;
+
+	return (void *) __get_free_pages(gfp_flags, get_order(size));
+}
+
+static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t bytes;
+
+	bytes = struct_size(sq_ring, array, sq_entries);
+	bytes += array_size(sizeof(struct io_uring_sqe), sq_entries);
+	bytes += struct_size(cq_ring, cqes, cq_entries);
+
+	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+}
+
+static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+{
+	if (ctx->sqo_wq)
+		destroy_workqueue(ctx->sqo_wq);
+	if (ctx->sqo_mm)
+		mmdrop(ctx->sqo_mm);
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock)
+		sock_release(ctx->ring_sock);
+#endif
+
+	io_mem_free(ctx->sq_ring);
+	io_mem_free(ctx->sq_sqes);
+	io_mem_free(ctx->cq_ring);
+
+	percpu_ref_exit(&ctx->refs);
+	if (ctx->account_mem)
+		io_unaccount_mem(ctx->user,
+				ring_pages(ctx->sq_entries, ctx->cq_entries));
+	free_uid(ctx->user);
+	kfree(ctx);
+}
+
+static __poll_t io_uring_poll(struct file *file, poll_table *wait)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ctx->cq_wait, wait);
+	/* See comment at the top of this file */
+	smp_rmb();
+	if (READ_ONCE(ctx->sq_ring->r.tail) + 1 != ctx->cached_sq_head)
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	if (READ_ONCE(ctx->cq_ring->r.head) != ctx->cached_cq_tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int io_uring_fasync(int fd, struct file *file, int on)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	return fasync_helper(fd, file, on, &ctx->cq_fasync);
+}
+
+static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	percpu_ref_kill(&ctx->refs);
+	mutex_unlock(&ctx->uring_lock);
+
+	wait_for_completion(&ctx->ctx_done);
+	io_ring_ctx_free(ctx);
+}
+
+static int io_uring_release(struct inode *inode, struct file *file)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	file->private_data = NULL;
+	io_ring_ctx_wait_and_kill(ctx);
+	return 0;
+}
+
+static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long sz = vma->vm_end - vma->vm_start;
+	struct io_ring_ctx *ctx = file->private_data;
+	unsigned long pfn;
+	struct page *page;
+	void *ptr;
+
+	switch (offset) {
+	case IORING_OFF_SQ_RING:
+		ptr = ctx->sq_ring;
+		break;
+	case IORING_OFF_SQES:
+		ptr = ctx->sq_sqes;
+		break;
+	case IORING_OFF_CQ_RING:
+		ptr = ctx->cq_ring;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > (PAGE_SIZE << compound_order(page)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
+		u32, min_complete, u32, flags, const sigset_t __user *, sig,
+		size_t, sigsz)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	int submitted = 0;
+	struct fd f;
+
+	if (flags & ~IORING_ENTER_GETEVENTS)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	if (to_submit) {
+		to_submit = min(to_submit, ctx->sq_entries);
+
+		mutex_lock(&ctx->uring_lock);
+		submitted = io_ring_submit(ctx, to_submit);
+		mutex_unlock(&ctx->uring_lock);
+
+		if (submitted < 0)
+			goto out_ctx;
+	}
+	if (flags & IORING_ENTER_GETEVENTS) {
+		/*
+		 * The application could have included the 'to_submit' count
+		 * in how many events it wanted to wait for. If we failed to
+		 * submit the desired count, we may need to adjust the number
+		 * of events to poll/wait for.
+		 */
+		if (submitted < to_submit)
+			min_complete = min_t(unsigned, submitted, min_complete);
+
+		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+	}
+
+out_ctx:
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return submitted ? submitted : ret;
+}
+
+static const struct file_operations io_uring_fops = {
+	.release	= io_uring_release,
+	.mmap		= io_uring_mmap,
+	.poll		= io_uring_poll,
+	.fasync		= io_uring_fasync,
+};
+
+static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+				  struct io_uring_params *p)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t size;
+
+	sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries));
+	if (!sq_ring)
+		return -ENOMEM;
+
+	ctx->sq_ring = sq_ring;
+	sq_ring->ring_mask = p->sq_entries - 1;
+	sq_ring->ring_entries = p->sq_entries;
+	ctx->sq_mask = sq_ring->ring_mask;
+	ctx->sq_entries = sq_ring->ring_entries;
+
+	size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+	if (size == SIZE_MAX)
+		return -EOVERFLOW;
+
+	ctx->sq_sqes = io_mem_alloc(size);
+	if (!ctx->sq_sqes) {
+		io_mem_free(ctx->sq_ring);
+		return -ENOMEM;
+	}
+
+	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
+	if (!cq_ring) {
+		io_mem_free(ctx->sq_ring);
+		io_mem_free(ctx->sq_sqes);
+		return -ENOMEM;
+	}
+
+	ctx->cq_ring = cq_ring;
+	cq_ring->ring_mask = p->cq_entries - 1;
+	cq_ring->ring_entries = p->cq_entries;
+	ctx->cq_mask = cq_ring->ring_mask;
+	ctx->cq_entries = cq_ring->ring_entries;
+	return 0;
+}
+
+/*
+ * Allocate an anonymous fd; this is what constitutes the application
+ * visible backing of an io_uring instance. The application mmaps this
+ * fd to gain access to the SQ/CQ ring details. If UNIX sockets are enabled,
+ * we have to tie this fd to a socket for file garbage collection purposes.
+ */
+static int io_uring_get_fd(struct io_ring_ctx *ctx)
+{
+	struct file *file;
+	int ret;
+
+#if defined(CONFIG_UNIX)
+	ret = sock_create_kern(&init_net, PF_UNIX, SOCK_RAW, IPPROTO_IP,
+				&ctx->ring_sock);
+	if (ret)
+		return ret;
+#endif
+
+	ret = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto err;
+
+	file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx,
+					O_RDWR | O_CLOEXEC);
+	if (IS_ERR(file)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(file);
+		goto err;
+	}
+
+#if defined(CONFIG_UNIX)
+	ctx->ring_sock->file = file;
+#endif
+	fd_install(ret, file);
+	return ret;
+err:
+#if defined(CONFIG_UNIX)
+	sock_release(ctx->ring_sock);
+	ctx->ring_sock = NULL;
+#endif
+	return ret;
+}
+
+static int io_uring_create(unsigned entries, struct io_uring_params *p)
+{
+	struct user_struct *user = NULL;
+	struct io_ring_ctx *ctx;
+	bool account_mem;
+	int ret;
+
+	if (!entries || entries > IORING_MAX_ENTRIES)
+		return -EINVAL;
+
+	/*
+	 * Use twice as many entries for the CQ ring. It's possible for the
+	 * application to drive a higher depth than the size of the SQ ring,
+	 * since the sqes are only used at submission time. This allows for
+	 * some flexibility in overcommitting a bit.
+	 */
+	p->sq_entries = roundup_pow_of_two(entries);
+	p->cq_entries = 2 * p->sq_entries;
+
+	user = get_uid(current_user());
+	account_mem = !capable(CAP_IPC_LOCK);
+
+	if (account_mem) {
+		ret = io_account_mem(user,
+				ring_pages(p->sq_entries, p->cq_entries));
+		if (ret) {
+			free_uid(user);
+			return ret;
+		}
+	}
+
+	ctx = io_ring_ctx_alloc(p);
+	if (!ctx) {
+		if (account_mem)
+			io_unaccount_mem(user, ring_pages(p->sq_entries,
+								p->cq_entries));
+		free_uid(user);
+		return -ENOMEM;
+	}
+	ctx->compat = in_compat_syscall();
+	ctx->account_mem = account_mem;
+	ctx->user = user;
+
+	ret = io_allocate_scq_urings(ctx, p);
+	if (ret)
+		goto err;
+
+	ret = io_sq_offload_start(ctx);
+	if (ret)
+		goto err;
+
+	ret = io_uring_get_fd(ctx);
+	if (ret < 0)
+		goto err;
+
+	memset(&p->sq_off, 0, sizeof(p->sq_off));
+	p->sq_off.head = offsetof(struct io_sq_ring, r.head);
+	p->sq_off.tail = offsetof(struct io_sq_ring, r.tail);
+	p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask);
+	p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries);
+	p->sq_off.flags = offsetof(struct io_sq_ring, flags);
+	p->sq_off.dropped = offsetof(struct io_sq_ring, dropped);
+	p->sq_off.array = offsetof(struct io_sq_ring, array);
+
+	memset(&p->cq_off, 0, sizeof(p->cq_off));
+	p->cq_off.head = offsetof(struct io_cq_ring, r.head);
+	p->cq_off.tail = offsetof(struct io_cq_ring, r.tail);
+	p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask);
+	p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries);
+	p->cq_off.overflow = offsetof(struct io_cq_ring, overflow);
+	p->cq_off.cqes = offsetof(struct io_cq_ring, cqes);
+	return ret;
+err:
+	io_ring_ctx_wait_and_kill(ctx);
+	return ret;
+}
+
+/*
+ * Sets up an io_uring context and returns the fd. The application asks for a
+ * ring size; we return the actual sq/cq ring sizes (among other things) in the
+ * params structure passed in.
+ */
+static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
+{
+	struct io_uring_params p;
+	long ret;
+	int i;
+
+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
+	for (i = 0; i < ARRAY_SIZE(p.resv); i++) {
+		if (p.resv[i])
+			return -EINVAL;
+	}
+
+	if (p.flags)
+		return -EINVAL;
+
+	ret = io_uring_create(entries, &p);
+	if (ret < 0)
+		return ret;
+
+	if (copy_to_user(params, &p, sizeof(p)))
+		return -EFAULT;
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(io_uring_setup, u32, entries,
+		struct io_uring_params __user *, params)
+{
+	return io_uring_setup(entries, params);
+}
+
+static int __init io_uring_init(void)
+{
+	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	return 0;
+};
+__initcall(io_uring_init);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dedcc2e9265c..61aa210f0c2b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3517,4 +3517,13 @@ extern void inode_nohighmem(struct inode *inode);
 extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 		       int advice);
 
+#if defined(CONFIG_IO_URING)
+extern struct sock *io_uring_get_socket(struct file *file);
+#else
+static inline struct sock *io_uring_get_socket(struct file *file)
+{
+	return NULL;
+}
+#endif
+
 #endif /* _LINUX_FS_H */
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 257cccba3062..3072dbaa7869 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct file_handle;
 struct sigaltstack;
 struct rseq;
 union bpf_attr;
+struct io_uring_params;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -309,6 +310,11 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
 				struct io_event __user *events,
 				struct old_timespec32 __user *timeout,
 				const struct __aio_sigset *sig);
+asmlinkage long sys_io_uring_setup(u32 entries,
+				struct io_uring_params __user *p);
+asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
+				u32 min_complete, u32 flags,
+				const sigset_t __user *sig, size_t sigsz);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d90127298f12..87871e7b7ea7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -740,9 +740,13 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
 __SYSCALL(__NR_rseq, sys_rseq)
 #define __NR_kexec_file_load 294
 __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
+#define __NR_io_uring_setup 425
+__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
+#define __NR_io_uring_enter 426
+__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 
 #undef __NR_syscalls
-#define __NR_syscalls 295
+#define __NR_syscalls 427
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
new file mode 100644
index 000000000000..ac692823d6f4
--- /dev/null
+++ b/include/uapi/linux/io_uring.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Header file for the io_uring interface.
+ *
+ * Copyright (C) 2019 Jens Axboe
+ * Copyright (C) 2019 Christoph Hellwig
+ */
+#ifndef LINUX_IO_URING_H
+#define LINUX_IO_URING_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+/*
+ * IO submission data structure (Submission Queue Entry)
+ */
+struct io_uring_sqe {
+	__u8	opcode;		/* type of operation for this sqe */
+	__u8	flags;		/* as of now unused */
+	__u16	ioprio;		/* ioprio for the request */
+	__s32	fd;		/* file descriptor to do IO on */
+	__u64	off;		/* offset into file */
+	__u64	addr;		/* pointer to buffer or iovecs */
+	__u32	len;		/* buffer size or number of iovecs */
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		__resv;
+	};
+	__u64	user_data;	/* data to be passed back at completion time */
+	__u64	__pad2[3];
+};
+
+#define IORING_OP_NOP		0
+#define IORING_OP_READV		1
+#define IORING_OP_WRITEV	2
+
+/*
+ * IO completion data structure (Completion Queue Entry)
+ */
+struct io_uring_cqe {
+	__u64	user_data;	/* sqe->user_data, passed back at completion */
+	__s32	res;		/* result code for this event */
+	__u32	flags;
+};
+
+/*
+ * Magic offsets for the application to mmap the data it needs
+ */
+#define IORING_OFF_SQ_RING		0ULL
+#define IORING_OFF_CQ_RING		0x8000000ULL
+#define IORING_OFF_SQES			0x10000000ULL
+
+/*
+ * Filled with the offsets for mmap(2)
+ */
+struct io_sqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 flags;
+	__u32 dropped;
+	__u32 array;
+	__u32 resv1;
+	__u64 resv2;
+};
+
+struct io_cqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 overflow;
+	__u32 cqes;
+	__u64 resv[2];
+};
+
+/*
+ * io_uring_enter(2) flags
+ */
+#define IORING_ENTER_GETEVENTS	(1U << 0)
+
+/*
+ * Passed in for io_uring_setup(2). Copied back with updated info on success
+ */
+struct io_uring_params {
+	__u32 sq_entries;
+	__u32 cq_entries;
+	__u32 flags;
+	__u32 resv[7];
+	struct io_sqring_offsets sq_off;
+	struct io_cqring_offsets cq_off;
+};
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index c9386a365eea..53b54214a36e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1414,6 +1414,15 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.
 
+config IO_URING
+	bool "Enable IO uring support" if EXPERT
+	select ANON_INODES
+	default y
+	help
+	  This option enables support for the io_uring interface, which allows
+	  applications to submit and complete IO through submission and
+	  completion rings that are shared between the kernel and application.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ab9d0e3c6d50..ee5e523564bb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents);
 COND_SYSCALL(io_pgetevents);
 COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
+COND_SYSCALL(io_uring_setup);
+COND_SYSCALL(io_uring_enter);
 
 /* fs/xattr.c */
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index c36757e72844..f81854d74c7d 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -108,6 +108,9 @@ struct sock *unix_get_socket(struct file *filp)
 		/* PF_UNIX ? */
 		if (s && sock->ops && sock->ops->family == PF_UNIX)
 			u_sock = s;
+	} else {
+		/* Could be an io_uring instance */
+		u_sock = io_uring_get_socket(filp);
 	}
 	return u_sock;
 }
-- 
2.17.1



* [PATCH 05/19] Add io_uring IO interface
@ 2019-02-11 19:00   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring holds indices into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up an io_uring instance for doing async IO. On success,
	returns a file descriptor that the application can mmap to
	gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try to submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time; this allows the
	kernel to return already completed events without waiting
	for them. This is useful only for polling, as for IRQ
	driven IO, the application can just check the CQ ring
	without entering the kernel.
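
For illustration only (this sketch is not part of the patch), driving
the two syscalls above without any library support could look roughly
like the program below. It assumes the syscall numbers assigned in this
series (425/426), that the uapi header added here has been installed,
and it uses a GCC full barrier as a conservative stand-in for the
smp_wmb() pairing described at the top of fs/io_uring.c. Most error
handling is omitted.

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* syscall numbers as assigned in this series */
#define __NR_io_uring_setup	425
#define __NR_io_uring_enter	426

int main(void)
{
	struct io_uring_params p;
	struct io_uring_sqe *sqes;
	unsigned char *sq_ring;
	unsigned *sq_tail, *sq_mask, *sq_array;
	unsigned tail, index;
	int fd;

	memset(&p, 0, sizeof(p));
	fd = syscall(__NR_io_uring_setup, 4, &p);
	if (fd < 0)
		return 1;

	/* map the SQ ring and the sqe array (CQ ring mapping not shown) */
	sq_ring = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(unsigned),
		       PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		       IORING_OFF_SQ_RING);
	sqes = mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, IORING_OFF_SQES);
	if (sq_ring == MAP_FAILED || sqes == MAP_FAILED)
		return 1;

	sq_tail  = (unsigned *)(sq_ring + p.sq_off.tail);
	sq_mask  = (unsigned *)(sq_ring + p.sq_off.ring_mask);
	sq_array = (unsigned *)(sq_ring + p.sq_off.array);

	/* queue a single NOP at the current tail */
	tail = *sq_tail;
	index = tail & *sq_mask;
	memset(&sqes[index], 0, sizeof(sqes[index]));
	sqes[index].opcode = IORING_OP_NOP;
	sqes[index].user_data = 0x1234;
	sq_array[index] = index;
	__sync_synchronize();	/* order sqe/array stores before the tail */
	*sq_tail = tail + 1;
	__sync_synchronize();	/* publish the new tail before entering */

	/* submit the sqe and wait for its completion in one call */
	return syscall(__NR_io_uring_enter, fd, 1, 1,
		       IORING_ENTER_GETEVENTS, NULL, 0) == 1 ? 0 : 1;
}

In practice applications are expected to use the liburing helpers
referenced below rather than open-coding the ring handling.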

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
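
As a rough sketch of that (again not part of the patch, helper name is
arbitrary): once the CQ ring has been mmap'ed at IORING_OFF_CQ_RING,
completions can be reaped entirely in userspace. The full barrier below
is a conservative stand-in for the smp_rmb()/smp_wmb() pairing
documented at the top of fs/io_uring.c, and 'p' is the io_uring_params
filled in by io_uring_setup(2):

static int reap_cqe(unsigned char *cq_ring, struct io_uring_params *p,
		    struct io_uring_cqe *out)
{
	unsigned *head = (unsigned *)(cq_ring + p->cq_off.head);
	unsigned *tail = (unsigned *)(cq_ring + p->cq_off.tail);
	unsigned *mask = (unsigned *)(cq_ring + p->cq_off.ring_mask);
	struct io_uring_cqe *cqes = (void *)(cq_ring + p->cq_off.cqes);
	unsigned h = *head;

	__sync_synchronize();	/* pairs with the kernel's tail update */
	if (h == *tail)
		return 0;	/* nothing completed, no syscall needed */

	*out = cqes[h & *mask];
	__sync_synchronize();	/* consume the cqe before moving the head */
	*head = h + 1;
	return 1;
}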

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed just as
quickly as it would be with a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/Makefile                            |    1 +
 fs/io_uring.c                          | 1243 ++++++++++++++++++++++++
 include/linux/fs.h                     |    9 +
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    6 +
 include/uapi/asm-generic/unistd.h      |    6 +-
 include/uapi/linux/io_uring.h          |   95 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    2 +
 net/unix/garbage.c                     |    3 +
 12 files changed, 1378 insertions(+), 2 deletions(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..481c126259e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,5 @@
 384	i386	arch_prctl		sys_arch_prctl			__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents		sys_io_pgetevents		__ia32_compat_sys_io_pgetevents
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
+425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
+426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..6a32a430c8e0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,8 @@
 332	common	statx			__x64_sys_statx
 333	common	io_pgetevents		__x64_sys_io_pgetevents
 334	common	rseq			__x64_sys_rseq
+425	common	io_uring_setup		__x64_sys_io_uring_setup
+426	common	io_uring_enter		__x64_sys_io_uring_enter
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..8e15d6fc4340 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)               += aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)      += locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c
new file mode 100644
index 000000000000..1b28d38a9b76
--- /dev/null
+++ b/fs/io_uring.c
@@ -0,0 +1,1243 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared application/kernel submission and completion ring pairs, for
+ * supporting fast/efficient IO.
+ *
+ * A note on the read/write ordering memory barriers that are matched between
+ * the application and kernel side. When the application reads the CQ ring
+ * tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
+ * the kernel uses after writing the tail. Failure to do so could cause a
+ * delay in the application noticing that completion events are available.
+ * This isn't a fatal condition. Likewise, the application must use an
+ * appropriate smp_wmb() both before writing the SQ tail, and after writing
+ * the SQ tail. The first one orders the sqe writes with the tail write, and
+ * the latter is paired with the smp_rmb() the kernel will issue before
+ * reading the SQ tail on submission.
+ *
+ * Also see the examples in the liburing library:
+ *
+ *	git://git.kernel.dk/liburing
+ *
+ * io_uring also uses READ/WRITE_ONCE() for _any_ store or load done on
+ * data shared between the kernel and application. This is done both
+ * for ordering purposes and to ensure that once a value is loaded from
+ * data that the application could potentially modify, it remains stable.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/refcount.h>
+#include <linux/uio.h>
+
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_context.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/blkdev.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <net/af_unix.h>
+#include <linux/anon_inodes.h>
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+#include <linux/nospec.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "internal.h"
+
+#define IORING_MAX_ENTRIES	4096
+
+struct io_uring {
+	u32 head ____cacheline_aligned_in_smp;
+	u32 tail ____cacheline_aligned_in_smp;
+};
+
+struct io_sq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			dropped;
+	u32			flags;
+	u32			array[];
+};
+
+struct io_cq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			overflow;
+	struct io_uring_cqe	cqes[];
+};
+
+struct io_ring_ctx {
+	struct {
+		struct percpu_ref	refs;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned int		flags;
+		bool			compat;
+		bool			account_mem;
+
+		/* SQ ring */
+		struct io_sq_ring	*sq_ring;
+		unsigned		cached_sq_head;
+		unsigned		sq_entries;
+		unsigned		sq_mask;
+		struct io_uring_sqe	*sq_sqes;
+	} ____cacheline_aligned_in_smp;
+
+	/* IO offload */
+	struct workqueue_struct	*sqo_wq;
+	struct mm_struct	*sqo_mm;
+
+	struct {
+		/* CQ ring */
+		struct io_cq_ring	*cq_ring;
+		unsigned		cached_cq_tail;
+		unsigned		cq_entries;
+		unsigned		cq_mask;
+		struct wait_queue_head	cq_wait;
+		struct fasync_struct	*cq_fasync;
+	} ____cacheline_aligned_in_smp;
+
+	struct user_struct	*user;
+
+	struct completion	ctx_done;
+
+	struct {
+		struct mutex		uring_lock;
+		wait_queue_head_t	wait;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		spinlock_t		completion_lock;
+	} ____cacheline_aligned_in_smp;
+
+#if defined(CONFIG_UNIX)
+	struct socket		*ring_sock;
+#endif
+};
+
+struct sqe_submit {
+	const struct io_uring_sqe	*sqe;
+	unsigned short			index;
+	bool				has_user;
+};
+
+struct io_kiocb {
+	struct kiocb		rw;
+
+	struct sqe_submit	submit;
+
+	struct io_ring_ctx	*ctx;
+	struct list_head	list;
+	unsigned int		flags;
+#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+	u64			user_data;
+
+	struct work_struct	work;
+};
+
+#define IO_PLUG_THRESHOLD		2
+
+static struct kmem_cache *req_cachep;
+
+static const struct file_operations io_uring_fops;
+
+struct sock *io_uring_get_socket(struct file *file)
+{
+#if defined(CONFIG_UNIX)
+	if (file->f_op == &io_uring_fops) {
+		struct io_ring_ctx *ctx = file->private_data;
+
+		return ctx->ring_sock->sk;
+	}
+#endif
+	return NULL;
+}
+EXPORT_SYMBOL(io_uring_get_socket);
+
+static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+{
+	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
+
+	complete(&ctx->ctx_done);
+}
+
+static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	ctx->flags = p->flags;
+	init_waitqueue_head(&ctx->cq_wait);
+	init_completion(&ctx->ctx_done);
+	mutex_init(&ctx->uring_lock);
+	init_waitqueue_head(&ctx->wait);
+	spin_lock_init(&ctx->completion_lock);
+	return ctx;
+}
+
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+
+	if (ctx->cached_cq_tail != READ_ONCE(ring->r.tail)) {
+		/* order cqe stores with ring update */
+		smp_wmb();
+		WRITE_ONCE(ring->r.tail, ctx->cached_cq_tail);
+		/*
+		 * Write side barrier of tail update, app has read side. See
+		 * comment at the top of this file.
+		 */
+		smp_wmb();
+
+		if (wq_has_sleeper(&ctx->cq_wait)) {
+			wake_up_interruptible(&ctx->cq_wait);
+			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+		}
+	}
+}
+
+static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	unsigned tail;
+
+	tail = ctx->cached_cq_tail;
+	/* See comment at the top of the file */
+	smp_rmb();
+	if (tail + 1 == READ_ONCE(ring->r.head))
+		return NULL;
+
+	ctx->cached_cq_tail++;
+	return &ring->cqes[tail & ctx->cq_mask];
+}
+
+static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				 long res, unsigned ev_flags)
+{
+	struct io_uring_cqe *cqe;
+
+	/*
+	 * If we can't get a cq entry, userspace overflowed the
+	 * submission (by quite a lot). Increment the overflow count in
+	 * the ring.
+	 */
+	cqe = io_get_cqring(ctx);
+	if (cqe) {
+		WRITE_ONCE(cqe->user_data, ki_user_data);
+		WRITE_ONCE(cqe->res, res);
+		WRITE_ONCE(cqe->flags, ev_flags);
+	} else {
+		unsigned overflow = READ_ONCE(ctx->cq_ring->overflow);
+
+		WRITE_ONCE(ctx->cq_ring->overflow, overflow + 1);
+	}
+}
+
+static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				long res, unsigned ev_flags)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	io_cqring_fill_event(ctx, ki_user_data, res, ev_flags);
+	io_commit_cqring(ctx);
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
+{
+	percpu_ref_put_many(&ctx->refs, refs);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	if (!percpu_ref_tryget(&ctx->refs))
+		return NULL;
+
+	req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
+	if (req) {
+		req->ctx = ctx;
+		req->flags = 0;
+		return req;
+	}
+
+	io_ring_drop_ctx_refs(ctx, 1);
+	return NULL;
+}
+
+static void io_free_req(struct io_kiocb *req)
+{
+	io_ring_drop_ctx_refs(req->ctx, 1);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void kiocb_end_write(struct kiocb *kiocb)
+{
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct inode *inode = file_inode(kiocb->ki_filp);
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		if (S_ISREG(inode->i_mode))
+			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+		file_end_write(kiocb->ki_filp);
+	}
+}
+
+static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	fput(kiocb->ki_filp);
+	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	io_free_req(req);
+}
+
+/*
+ * If we tracked the file through the SCM inflight mechanism, we could support
+ * any file. For now, just ensure that anything potentially problematic is done
+ * inline.
+ */
+static bool io_file_supports_async(struct file *file)
+{
+	umode_t mode = file_inode(file)->i_mode;
+
+	if (S_ISBLK(mode) || S_ISCHR(mode))
+		return true;
+	if (S_ISREG(mode) && file->f_op != &io_uring_fops)
+		return true;
+
+	return false;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		      bool force_nonblock)
+{
+	struct kiocb *kiocb = &req->rw;
+	unsigned ioprio;
+	int fd, ret;
+
+	/* For -EAGAIN retry, everything is already prepped */
+	if (kiocb->ki_filp)
+		return 0;
+
+	fd = READ_ONCE(sqe->fd);
+	kiocb->ki_filp = fget(fd);
+	if (unlikely(!kiocb->ki_filp))
+		return -EBADF;
+	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+		force_nonblock = false;
+	kiocb->ki_pos = READ_ONCE(sqe->off);
+	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
+	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
+		if (ret)
+			goto out_fput;
+
+		kiocb->ki_ioprio = ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		goto out_fput;
+	if (force_nonblock) {
+		kiocb->ki_flags |= IOCB_NOWAIT;
+		req->flags |= REQ_F_FORCE_NONBLOCK;
+	}
+	if (kiocb->ki_flags & IOCB_HIPRI) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	kiocb->ki_complete = io_complete_rw;
+	return 0;
+out_fput:
+	fput(kiocb->ki_filp);
+	return ret;
+}
+
+static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
+{
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/*
+		 * We can't just restart the syscall, since previously
+		 * submitted sqes may already be in progress. Just fail this
+		 * IO with EINTR.
+		 */
+		ret = -EINTR;
+		/* fall through */
+	default:
+		kiocb->ki_complete(kiocb, ret, 0);
+	}
+}
+
+static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
+			   const struct sqe_submit *s, struct iovec **iovec,
+			   struct iov_iter *iter)
+{
+	const struct io_uring_sqe *sqe = s->sqe;
+	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	size_t sqe_len = READ_ONCE(sqe->len);
+
+	if (!s->has_user)
+		return -EFAULT;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat)
+		return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
+						iovec, iter);
+#endif
+
+	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
+}
+
+static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
+		       bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->read_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	if (!ret) {
+		ssize_t ret2;
+
+		/* Catch -EAGAIN return for forced non-blocking submission */
+		ret2 = call_read_iter(file, kiocb, &iter);
+		if (!force_nonblock || ret2 != -EAGAIN)
+			io_rw_done(kiocb, ret2);
+		else
+			ret = -EAGAIN;
+	}
+	kfree(iovec);
+out_fput:
+	/* Hold on to the file for -EAGAIN */
+	if (unlikely(ret && ret != -EAGAIN))
+		fput(file);
+	return ret;
+}
+
+static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
+			bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	if (ret)
+		return ret;
+	/* Hold on to the file for -EAGAIN */
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
+		return -EAGAIN;
+
+	ret = -EBADF;
+	file = kiocb->ki_filp;
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->write_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
+				iov_iter_count(&iter));
+	if (!ret) {
+		/*
+		 * Open-code file_start_write here to grab freeze protection,
+		 * which will be released by another thread in
+		 * io_complete_rw().  Fool lockdep by telling it the lock got
+		 * released so that it doesn't complain about the held lock when
+		 * we return to userspace.
+		 */
+		if (S_ISREG(file_inode(file)->i_mode)) {
+			__sb_start_write(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE, true);
+			__sb_writers_release(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE);
+		}
+		kiocb->ki_flags |= IOCB_WRITE;
+		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+/*
+ * IORING_OP_NOP just posts a completion event, nothing else.
+ */
+static int io_nop(struct io_kiocb *req, u64 user_data)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	long err = 0;
+
+	/*
+	 * Twilight zone - it's possible that someone issued an opcode that
+	 * has a file attached, then got -EAGAIN on submission, and changed
+	 * the sqe before we retried it from async context. Avoid dropping
+	 * a file reference for this malicious case, and flag the error.
+	 */
+	if (req->rw.ki_filp) {
+		err = -EBADF;
+		fput(req->rw.ki_filp);
+	}
+	io_cqring_add_event(ctx, user_data, err, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			   const struct sqe_submit *s, bool force_nonblock)
+{
+	ssize_t ret;
+	int opcode;
+
+	if (unlikely(s->index >= ctx->sq_entries))
+		return -EINVAL;
+	req->user_data = READ_ONCE(s->sqe->user_data);
+
+	opcode = READ_ONCE(s->sqe->opcode);
+	switch (opcode) {
+	case IORING_OP_NOP:
+		ret = io_nop(req, req->user_data);
+		break;
+	case IORING_OP_READV:
+		ret = io_read(req, s, force_nonblock);
+		break;
+	case IORING_OP_WRITEV:
+		ret = io_write(req, s, force_nonblock);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static void io_sq_wq_submit_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct sqe_submit *s = &req->submit;
+	const struct io_uring_sqe *sqe = s->sqe;
+	struct io_ring_ctx *ctx = req->ctx;
+	mm_segment_t old_fs = get_fs();
+	int ret;
+
+	 /* Ensure we clear previously set forced non-block flag */
+	req->flags &= ~REQ_F_FORCE_NONBLOCK;
+	req->rw.ki_flags &= ~IOCB_NOWAIT;
+
+	if (!mmget_not_zero(ctx->sqo_mm)) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	use_mm(ctx->sqo_mm);
+	set_fs(USER_DS);
+	s->has_user = true;
+
+	ret = __io_submit_sqe(ctx, req, s, false);
+
+	set_fs(old_fs);
+	unuse_mm(ctx->sqo_mm);
+	mmput(ctx->sqo_mm);
+err:
+	if (ret) {
+		io_cqring_add_event(ctx, sqe->user_data, ret, 0);
+		io_free_req(req);
+	}
+
+	/* async context always uses a copy of the sqe */
+	kfree(sqe);
+}
+
+static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_kiocb *req;
+	ssize_t ret;
+
+	/* enforce forwards compatibility on users */
+	if (unlikely(s->sqe->flags))
+		return -EINVAL;
+
+	req = io_get_req(ctx);
+	if (unlikely(!req))
+		return -EAGAIN;
+
+	req->rw.ki_filp = NULL;
+
+	ret = __io_submit_sqe(ctx, req, s, true);
+	if (ret == -EAGAIN) {
+		struct io_uring_sqe *sqe_copy;
+
+		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
+		if (sqe_copy) {
+			memcpy(sqe_copy, s->sqe, sizeof(*sqe_copy));
+			s->sqe = sqe_copy;
+
+			memcpy(&req->submit, s, sizeof(*s));
+			INIT_WORK(&req->work, io_sq_wq_submit_work);
+			queue_work(ctx->sqo_wq, &req->work);
+			ret = 0;
+		}
+	}
+	if (ret)
+		io_free_req(req);
+
+	return ret;
+}
+
+static void io_commit_sqring(struct io_ring_ctx *ctx)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+
+	if (ctx->cached_sq_head != READ_ONCE(ring->r.head)) {
+		WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
+		/*
+		 * write side barrier of head update, app has read side. See
+		 * comment at the top of this file
+		 */
+		smp_wmb();
+	}
+}
+
+/*
+ * Undo last io_get_sqring()
+ */
+static void io_drop_sqring(struct io_ring_ctx *ctx)
+{
+	ctx->cached_sq_head--;
+}
+
+/*
+ * Fetch an sqe, if one is available. Note that s->sqe will point to memory
+ * that is mapped by userspace. This means that care needs to be taken to
+ * ensure that reads are stable, as we cannot rely on userspace always
+ * being a good citizen. If members of the sqe are validated and then later
+ * used, it's important that those reads are done through READ_ONCE() to
+ * prevent a re-load down the line.
+ */
+static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+	unsigned head;
+
+	/*
+	 * The cached sq head (or cq tail) serves two purposes:
+	 *
+	 * 1) allows us to batch the cost of updating the user visible
+	 *    head.
+	 * 2) allows the kernel side to track the head on its own, even
+	 *    though the application is the one updating it.
+	 */
+	head = ctx->cached_sq_head;
+	/* See comment at the top of this file */
+	smp_rmb();
+	if (head == READ_ONCE(ring->r.tail))
+		return false;
+
+	head = READ_ONCE(ring->array[head & ctx->sq_mask]);
+	if (head < ctx->sq_entries) {
+		s->index = head;
+		s->sqe = &ctx->sq_sqes[head];
+		ctx->cached_sq_head++;
+		return true;
+	}
+
+	/* drop invalid entries */
+	ctx->cached_sq_head++;
+	ring->dropped++;
+	/* See comment at the top of this file */
+	smp_wmb();
+	return false;
+}
+
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+{
+	int i, ret = 0, submit = 0;
+	struct blk_plug plug;
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
+
+	for (i = 0; i < to_submit; i++) {
+		struct sqe_submit s;
+
+		if (!io_get_sqring(ctx, &s))
+			break;
+
+		s.has_user = true;
+		ret = io_submit_sqe(ctx, &s);
+		if (ret) {
+			io_drop_sqring(ctx);
+			break;
+		}
+
+		submit++;
+	}
+	io_commit_sqring(ctx);
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
+
+	return submit ? submit : ret;
+}
+
+/*
+ * Wait until events become available, if we don't already have some. The
+ * application must reap them itself, as they reside on the shared cq ring.
+ */
+static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
+			  const sigset_t __user *sig, size_t sigsz)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	sigset_t ksigmask, sigsaved;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	/* See comment at the top of this file */
+	smp_rmb();
+	if (READ_ONCE(ring->r.head) != READ_ONCE(ring->r.tail))
+		return 0;
+	if (!min_events)
+		return 0;
+
+	if (sig) {
+		ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz);
+		if (ret)
+			return ret;
+	}
+
+	do {
+		prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
+
+		ret = 0;
+		/* See comment at the top of this file */
+		smp_rmb();
+		if (READ_ONCE(ring->r.head) != READ_ONCE(ring->r.tail))
+			break;
+
+		schedule();
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+	} while (1);
+
+	finish_wait(&ctx->wait, &wait);
+
+	if (sig)
+		restore_user_sigmask(sig, &sigsaved);
+
+	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
+}
+
+static int io_sq_offload_start(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	mmgrab(current->mm);
+	ctx->sqo_mm = current->mm;
+
+	/* Do QD, or 2 * CPUS, whichever is smaller */
+	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
+	if (!ctx->sqo_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+err:
+	mmdrop(ctx->sqo_mm);
+	ctx->sqo_mm = NULL;
+	return ret;
+}
+
+static void io_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static int io_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+					new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void io_mem_free(void *ptr)
+{
+	struct page *page = virt_to_head_page(ptr);
+
+	if (put_page_testzero(page))
+		free_compound_page(page);
+}
+
+static void *io_mem_alloc(size_t size)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
+				__GFP_NORETRY;
+
+	return (void *) __get_free_pages(gfp_flags, get_order(size));
+}
+
+static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t bytes;
+
+	bytes = struct_size(sq_ring, array, sq_entries);
+	bytes += array_size(sizeof(struct io_uring_sqe), sq_entries);
+	bytes += struct_size(cq_ring, cqes, cq_entries);
+
+	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+}
+
+static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+{
+	if (ctx->sqo_wq)
+		destroy_workqueue(ctx->sqo_wq);
+	if (ctx->sqo_mm)
+		mmdrop(ctx->sqo_mm);
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock)
+		sock_release(ctx->ring_sock);
+#endif
+
+	io_mem_free(ctx->sq_ring);
+	io_mem_free(ctx->sq_sqes);
+	io_mem_free(ctx->cq_ring);
+
+	percpu_ref_exit(&ctx->refs);
+	if (ctx->account_mem)
+		io_unaccount_mem(ctx->user,
+				ring_pages(ctx->sq_entries, ctx->cq_entries));
+	free_uid(ctx->user);
+	kfree(ctx);
+}
+
+static __poll_t io_uring_poll(struct file *file, poll_table *wait)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ctx->cq_wait, wait);
+	/* See comment at the top of this file */
+	smp_rmb();
+	if (READ_ONCE(ctx->sq_ring->r.tail) + 1 != ctx->cached_sq_head)
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	if (READ_ONCE(ctx->cq_ring->r.head) != ctx->cached_cq_tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int io_uring_fasync(int fd, struct file *file, int on)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	return fasync_helper(fd, file, on, &ctx->cq_fasync);
+}
+
+static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	percpu_ref_kill(&ctx->refs);
+	mutex_unlock(&ctx->uring_lock);
+
+	wait_for_completion(&ctx->ctx_done);
+	io_ring_ctx_free(ctx);
+}
+
+static int io_uring_release(struct inode *inode, struct file *file)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	file->private_data = NULL;
+	io_ring_ctx_wait_and_kill(ctx);
+	return 0;
+}
+
+static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long sz = vma->vm_end - vma->vm_start;
+	struct io_ring_ctx *ctx = file->private_data;
+	unsigned long pfn;
+	struct page *page;
+	void *ptr;
+
+	switch (offset) {
+	case IORING_OFF_SQ_RING:
+		ptr = ctx->sq_ring;
+		break;
+	case IORING_OFF_SQES:
+		ptr = ctx->sq_sqes;
+		break;
+	case IORING_OFF_CQ_RING:
+		ptr = ctx->cq_ring;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > (PAGE_SIZE << compound_order(page)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
+		u32, min_complete, u32, flags, const sigset_t __user *, sig,
+		size_t, sigsz)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	int submitted = 0;
+	struct fd f;
+
+	if (flags & ~IORING_ENTER_GETEVENTS)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	if (to_submit) {
+		to_submit = min(to_submit, ctx->sq_entries);
+
+		mutex_lock(&ctx->uring_lock);
+		submitted = io_ring_submit(ctx, to_submit);
+		mutex_unlock(&ctx->uring_lock);
+
+		if (submitted < 0)
+			goto out_ctx;
+	}
+	if (flags & IORING_ENTER_GETEVENTS) {
+		/*
+		 * The application could have included the 'to_submit' count
+		 * in how many events it wanted to wait for. If we failed to
+		 * submit the desired count, we may need to adjust the number
+		 * of events to poll/wait for.
+		 */
+		if (submitted < to_submit)
+			min_complete = min_t(unsigned, submitted, min_complete);
+
+		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+	}
+
+out_ctx:
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return submitted ? submitted : ret;
+}
+
+static const struct file_operations io_uring_fops = {
+	.release	= io_uring_release,
+	.mmap		= io_uring_mmap,
+	.poll		= io_uring_poll,
+	.fasync		= io_uring_fasync,
+};
+
+static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+				  struct io_uring_params *p)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t size;
+
+	sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries));
+	if (!sq_ring)
+		return -ENOMEM;
+
+	ctx->sq_ring = sq_ring;
+	sq_ring->ring_mask = p->sq_entries - 1;
+	sq_ring->ring_entries = p->sq_entries;
+	ctx->sq_mask = sq_ring->ring_mask;
+	ctx->sq_entries = sq_ring->ring_entries;
+
+	size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+	if (size == SIZE_MAX)
+		return -EOVERFLOW;
+
+	ctx->sq_sqes = io_mem_alloc(size);
+	if (!ctx->sq_sqes) {
+		io_mem_free(ctx->sq_ring);
+		return -ENOMEM;
+	}
+
+	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
+	if (!cq_ring) {
+		io_mem_free(ctx->sq_ring);
+		io_mem_free(ctx->sq_sqes);
+		return -ENOMEM;
+	}
+
+	ctx->cq_ring = cq_ring;
+	cq_ring->ring_mask = p->cq_entries - 1;
+	cq_ring->ring_entries = p->cq_entries;
+	ctx->cq_mask = cq_ring->ring_mask;
+	ctx->cq_entries = cq_ring->ring_entries;
+	return 0;
+}
+
+/*
+ * Allocate an anonymous fd; this is what constitutes the application
+ * visible backing of an io_uring instance. The application mmaps this
+ * fd to gain access to the SQ/CQ ring details. If UNIX sockets are enabled,
+ * we have to tie this fd to a socket for file garbage collection purposes.
+ */
+static int io_uring_get_fd(struct io_ring_ctx *ctx)
+{
+	struct file *file;
+	int ret;
+
+#if defined(CONFIG_UNIX)
+	ret = sock_create_kern(&init_net, PF_UNIX, SOCK_RAW, IPPROTO_IP,
+				&ctx->ring_sock);
+	if (ret)
+		return ret;
+#endif
+
+	ret = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto err;
+
+	file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx,
+					O_RDWR | O_CLOEXEC);
+	if (IS_ERR(file)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(file);
+		goto err;
+	}
+
+#if defined(CONFIG_UNIX)
+	ctx->ring_sock->file = file;
+#endif
+	fd_install(ret, file);
+	return ret;
+err:
+#if defined(CONFIG_UNIX)
+	sock_release(ctx->ring_sock);
+	ctx->ring_sock = NULL;
+#endif
+	return ret;
+}
+
+static int io_uring_create(unsigned entries, struct io_uring_params *p)
+{
+	struct user_struct *user = NULL;
+	struct io_ring_ctx *ctx;
+	bool account_mem;
+	int ret;
+
+	if (!entries || entries > IORING_MAX_ENTRIES)
+		return -EINVAL;
+
+	/*
+	 * Use twice as many entries for the CQ ring. It's possible for the
+	 * application to drive a higher depth than the size of the SQ ring,
+	 * since the sqes are only used at submission time. This allows for
+	 * some flexibility in overcommitting a bit.
+	 */
+	p->sq_entries = roundup_pow_of_two(entries);
+	p->cq_entries = 2 * p->sq_entries;
+
+	user = get_uid(current_user());
+	account_mem = !capable(CAP_IPC_LOCK);
+
+	if (account_mem) {
+		ret = io_account_mem(user,
+				ring_pages(p->sq_entries, p->cq_entries));
+		if (ret) {
+			free_uid(user);
+			return ret;
+		}
+	}
+
+	ctx = io_ring_ctx_alloc(p);
+	if (!ctx) {
+		if (account_mem)
+			io_unaccount_mem(user, ring_pages(p->sq_entries,
+								p->cq_entries));
+		free_uid(user);
+		return -ENOMEM;
+	}
+	ctx->compat = in_compat_syscall();
+	ctx->account_mem = account_mem;
+	ctx->user = user;
+
+	ret = io_allocate_scq_urings(ctx, p);
+	if (ret)
+		goto err;
+
+	ret = io_sq_offload_start(ctx);
+	if (ret)
+		goto err;
+
+	ret = io_uring_get_fd(ctx);
+	if (ret < 0)
+		goto err;
+
+	memset(&p->sq_off, 0, sizeof(p->sq_off));
+	p->sq_off.head = offsetof(struct io_sq_ring, r.head);
+	p->sq_off.tail = offsetof(struct io_sq_ring, r.tail);
+	p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask);
+	p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries);
+	p->sq_off.flags = offsetof(struct io_sq_ring, flags);
+	p->sq_off.dropped = offsetof(struct io_sq_ring, dropped);
+	p->sq_off.array = offsetof(struct io_sq_ring, array);
+
+	memset(&p->cq_off, 0, sizeof(p->cq_off));
+	p->cq_off.head = offsetof(struct io_cq_ring, r.head);
+	p->cq_off.tail = offsetof(struct io_cq_ring, r.tail);
+	p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask);
+	p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries);
+	p->cq_off.overflow = offsetof(struct io_cq_ring, overflow);
+	p->cq_off.cqes = offsetof(struct io_cq_ring, cqes);
+	return ret;
+err:
+	io_ring_ctx_wait_and_kill(ctx);
+	return ret;
+}
+
+/*
+ * Sets up an aio uring context, and returns the fd. The application asks for a
+ * ring size; we return the actual sq/cq ring sizes (among other things) in the
+ * params structure passed in.
+ */
+static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
+{
+	struct io_uring_params p;
+	long ret;
+	int i;
+
+	if (copy_from_user(&p, params, sizeof(p)))
+		return -EFAULT;
+	for (i = 0; i < ARRAY_SIZE(p.resv); i++) {
+		if (p.resv[i])
+			return -EINVAL;
+	}
+
+	if (p.flags)
+		return -EINVAL;
+
+	ret = io_uring_create(entries, &p);
+	if (ret < 0)
+		return ret;
+
+	if (copy_to_user(params, &p, sizeof(p)))
+		return -EFAULT;
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(io_uring_setup, u32, entries,
+		struct io_uring_params __user *, params)
+{
+	return io_uring_setup(entries, params);
+}
+
+static int __init io_uring_init(void)
+{
+	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
+	return 0;
+};
+__initcall(io_uring_init);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dedcc2e9265c..61aa210f0c2b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3517,4 +3517,13 @@ extern void inode_nohighmem(struct inode *inode);
 extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len,
 		       int advice);
 
+#if defined(CONFIG_IO_URING)
+extern struct sock *io_uring_get_socket(struct file *file);
+#else
+static inline struct sock *io_uring_get_socket(struct file *file)
+{
+	return NULL;
+}
+#endif
+
 #endif /* _LINUX_FS_H */
diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c09c58..c7b5f86b91a1 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -40,7 +40,7 @@ struct user_struct {
 	kuid_t uid;
 
 #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \
-    defined(CONFIG_NET)
+    defined(CONFIG_NET) || defined(CONFIG_IO_URING)
 	atomic_long_t locked_vm;
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 257cccba3062..3072dbaa7869 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -69,6 +69,7 @@ struct file_handle;
 struct sigaltstack;
 struct rseq;
 union bpf_attr;
+struct io_uring_params;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -309,6 +310,11 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id,
 				struct io_event __user *events,
 				struct old_timespec32 __user *timeout,
 				const struct __aio_sigset *sig);
+asmlinkage long sys_io_uring_setup(u32 entries,
+				struct io_uring_params __user *p);
+asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
+				u32 min_complete, u32 flags,
+				const sigset_t __user *sig, size_t sigsz);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index d90127298f12..87871e7b7ea7 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -740,9 +740,13 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents)
 __SYSCALL(__NR_rseq, sys_rseq)
 #define __NR_kexec_file_load 294
 __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
+#define __NR_io_uring_setup 425
+__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
+#define __NR_io_uring_enter 426
+__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
 
 #undef __NR_syscalls
-#define __NR_syscalls 295
+#define __NR_syscalls 427
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
new file mode 100644
index 000000000000..ac692823d6f4
--- /dev/null
+++ b/include/uapi/linux/io_uring.h
@@ -0,0 +1,95 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Header file for the io_uring interface.
+ *
+ * Copyright (C) 2019 Jens Axboe
+ * Copyright (C) 2019 Christoph Hellwig
+ */
+#ifndef LINUX_IO_URING_H
+#define LINUX_IO_URING_H
+
+#include <linux/fs.h>
+#include <linux/types.h>
+
+/*
+ * IO submission data structure (Submission Queue Entry)
+ */
+struct io_uring_sqe {
+	__u8	opcode;		/* type of operation for this sqe */
+	__u8	flags;		/* as of now unused */
+	__u16	ioprio;		/* ioprio for the request */
+	__s32	fd;		/* file descriptor to do IO on */
+	__u64	off;		/* offset into file */
+	__u64	addr;		/* pointer to buffer or iovecs */
+	__u32	len;		/* buffer size or number of iovecs */
+	union {
+		__kernel_rwf_t	rw_flags;
+		__u32		__resv;
+	};
+	__u64	user_data;	/* data to be passed back at completion time */
+	__u64	__pad2[3];
+};
+
+#define IORING_OP_NOP		0
+#define IORING_OP_READV		1
+#define IORING_OP_WRITEV	2
+
+/*
+ * IO completion data structure (Completion Queue Entry)
+ */
+struct io_uring_cqe {
+	__u64	user_data;	/* sqe->data submission passed back */
+	__s32	res;		/* result code for this event */
+	__u32	flags;
+};
+
+/*
+ * Magic offsets for the application to mmap the data it needs
+ */
+#define IORING_OFF_SQ_RING		0ULL
+#define IORING_OFF_CQ_RING		0x8000000ULL
+#define IORING_OFF_SQES			0x10000000ULL
+
+/*
+ * Filled with the offset for mmap(2)
+ */
+struct io_sqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 flags;
+	__u32 dropped;
+	__u32 array;
+	__u32 resv1;
+	__u64 resv2;
+};
+
+struct io_cqring_offsets {
+	__u32 head;
+	__u32 tail;
+	__u32 ring_mask;
+	__u32 ring_entries;
+	__u32 overflow;
+	__u32 cqes;
+	__u64 resv[2];
+};
+
+/*
+ * io_uring_enter(2) flags
+ */
+#define IORING_ENTER_GETEVENTS	(1U << 0)
+
+/*
+ * Passed in for io_uring_setup(2). Copied back with updated info on success
+ */
+struct io_uring_params {
+	__u32 sq_entries;
+	__u32 cq_entries;
+	__u32 flags;
+	__u32 resv[7];
+	struct io_sqring_offsets sq_off;
+	struct io_cqring_offsets cq_off;
+};
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index c9386a365eea..53b54214a36e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1414,6 +1414,15 @@ config AIO
 	  by some high performance threaded applications. Disabling
 	  this option saves about 7k.
 
+config IO_URING
+	bool "Enable IO uring support" if EXPERT
+	select ANON_INODES
+	default y
+	help
+	  This option enables support for the io_uring interface, enabling
+	  applications to submit and complete IO through submission and
+	  completion rings that are shared between the kernel and application.
+
 config ADVISE_SYSCALLS
 	bool "Enable madvise/fadvise syscalls" if EXPERT
 	default y
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ab9d0e3c6d50..ee5e523564bb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents);
 COND_SYSCALL(io_pgetevents);
 COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
+COND_SYSCALL(io_uring_setup);
+COND_SYSCALL(io_uring_enter);
 
 /* fs/xattr.c */
 
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index c36757e72844..f81854d74c7d 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -108,6 +108,9 @@ struct sock *unix_get_socket(struct file *filp)
 		/* PF_UNIX ? */
 		if (s && sock->ops && sock->ops->family == PF_UNIX)
 			u_sock = s;
+	} else {
+		/* Could be an io_uring instance */
+		u_sock = io_uring_get_socket(filp);
 	}
 	return u_sock;
 }
-- 
2.17.1
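
For reference, a minimal userspace sketch of driving this setup path: create a
ring, then mmap the three regions at the IORING_OFF_* offsets and resolve the
field pointers through the sq_off/cq_off blocks filled in above. The syscall
number (425) matches the table changes in this patch; app_ring and
app_setup_ring() are hypothetical names for illustration, not part of the
series or liburing.

#include <linux/io_uring.h>	/* UAPI header added by this patch */
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

struct app_ring {
	int ring_fd;
	unsigned *sq_head, *sq_tail, *sq_mask, *sq_array;
	struct io_uring_sqe *sqes;
	unsigned *cq_head, *cq_tail, *cq_mask;
	struct io_uring_cqe *cqes;
};

static int app_setup_ring(unsigned entries, struct app_ring *r)
{
	struct io_uring_params p;
	unsigned char *sq, *cq;
	int fd;

	memset(&p, 0, sizeof(p));
	fd = syscall(425 /* __NR_io_uring_setup */, entries, &p);
	if (fd < 0)
		return -1;

	/* SQ ring: offsets of the individual fields come back in p.sq_off */
	sq = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(__u32),
		  PROT_READ | PROT_WRITE, MAP_SHARED, fd, IORING_OFF_SQ_RING);
	/* the sqe array is a separate mapping */
	r->sqes = mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
		       PROT_READ | PROT_WRITE, MAP_SHARED, fd, IORING_OFF_SQES);
	/* CQ ring, with the cqe array starting at p.cq_off.cqes */
	cq = mmap(NULL, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
		  PROT_READ | PROT_WRITE, MAP_SHARED, fd, IORING_OFF_CQ_RING);
	/* (error checking of the mmap calls elided for brevity) */

	r->ring_fd = fd;
	r->sq_head  = (unsigned *)(sq + p.sq_off.head);
	r->sq_tail  = (unsigned *)(sq + p.sq_off.tail);
	r->sq_mask  = (unsigned *)(sq + p.sq_off.ring_mask);
	r->sq_array = (unsigned *)(sq + p.sq_off.array);
	r->cq_head  = (unsigned *)(cq + p.cq_off.head);
	r->cq_tail  = (unsigned *)(cq + p.cq_off.tail);
	r->cq_mask  = (unsigned *)(cq + p.cq_off.ring_mask);
	r->cqes     = (struct io_uring_cqe *)(cq + p.cq_off.cqes);
	return 0;
}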


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 06/19] io_uring: add fsync support
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

Add a new fsync opcode, which either syncs a range if one is passed,
or the whole file if the offset and length fields are both cleared
to zero.  A flag is provided to use fdatasync semantics, that is, only
force out the metadata which is required to retrieve the file data, but
not other metadata.
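
For illustration, this is roughly how an application would fill an sqe for the
new opcode, using the UAPI fields added below; prep_fsync() is a hypothetical
helper, not part of the patch or liburing.

#include <linux/io_uring.h>
#include <string.h>

static void prep_fsync(struct io_uring_sqe *sqe, int fd, int datasync)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FSYNC;
	sqe->fd = fd;
	/* off and len left at zero: sync the whole file */
	if (datasync)
		sqe->fsync_flags = IORING_FSYNC_DATASYNC;
	sqe->user_data = (__u64)(unsigned long)sqe;	/* echoed back in the cqe */
}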

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 54 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  8 +++++-
 2 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1b28d38a9b76..dc9155b7294e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -24,6 +24,7 @@
  * data that the application could potentially modify, it remains stable.
  *
  * Copyright (C) 2018-2019 Jens Axboe
+ * Copyright (c) 2018-2019 Christoph Hellwig
  */
 #include <linux/kernel.h>
 #include <linux/init.h>
@@ -557,6 +558,56 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	return 0;
 }
 
+static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	int fd;
+
+	/* Prep already done */
+	if (req->rw.ki_filp)
+		return 0;
+
+	if (unlikely(sqe->addr || sqe->ioprio))
+		return -EINVAL;
+
+	fd = READ_ONCE(sqe->fd);
+	req->rw.ki_filp = fget(fd);
+	if (unlikely(!req->rw.ki_filp))
+		return -EBADF;
+
+	return 0;
+}
+
+static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		    bool force_nonblock)
+{
+	loff_t sqe_off = READ_ONCE(sqe->off);
+	loff_t sqe_len = READ_ONCE(sqe->len);
+	loff_t end = sqe_off + sqe_len;
+	unsigned fsync_flags;
+	int ret;
+
+	fsync_flags = READ_ONCE(sqe->fsync_flags);
+	if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC))
+		return -EINVAL;
+
+	ret = io_prep_fsync(req, sqe);
+	if (ret)
+		return ret;
+
+	/* fsync always requires a blocking context */
+	if (force_nonblock)
+		return -EAGAIN;
+
+	ret = vfs_fsync_range(req->rw.ki_filp, sqe_off,
+				end > 0 ? end : LLONG_MAX,
+				fsync_flags & IORING_FSYNC_DATASYNC);
+
+	fput(req->rw.ki_filp);
+	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
+	io_free_req(req);
+	return 0;
+}
+
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   const struct sqe_submit *s, bool force_nonblock)
 {
@@ -578,6 +629,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_WRITEV:
 		ret = io_write(req, s, force_nonblock);
 		break;
+	case IORING_OP_FSYNC:
+		ret = io_fsync(req, s->sqe, force_nonblock);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ac692823d6f4..4589d56d0b68 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -24,7 +24,7 @@ struct io_uring_sqe {
 	__u32	len;		/* buffer size or number of iovecs */
 	union {
 		__kernel_rwf_t	rw_flags;
-		__u32		__resv;
+		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	__u64	__pad2[3];
@@ -33,6 +33,12 @@ struct io_uring_sqe {
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
+#define IORING_OP_FSYNC		3
+
+/*
+ * sqe->fsync_flags
+ */
+#define IORING_FSYNC_DATASYNC	(1U << 0)
 
 /*
  * IO completion data structure (Completion Queue Entry)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 07/19] io_uring: support for IO polling
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Add support for a polled io_uring instance. When a read or write is
submitted to a polled io_uring, the application must poll for
completions on the CQ ring through io_uring_enter(2). Polled IO may not
generate IRQ completions, hence those completions need to be actively
found by the application itself.

To use polling, io_uring_setup() must be used with the
IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match
polled and non-polled IO on an io_uring.
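
Roughly, the application side looks like the sketch below, assuming the flag
and syscall numbers from this series (the two helpers are hypothetical names):
the ring is created with IORING_SETUP_IOPOLL, and completions are then reaped
by calling io_uring_enter(2) with IORING_ENTER_GETEVENTS, which drives the
kernel polling loop rather than sleeping on the CQ ring.

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int setup_polled_ring(unsigned entries, struct io_uring_params *p)
{
	memset(p, 0, sizeof(*p));
	p->flags = IORING_SETUP_IOPOLL;		/* all IO on this ring is polled */
	return syscall(425 /* __NR_io_uring_setup */, entries, p);
}

static int reap_polled_completions(int ring_fd, unsigned min_complete)
{
	/* no IRQ will post these completions; we have to go find them */
	return syscall(426 /* __NR_io_uring_enter */, ring_fd, 0, min_complete,
		       IORING_ENTER_GETEVENTS, NULL, 0);
}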

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 275 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/io_uring.h |   5 +
 2 files changed, 271 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index dc9155b7294e..30efe5edf6aa 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -124,6 +124,14 @@ struct io_ring_ctx {
 
 	struct {
 		spinlock_t		completion_lock;
+		bool			poll_multi_file;
+		/*
+		 * ->poll_list is protected by the ctx->uring_lock for
+		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
+		 * For SQPOLL, only the single threaded io_sq_thread() will
+		 * manipulate the list, hence no extra locking is needed there.
+		 */
+		struct list_head	poll_list;
 	} ____cacheline_aligned_in_smp;
 
 #if defined(CONFIG_UNIX)
@@ -135,6 +143,7 @@ struct sqe_submit {
 	const struct io_uring_sqe	*sqe;
 	unsigned short			index;
 	bool				has_user;
+	bool				needs_lock;
 };
 
 struct io_kiocb {
@@ -146,12 +155,15 @@ struct io_kiocb {
 	struct list_head	list;
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+#define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 	u64			user_data;
+	u64			error;
 
 	struct work_struct	work;
 };
 
 #define IO_PLUG_THRESHOLD		2
+#define IO_IOPOLL_BATCH			8
 
 static struct kmem_cache *req_cachep;
 
@@ -196,6 +208,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
+	INIT_LIST_HEAD(&ctx->poll_list);
 	return ctx;
 }
 
@@ -297,12 +310,153 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
 	return NULL;
 }
 
+static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
+{
+	if (*nr) {
+		kmem_cache_free_bulk(req_cachep, *nr, reqs);
+		io_ring_drop_ctx_refs(ctx, *nr);
+		*nr = 0;
+	}
+}
+
 static void io_free_req(struct io_kiocb *req)
 {
 	io_ring_drop_ctx_refs(req->ctx, 1);
 	kmem_cache_free(req_cachep, req);
 }
 
+/*
+ * Find and free completed poll iocbs
+ */
+static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
+			       struct list_head *done)
+{
+	void *reqs[IO_IOPOLL_BATCH];
+	struct io_kiocb *req;
+	int to_free = 0;
+
+	while (!list_empty(done)) {
+		req = list_first_entry(done, struct io_kiocb, list);
+		list_del(&req->list);
+
+		io_cqring_fill_event(ctx, req->user_data, req->error, 0);
+
+		reqs[to_free++] = req;
+		(*nr_events)++;
+
+		fput(req->rw.ki_filp);
+		if (to_free == ARRAY_SIZE(reqs))
+			io_free_req_many(ctx, reqs, &to_free);
+	}
+	io_commit_cqring(ctx);
+
+	io_free_req_many(ctx, reqs, &to_free);
+}
+
+static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events,
+			long min)
+{
+	struct io_kiocb *req, *tmp;
+	LIST_HEAD(done);
+	bool spin;
+	int ret;
+
+	/*
+	 * Only spin for completions if we don't have multiple devices hanging
+	 * off our complete list, and we're under the requested amount.
+	 */
+	spin = !ctx->poll_multi_file && *nr_events < min;
+
+	ret = 0;
+	list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) {
+		struct kiocb *kiocb = &req->rw;
+
+		/*
+		 * Move completed entries to our local list. If we find a
+		 * request that requires polling, break out and complete
+		 * the done list first, if we have entries there.
+		 */
+		if (req->flags & REQ_F_IOPOLL_COMPLETED) {
+			list_move_tail(&req->list, &done);
+			continue;
+		}
+		if (!list_empty(&done))
+			break;
+
+		ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin);
+		if (ret < 0)
+			break;
+
+		if (ret && spin)
+			spin = false;
+		ret = 0;
+	}
+
+	if (!list_empty(&done))
+		io_iopoll_complete(ctx, nr_events, &done);
+
+	return ret;
+}
+
+/*
+ * Poll for a minimum of 'min' events. Note that if min == 0 we consider that a
+ * non-spinning poll check - we'll still enter the driver poll loop, but only
+ * as a non-spinning completion check.
+ */
+static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events,
+				long min)
+{
+	while (!list_empty(&ctx->poll_list)) {
+		int ret;
+
+		ret = io_do_iopoll(ctx, nr_events, min);
+		if (ret < 0)
+			return ret;
+		if (!min || *nr_events >= min)
+			return 0;
+	}
+
+	return 1;
+}
+
+/*
+ * We can't just wait for polled events to come to us, we have to actively
+ * find and complete them.
+ */
+static void io_iopoll_reap_events(struct io_ring_ctx *ctx)
+{
+	if (!(ctx->flags & IORING_SETUP_IOPOLL))
+		return;
+
+	mutex_lock(&ctx->uring_lock);
+	while (!list_empty(&ctx->poll_list)) {
+		unsigned int nr_events = 0;
+
+		io_iopoll_getevents(ctx, &nr_events, 1);
+	}
+	mutex_unlock(&ctx->uring_lock);
+}
+
+static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
+			   long min)
+{
+	int ret = 0;
+
+	do {
+		int tmin = 0;
+
+		if (*nr_events < min)
+			tmin = min - *nr_events;
+
+		ret = io_iopoll_getevents(ctx, nr_events, tmin);
+		if (ret <= 0)
+			break;
+		ret = 0;
+	} while (min && !*nr_events && !need_resched());
+
+	return ret;
+}
+
 static void kiocb_end_write(struct kiocb *kiocb)
 {
 	if (kiocb->ki_flags & IOCB_WRITE) {
@@ -329,6 +483,53 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 	io_free_req(req);
 }
 
+static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	req->error = res;
+	if (res != -EAGAIN)
+		req->flags |= REQ_F_IOPOLL_COMPLETED;
+}
+
+/*
+ * After the iocb has been issued, it's safe to be found on the poll list.
+ * Adding the kiocb to the list AFTER submission ensures that we don't
+ * find it from an io_iopoll_getevents() thread before the issuer is done
+ * accessing the kiocb cookie.
+ */
+static void io_iopoll_req_issued(struct io_kiocb *req)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	/*
+	 * Track whether we have multiple files in our lists. This will impact
+	 * how we do polling eventually, not spinning if we're on potentially
+	 * different devices.
+	 */
+	if (list_empty(&ctx->poll_list)) {
+		ctx->poll_multi_file = false;
+	} else if (!ctx->poll_multi_file) {
+		struct io_kiocb *list_req;
+
+		list_req = list_first_entry(&ctx->poll_list, struct io_kiocb,
+						list);
+		if (list_req->rw.ki_filp != req->rw.ki_filp)
+			ctx->poll_multi_file = true;
+	}
+
+	/*
+	 * For fast devices, IO may have already completed. If it has, add
+	 * it to the front so we find it first.
+	 */
+	if (req->flags & REQ_F_IOPOLL_COMPLETED)
+		list_add(&req->list, &ctx->poll_list);
+	else
+		list_add_tail(&req->list, &ctx->poll_list);
+}
+
 /*
  * If we tracked the file through the SCM inflight mechanism, we could support
  * any file. For now, just ensure that anything potentially problematic is done
@@ -349,6 +550,7 @@ static bool io_file_supports_async(struct file *file)
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      bool force_nonblock)
 {
+	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
 	unsigned ioprio;
 	int fd, ret;
@@ -384,12 +586,22 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		kiocb->ki_flags |= IOCB_NOWAIT;
 		req->flags |= REQ_F_FORCE_NONBLOCK;
 	}
-	if (kiocb->ki_flags & IOCB_HIPRI) {
-		ret = -EINVAL;
-		goto out_fput;
-	}
+	if (ctx->flags & IORING_SETUP_IOPOLL) {
+		ret = -EOPNOTSUPP;
+		if (!(kiocb->ki_flags & IOCB_DIRECT) ||
+		    !kiocb->ki_filp->f_op->iopoll)
+			goto out_fput;
 
-	kiocb->ki_complete = io_complete_rw;
+		req->error = 0;
+		kiocb->ki_flags |= IOCB_HIPRI;
+		kiocb->ki_complete = io_complete_rw_iopoll;
+	} else {
+		if (kiocb->ki_flags & IOCB_HIPRI) {
+			ret = -EINVAL;
+			goto out_fput;
+		}
+		kiocb->ki_complete = io_complete_rw;
+	}
 	return 0;
 out_fput:
 	fput(kiocb->ki_filp);
@@ -543,6 +755,9 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	struct io_ring_ctx *ctx = req->ctx;
 	long err = 0;
 
+	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+
 	/*
 	 * Twilight zone - it's possible that someone issued an opcode that
 	 * has a file attached, then got -EAGAIN on submission, and changed
@@ -566,6 +781,8 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->rw.ki_filp)
 		return 0;
 
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio))
 		return -EINVAL;
 
@@ -637,7 +854,22 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		break;
 	}
 
-	return ret;
+	if (ret)
+		return ret;
+
+	if (ctx->flags & IORING_SETUP_IOPOLL) {
+		if (req->error == -EAGAIN)
+			return -EAGAIN;
+
+		/* workqueue context doesn't hold uring_lock, grab it now */
+		if (s->needs_lock)
+			mutex_lock(&ctx->uring_lock);
+		io_iopoll_req_issued(req);
+		if (s->needs_lock)
+			mutex_unlock(&ctx->uring_lock);
+	}
+
+	return 0;
 }
 
 static void io_sq_wq_submit_work(struct work_struct *work)
@@ -661,8 +893,19 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	use_mm(ctx->sqo_mm);
 	set_fs(USER_DS);
 	s->has_user = true;
+	s->needs_lock = true;
 
-	ret = __io_submit_sqe(ctx, req, s, false);
+	do {
+		ret = __io_submit_sqe(ctx, req, s, false);
+		/*
+		 * We can get EAGAIN for polled IO even though we're forcing
+		 * a sync submission from here, since we can't wait for
+		 * request slots on the block side.
+		 */
+		if (ret != -EAGAIN)
+			break;
+		cond_resched();
+	} while (1);
 
 	set_fs(old_fs);
 	unuse_mm(ctx->sqo_mm);
@@ -793,6 +1036,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 			break;
 
 		s.has_user = true;
+		s.needs_lock = false;
+
 		ret = io_submit_sqe(ctx, &s);
 		if (ret) {
 			io_drop_sqring(ctx);
@@ -938,6 +1183,9 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
 		mmdrop(ctx->sqo_mm);
+
+	io_iopoll_reap_events(ctx);
+
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
 		sock_release(ctx->ring_sock);
@@ -984,6 +1232,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);
 
+	io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
 	io_ring_ctx_free(ctx);
 }
@@ -1064,6 +1313,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			goto out_ctx;
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
+		unsigned nr_events = 0;
+
 		/*
 		 * The application could have included the 'to_submit' count
 		 * in how many events it wanted to wait for. If we failed to
@@ -1073,7 +1324,13 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		if (submitted < to_submit)
 			min_complete = min_t(unsigned, submitted, min_complete);
 
-		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+		if (ctx->flags & IORING_SETUP_IOPOLL) {
+			mutex_lock(&ctx->uring_lock);
+			ret = io_iopoll_check(ctx, &nr_events, min_complete);
+			mutex_unlock(&ctx->uring_lock);
+		} else {
+			ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+		}
 	}
 
 out_ctx:
@@ -1270,7 +1527,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			return -EINVAL;
 	}
 
-	if (p.flags)
+	if (p.flags & ~IORING_SETUP_IOPOLL)
 		return -EINVAL;
 
 	ret = io_uring_create(entries, &p);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 4589d56d0b68..5c457ea396e6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -30,6 +30,11 @@ struct io_uring_sqe {
 	__u64	__pad2[3];
 };
 
+/*
+ * io_uring_setup() flags
+ */
+#define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
+
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 08/19] fs: add fget_many() and fput_many()
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Some use cases repeatedly get and put references to the same file, but
the only exposed interface does these one at a time. As each of
these entails an atomic inc or dec on a shared structure, that cost can
add up.

Add fget_many(), which works just like fget(), except it takes an
argument for how many references to get on the file. Ditto fput_many(),
which can drop an arbitrary number of references to a file.
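
The intended usage pattern looks roughly like this (batch_get_file() and
batch_put_file() are hypothetical wrappers, shown only to illustrate the
accounting; the real user is the io_uring submission path in a later patch):
take references for a whole batch of IOs up front with one atomic add, then
return whatever was not consumed with one atomic sub.

#include <linux/file.h>
#include <linux/fs.h>

static struct file *batch_get_file(unsigned int fd, unsigned int ios_left,
				   unsigned int *has_refs)
{
	struct file *file;

	/* one atomic add covers up to 'ios_left' uses of this fd */
	file = fget_many(fd, ios_left);
	if (file)
		*has_refs = ios_left;
	return file;
}

static void batch_put_file(struct file *file, unsigned int has_refs,
			   unsigned int used_refs)
{
	/* drop the references we reserved but never used */
	if (has_refs > used_refs)
		fput_many(file, has_refs - used_refs);
}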

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/file.c            | 15 ++++++++++-----
 fs/file_table.c      |  9 +++++++--
 include/linux/file.h |  2 ++
 include/linux/fs.h   |  4 +++-
 4 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 3209ee271c41..97df385d6ab0 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__fget(unsigned int fd, fmode_t mask)
+static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
 {
 	struct files_struct *files = current->files;
 	struct file *file;
@@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 		 */
 		if (file->f_mode & mask)
 			file = NULL;
-		else if (!get_file_rcu(file))
+		else if (!get_file_rcu_many(file, refs))
 			goto loop;
 	}
 	rcu_read_unlock();
@@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask)
 	return file;
 }
 
+struct file *fget_many(unsigned int fd, unsigned int refs)
+{
+	return __fget(fd, FMODE_PATH, refs);
+}
+
 struct file *fget(unsigned int fd)
 {
-	return __fget(fd, FMODE_PATH);
+	return __fget(fd, FMODE_PATH, 1);
 }
 EXPORT_SYMBOL(fget);
 
 struct file *fget_raw(unsigned int fd)
 {
-	return __fget(fd, 0);
+	return __fget(fd, 0, 1);
 }
 EXPORT_SYMBOL(fget_raw);
 
@@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask)
 			return 0;
 		return (unsigned long)file;
 	} else {
-		file = __fget(fd, mask);
+		file = __fget(fd, mask, 1);
 		if (!file)
 			return 0;
 		return FDPUT_FPUT | (unsigned long)file;
diff --git a/fs/file_table.c b/fs/file_table.c
index 5679e7fcb6b0..155d7514a094 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -326,9 +326,9 @@ void flush_delayed_fput(void)
 
 static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);
 
-void fput(struct file *file)
+void fput_many(struct file *file, unsigned int refs)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (atomic_long_sub_and_test(refs, &file->f_count)) {
 		struct task_struct *task = current;
 
 		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
@@ -347,6 +347,11 @@ void fput(struct file *file)
 	}
 }
 
+void fput(struct file *file)
+{
+	fput_many(file, 1);
+}
+
 /*
  * synchronous analog of fput(); for kernel threads that might be needed
  * in some umount() (and thus can't use flush_delayed_fput() without
diff --git a/include/linux/file.h b/include/linux/file.h
index 6b2fb032416c..3fcddff56bc4 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -13,6 +13,7 @@
 struct file;
 
 extern void fput(struct file *);
+extern void fput_many(struct file *, unsigned int);
 
 struct file_operations;
 struct vfsmount;
@@ -44,6 +45,7 @@ static inline void fdput(struct fd fd)
 }
 
 extern struct file *fget(unsigned int fd);
+extern struct file *fget_many(unsigned int fd, unsigned int refs);
 extern struct file *fget_raw(unsigned int fd);
 extern unsigned long __fdget(unsigned int fd);
 extern unsigned long __fdget_raw(unsigned int fd);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 61aa210f0c2b..80e1b199a4b1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f)
 	atomic_long_inc(&f->f_count);
 	return f;
 }
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+#define get_file_rcu_many(x, cnt)	\
+	atomic_long_add_unless(&(x)->f_count, (cnt), 0)
+#define get_file_rcu(x) get_file_rcu_many((x), 1)
 #define fput_atomic(x)	atomic_long_add_unless(&(x)->f_count, -1, 1)
 #define file_count(x)	atomic_long_read(&(x)->f_count)
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 09/19] io_uring: use fget/fput_many() for file references
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Add a separate io_submit_state structure, to cache some of the things
we need for IO submission.

One such example is file reference batching, tracked in io_submit_state.
We get as many references as the number of sqes we are submitting, and
drop unused ones if we end up switching files. The assumption here is
that we're usually only dealing with one fd, and if there are multiple,
hopefully they are at least somewhat ordered. This could trivially be
extended to cover multiple fds, if needed.

On the completion side we do the same thing, except this is trivially
done just locally in io_iopoll_reap().
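
The completion-side batching mentioned above boils down to folding runs of
completions against the same file into a single fput_many(). A standalone
sketch of that idea follows; put_files_batched() is a hypothetical helper,
the in-tree version lives in the io_iopoll completion path in this patch.

#include <linux/file.h>
#include <linux/fs.h>

static void put_files_batched(struct file **files, int nr)
{
	struct file *cur = NULL;
	int i, count = 0;

	for (i = 0; i < nr; i++) {
		if (files[i] == cur) {
			count++;		/* same file as the last one, extend the run */
			continue;
		}
		if (cur)
			fput_many(cur, count);	/* flush the previous run */
		cur = files[i];
		count = 1;
	}
	if (cur)
		fput_many(cur, count);
}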

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 142 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 121 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 30efe5edf6aa..7358dd1dbf3f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -165,6 +165,19 @@ struct io_kiocb {
 #define IO_PLUG_THRESHOLD		2
 #define IO_IOPOLL_BATCH			8
 
+struct io_submit_state {
+	struct blk_plug		plug;
+
+	/*
+	 * File reference cache
+	 */
+	struct file		*file;
+	unsigned int		fd;
+	unsigned int		has_refs;
+	unsigned int		used_refs;
+	unsigned int		ios_left;
+};
+
 static struct kmem_cache *req_cachep;
 
 static const struct file_operations io_uring_fops;
@@ -332,9 +345,11 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 			       struct list_head *done)
 {
 	void *reqs[IO_IOPOLL_BATCH];
+	int file_count, to_free;
+	struct file *file = NULL;
 	struct io_kiocb *req;
-	int to_free = 0;
 
+	file_count = to_free = 0;
 	while (!list_empty(done)) {
 		req = list_first_entry(done, struct io_kiocb, list);
 		list_del(&req->list);
@@ -344,12 +359,28 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		reqs[to_free++] = req;
 		(*nr_events)++;
 
-		fput(req->rw.ki_filp);
+		/*
+		 * Batched puts of the same file, to avoid dirtying the
+		 * file usage count multiple times, if avoidable.
+		 */
+		if (!file) {
+			file = req->rw.ki_filp;
+			file_count = 1;
+		} else if (file == req->rw.ki_filp) {
+			file_count++;
+		} else {
+			fput_many(file, file_count);
+			file = req->rw.ki_filp;
+			file_count = 1;
+		}
+
 		if (to_free == ARRAY_SIZE(reqs))
 			io_free_req_many(ctx, reqs, &to_free);
 	}
 	io_commit_cqring(ctx);
 
+	if (file)
+		fput_many(file, file_count);
 	io_free_req_many(ctx, reqs, &to_free);
 }
 
@@ -530,6 +561,48 @@ static void io_iopoll_req_issued(struct io_kiocb *req)
 		list_add_tail(&req->list, &ctx->poll_list);
 }
 
+static void io_file_put(struct io_submit_state *state, struct file *file)
+{
+	if (!state) {
+		fput(file);
+	} else if (state->file) {
+		int diff = state->has_refs - state->used_refs;
+
+		if (diff)
+			fput_many(state->file, diff);
+		state->file = NULL;
+	}
+}
+
+/*
+ * Get as many references to a file as we have IOs left in this submission,
+ * assuming most submissions are for one file, or at least that each file
+ * has more than one submission.
+ */
+static struct file *io_file_get(struct io_submit_state *state, int fd)
+{
+	if (!state)
+		return fget(fd);
+
+	if (state->file) {
+		if (state->fd == fd) {
+			state->used_refs++;
+			state->ios_left--;
+			return state->file;
+		}
+		io_file_put(state, NULL);
+	}
+	state->file = fget_many(fd, state->ios_left);
+	if (!state->file)
+		return NULL;
+
+	state->fd = fd;
+	state->has_refs = state->ios_left;
+	state->used_refs = 1;
+	state->ios_left--;
+	return state->file;
+}
+
 /*
  * If we tracked the file through the SCM inflight mechanism, we could support
  * any file. For now, just ensure that anything potentially problematic is done
@@ -548,7 +621,7 @@ static bool io_file_supports_async(struct file *file)
 }
 
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
-		      bool force_nonblock)
+		      bool force_nonblock, struct io_submit_state *state)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
@@ -560,7 +633,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return 0;
 
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = fget(fd);
+	kiocb->ki_filp = io_file_get(state, fd);
 	if (unlikely(!kiocb->ki_filp))
 		return -EBADF;
 	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
@@ -604,7 +677,10 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	fput(kiocb->ki_filp);
+	/* in case of error, we didn't use this file reference. drop it. */
+	if (state)
+		state->used_refs--;
+	io_file_put(state, kiocb->ki_filp);
 	return ret;
 }
 
@@ -650,7 +726,7 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 }
 
 static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
-		       bool force_nonblock)
+		       bool force_nonblock, struct io_submit_state *state)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw;
@@ -658,7 +734,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	struct file *file;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	ret = io_prep_rw(req, s->sqe, force_nonblock, state);
 	if (ret)
 		return ret;
 	file = kiocb->ki_filp;
@@ -694,7 +770,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 }
 
 static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
-			bool force_nonblock)
+			bool force_nonblock, struct io_submit_state *state)
 {
 	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
 	struct kiocb *kiocb = &req->rw;
@@ -702,7 +778,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	struct file *file;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	ret = io_prep_rw(req, s->sqe, force_nonblock, state);
 	if (ret)
 		return ret;
 	/* Hold on to the file for -EAGAIN */
@@ -826,7 +902,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 }
 
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			   const struct sqe_submit *s, bool force_nonblock)
+			   const struct sqe_submit *s, bool force_nonblock,
+			   struct io_submit_state *state)
 {
 	ssize_t ret;
 	int opcode;
@@ -841,10 +918,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
-		ret = io_read(req, s, force_nonblock);
+		ret = io_read(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
-		ret = io_write(req, s, force_nonblock);
+		ret = io_write(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
 		ret = io_fsync(req, s->sqe, force_nonblock);
@@ -896,7 +973,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	s->needs_lock = true;
 
 	do {
-		ret = __io_submit_sqe(ctx, req, s, false);
+		ret = __io_submit_sqe(ctx, req, s, false, NULL);
 		/*
 		 * We can get EAGAIN for polled IO even though we're forcing
 		 * a sync submission from here, since we can't wait for
@@ -920,7 +997,8 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	kfree(sqe);
 }
 
-static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
+static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
+			 struct io_submit_state *state)
 {
 	struct io_kiocb *req;
 	ssize_t ret;
@@ -935,7 +1013,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
 
 	req->rw.ki_filp = NULL;
 
-	ret = __io_submit_sqe(ctx, req, s, true);
+	ret = __io_submit_sqe(ctx, req, s, true, state);
 	if (ret == -EAGAIN) {
 		struct io_uring_sqe *sqe_copy;
 
@@ -956,6 +1034,26 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	return ret;
 }
 
+/*
+ * Batched submission is done, ensure local IO is flushed out.
+ */
+static void io_submit_state_end(struct io_submit_state *state)
+{
+	blk_finish_plug(&state->plug);
+	io_file_put(state, NULL);
+}
+
+/*
+ * Start submission side cache.
+ */
+static void io_submit_state_start(struct io_submit_state *state,
+				  struct io_ring_ctx *ctx, unsigned max_ios)
+{
+	blk_start_plug(&state->plug);
+	state->file = NULL;
+	state->ios_left = max_ios;
+}
+
 static void io_commit_sqring(struct io_ring_ctx *ctx)
 {
 	struct io_sq_ring *ring = ctx->sq_ring;
@@ -1023,11 +1121,13 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
+	struct io_submit_state state, *statep = NULL;
 	int i, ret = 0, submit = 0;
-	struct blk_plug plug;
 
-	if (to_submit > IO_PLUG_THRESHOLD)
-		blk_start_plug(&plug);
+	if (to_submit > IO_PLUG_THRESHOLD) {
+		io_submit_state_start(&state, ctx, to_submit);
+		statep = &state;
+	}
 
 	for (i = 0; i < to_submit; i++) {
 		struct sqe_submit s;
@@ -1038,7 +1138,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		s.has_user = true;
 		s.needs_lock = false;
 
-		ret = io_submit_sqe(ctx, &s);
+		ret = io_submit_sqe(ctx, &s, statep);
 		if (ret) {
 			io_drop_sqring(ctx);
 			break;
@@ -1048,8 +1148,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 	}
 	io_commit_sqring(ctx);
 
-	if (to_submit > IO_PLUG_THRESHOLD)
-		blk_finish_plug(&plug);
+	if (statep)
+		io_submit_state_end(statep);
 
 	return submit ? submit : ret;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 10/19] io_uring: batch io_kiocb allocation
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Similar to how we use state->ios_left to know how many file references
to grab, we can use it to allocate the io_kiocb's we need in bulk.
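
As a rough model of the idea (plain user-space C, with malloc() standing
in for the bulk slab allocation; every name here is made up for
illustration):

#include <stdlib.h>

#define REQ_BATCH	8

/* Toy user-space model of the submission-side request cache. */
struct req_cache {
	void		*reqs[REQ_BATCH];
	unsigned int	free_reqs;	/* cached entries not yet handed out */
	unsigned int	cur_req;	/* next cached entry to hand out */
	unsigned int	ios_left;	/* requests left in this submission */
};

/* Hand out one request, refilling the cache in bulk when it runs dry. */
static void *cache_get_req(struct req_cache *cache)
{
	if (!cache->free_reqs) {
		unsigned int i, nr;

		nr = cache->ios_left < REQ_BATCH ? cache->ios_left : REQ_BATCH;
		for (i = 0; i < nr; i++) {
			cache->reqs[i] = malloc(128);	/* stand-in for one slab object */
			if (!cache->reqs[i])
				break;
		}
		if (!i)
			return NULL;
		cache->free_reqs = i;
		cache->cur_req = 0;
	}

	cache->free_reqs--;
	cache->ios_left--;
	return cache->reqs[cache->cur_req++];
}

The real code pulls the whole batch from kmem_cache_alloc_bulk() in one
call and hands any unused entries back with kmem_cache_free_bulk() when
the submission ends.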

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 45 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7358dd1dbf3f..e330252dc5de 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -168,6 +168,13 @@ struct io_kiocb {
 struct io_submit_state {
 	struct blk_plug		plug;
 
+	/*
+	 * io_kiocb alloc cache
+	 */
+	void			*reqs[IO_IOPOLL_BATCH];
+	unsigned int		free_reqs;
+	unsigned int		cur_req;
+
 	/*
 	 * File reference cache
 	 */
@@ -305,20 +312,40 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
 		wake_up(&ctx->wait);
 }
 
-static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
+				   struct io_submit_state *state)
 {
 	struct io_kiocb *req;
 
 	if (!percpu_ref_tryget(&ctx->refs))
 		return NULL;
 
-	req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
-	if (req) {
-		req->ctx = ctx;
-		req->flags = 0;
-		return req;
+	if (!state) {
+		req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
+		if (unlikely(!req))
+			goto out;
+	} else if (!state->free_reqs) {
+		size_t sz;
+		int ret;
+
+		sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->reqs));
+		ret = kmem_cache_alloc_bulk(req_cachep, __GFP_NOWARN, sz,
+						state->reqs);
+		if (unlikely(ret <= 0))
+			goto out;
+		state->free_reqs = ret - 1;
+		state->cur_req = 1;
+		req = state->reqs[0];
+	} else {
+		req = state->reqs[state->cur_req];
+		state->free_reqs--;
+		state->cur_req++;
 	}
 
+	req->ctx = ctx;
+	req->flags = 0;
+	return req;
+out:
 	io_ring_drop_ctx_refs(ctx, 1);
 	return NULL;
 }
@@ -1007,7 +1034,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	if (unlikely(s->sqe->flags))
 		return -EINVAL;
 
-	req = io_get_req(ctx);
+	req = io_get_req(ctx, state);
 	if (unlikely(!req))
 		return -EAGAIN;
 
@@ -1041,6 +1068,9 @@ static void io_submit_state_end(struct io_submit_state *state)
 {
 	blk_finish_plug(&state->plug);
 	io_file_put(state, NULL);
+	if (state->free_reqs)
+		kmem_cache_free_bulk(req_cachep, state->free_reqs,
+					&state->reqs[state->cur_req]);
 }
 
 /*
@@ -1050,6 +1080,7 @@ static void io_submit_state_start(struct io_submit_state *state,
 				  struct io_ring_ctx *ctx, unsigned max_ios)
 {
 	blk_start_plug(&state->plug);
+	state->free_reqs = 0;
 	state->file = NULL;
 	state->ios_left = max_ios;
 }
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

For an ITER_BVEC, we can just iterate the iov and add the pages
to the bio directly. This requires that the caller doesn't release
the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that.

The two current callers of bio_iov_iter_get_pages() are updated to
check if they need to release pages on completion. This makes them
work with bvecs that already contain kernel mapped pages.
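
As a simplified model of that ownership rule (not kernel code; only the
flag name is borrowed from this patch, everything else is invented, and
the flag is used as a plain mask here rather than a bit number):

#include <stdio.h>

#define BIO_NO_PAGE_REF		(1u << 0)	/* pages stay owned by the submitter */

/* Toy stand-in for a bio, just enough to show the completion check. */
struct toy_bio {
	unsigned int	flags;
	unsigned int	nr_pages;
};

/* Completion side: only drop references for pages pinned at submission. */
static void toy_bio_end_io(struct toy_bio *bio)
{
	if (!(bio->flags & BIO_NO_PAGE_REF))
		printf("putting %u pinned user pages\n", bio->nr_pages);
	else
		printf("bvec pages, submitter keeps its own reference\n");
}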

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
 fs/block_dev.c            |  5 ++--
 fs/iomap.c                |  5 ++--
 include/linux/blk_types.h |  1 +
 4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..330df572cfb8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);
 
+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	len = min_t(size_t, bv->bv_len, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
@@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }
 
 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
+ *
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. If we're adding kernel pages, we just have to add the pages to the
+ * bio directly. We don't grab an extra reference to those pages (the user
+ * should already have that), and we don't put the page on IO completion.
+ * The caller needs to check if the bio is flagged BIO_NO_PAGE_REF on IO
+ * completion. If it isn't, then pages should be released.
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
  * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;
 
+	/*
+	 * If this is a BVEC iter, then the pages are kernel pages. Don't
+	 * release them on IO completion.
+	 */
+	if (is_bvec)
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+
 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);
 
 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
@@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work)
 		next = bio->bi_private;
 
 		bio_set_pages_dirty(bio);
-		bio_release_pages(bio);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_release_pages(bio);
 		bio_put(bio);
 	}
 }
@@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio)
 			goto defer;
 	}
 
-	bio_release_pages(bio);
+	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+		bio_release_pages(bio);
 	bio_put(bio);
 	return;
 defer:
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 392e2bfb636f..051ab41d1c61 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -338,8 +338,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 2ac9eb746d44..9389cf0a1c6f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1591,8 +1591,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;
 
-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..791fee35df88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -215,6 +215,7 @@ struct bio {
 /*
  * bio flags
  */
+#define BIO_NO_PAGE_REF	0	/* don't put/release vec pages */
 #define BIO_SEG_VALID	1	/* bi_phys_segments valid */
 #define BIO_CLONED	2	/* doesn't own data */
 #define BIO_BOUNCED	3	/* bio is a bounce bio */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 12/19] io_uring: add support for pre-mapped user IO buffers
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
setup the io_uring. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having set up an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.
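
For example, registering a single buffer with the raw syscall might
look like the sketch below (no liburing helpers, error handling
trimmed; the syscall number and opcode are the ones added by this
patch):

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register		427
#endif
#define IORING_REGISTER_BUFFERS		0

/* Sketch: pin one buffer for the lifetime of the registration. */
static int register_buffer(int ring_fd, void *buf, size_t len)
{
	struct iovec iov = {
		.iov_base	= buf,
		.iov_len	= len,
	};

	/* fd, opcode, arg, nr_args */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_BUFFERS, &iov, 1);
}

Here ring_fd is the fd returned by io_uring_setup(2), and buf should be
an anonymous (not file backed) allocation, e.g. from posix_memalign().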

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point somewhere inside the indexed buffer.
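
A sketch of filling in such an sqe (field names follow the uapi header
this series adds; how the free sqe slot is obtained from the SQ ring is
left out):

#include <string.h>
#include <linux/io_uring.h>

/* Sketch: queue a fixed-buffer read. addr/len must fall inside the
 * registered buffer selected by buf_index. */
static void prep_read_fixed(struct io_uring_sqe *sqe, int fd, void *buf,
			    unsigned int nbytes, unsigned short buf_index,
			    unsigned long long file_off)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = fd;
	sqe->off = file_off;
	sqe->addr = (unsigned long) buf;
	sqe->len = nbytes;
	sqe->buf_index = buf_index;
	sqe->user_data = (unsigned long) buf;	/* echoed back in the cqe */
}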

The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.
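
Swapping buffer sets then boils down to something like this (sketch
only, reusing the includes and defines from the registration example
above; IORING_UNREGISTER_BUFFERS takes no argument):

#define IORING_UNREGISTER_BUFFERS	1

/* Sketch: drop the current registration, then install a new iovec array. */
static int replace_buffers(int ring_fd, struct iovec *iovs, unsigned int nr)
{
	int ret;

	ret = syscall(__NR_io_uring_register, ring_fd,
		      IORING_UNREGISTER_BUFFERS, NULL, 0);
	if (ret < 0)
		return ret;

	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_BUFFERS, iovs, nr);
}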

It's perfectly valid to set up a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per-buffer size limit is also imposed.
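
If registration fails because the limit is too low, checking the
current memlock ceiling is a quick sanity test (sketch):

#include <stdio.h>
#include <sys/resource.h>

/* Print the memlock limit that buffer registration is charged against. */
static void show_memlock_limit(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
		printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
		       (unsigned long long) rl.rlim_cur,
		       (unsigned long long) rl.rlim_max);
}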

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 373 ++++++++++++++++++++++++-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 7 files changed, 380 insertions(+), 15 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e330252dc5de..0eba20d18f53 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -45,6 +45,7 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
@@ -52,6 +53,8 @@
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
+#include <linux/sizes.h>
+#include <linux/hugetlb.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -81,6 +84,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -113,6 +123,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -732,6 +746,46 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	size_t len = READ_ONCE(sqe->len);
+	struct io_mapped_ubuf *imu;
+	unsigned index, buf_index;
+	size_t offset;
+	u64 buf_addr;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	buf_addr = READ_ONCE(sqe->addr);
+
+	/* overflow */
+	if (buf_addr + len < buf_addr)
+		return -EFAULT;
+	/* not inside the mapped region */
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = buf_addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct sqe_submit *s, struct iovec **iovec,
 			   struct iov_iter *iter)
@@ -739,6 +793,23 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	const struct io_uring_sqe *sqe = s->sqe;
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
+	u8 opcode;
+
+	/*
+	 * We're reading ->opcode for the second time, but the first read
+	 * doesn't care whether it's _FIXED or not, so it doesn't matter
+	 * whether ->opcode changes concurrently. The first read does care
+	 * about whether it is a READ or a WRITE, so we don't trust this read
+	 * for that purpose and instead let the caller pass in the read/write
+	 * flag.
+	 */
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
 
 	if (!s->has_user)
 		return -EFAULT;
@@ -886,7 +957,7 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
@@ -945,9 +1016,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -976,28 +1057,46 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	u8 opcode = READ_ONCE(sqe->opcode);
+
+	return !(opcode == IORING_OP_READ_FIXED ||
+		 opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
 	const struct io_uring_sqe *sqe = s->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
+	mm_segment_t old_fs;
+	bool needs_user;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
 	req->flags &= ~REQ_F_FORCE_NONBLOCK;
 	req->rw.ki_flags &= ~IOCB_NOWAIT;
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
-	}
-
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-	s->has_user = true;
 	s->needs_lock = true;
+	s->has_user = false;
+
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
+		s->has_user = true;
+	}
 
 	do {
 		ret = __io_submit_sqe(ctx, req, s, false, NULL);
@@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		cond_resched();
 	} while (1);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, sqe->user_data, ret, 0);
@@ -1308,6 +1409,197 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->nr_user_bufs; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		if (ctx->account_mem)
+			io_unaccount_mem(ctx->user, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	ctx->nr_user_bufs = 0;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct vm_area_struct **vmas = NULL;
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base || !iov.iov_len)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		if (ctx->account_mem) {
+			ret = io_account_mem(ctx->user, nr_pages);
+			if (ret)
+				goto err;
+		}
+
+		ret = 0;
+		if (!pages || nr_pages > got_pages) {
+			kfree(vmas);
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			vmas = kmalloc_array(nr_pages,
+					sizeof(struct vm_area_struct *),
+					GFP_KERNEL);
+			if (!pages || !vmas) {
+				ret = -ENOMEM;
+				if (ctx->account_mem)
+					io_unaccount_mem(ctx->user, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		ret = -ENOMEM;
+		if (!imu->bvec) {
+			if (ctx->account_mem)
+				io_unaccount_mem(ctx->user, nr_pages);
+			goto err;
+		}
+
+		ret = 0;
+		down_read(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
+						pages, vmas);
+		if (pret == nr_pages) {
+			/* don't support file backed memory */
+			for (j = 0; j < nr_pages; j++) {
+				struct vm_area_struct *vma = vmas[j];
+
+				if (vma->vm_file &&
+				    !is_file_hugepages(vma->vm_file)) {
+					ret = -EOPNOTSUPP;
+					break;
+				}
+			}
+		} else {
+			ret = pret < 0 ? pret : -EFAULT;
+		}
+		up_read(&current->mm->mmap_sem);
+		if (ret) {
+			/*
+			 * if we did partial map, or found file backed vmas,
+			 * release any pages we did get
+			 */
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			if (ctx->account_mem)
+				io_unaccount_mem(ctx->user, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	kfree(vmas);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	kfree(vmas);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	if (ctx->sqo_wq)
@@ -1316,6 +1608,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		mmdrop(ctx->sqo_mm);
 
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
@@ -1677,6 +1970,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_reinit(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ctx = f.file->private_data;
+
+	mutex_lock(&ctx->uring_lock);
+	ret = __io_uring_register(ctx, opcode, arg, nr_args);
+	mutex_unlock(&ctx->uring_lock);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5c457ea396e6..cf28f7a11f12 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,7 +27,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -39,6 +42,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -103,4 +108,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 12/19] io_uring: add support for pre-mapped user IO buffers
@ 2019-02-11 19:00   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

If we have fixed user buffers, we can map them into the kernel when we
set up the io_uring. That avoids the need to do get_user_pages() for
each and every IO.

To utilize this feature, the application must call io_uring_register()
after having set up an io_uring instance, passing in
IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
an iovec array, and the nr_args should contain how many iovecs the
application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible
for IO. To use these fixed buffers, the application must use the
IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
must point somewhere inside the indexed buffer.

The application may register buffers throughout the lifetime of the
io_uring instance. It can call io_uring_register() with
IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
buffers, and then register a new set. The application need not
unregister buffers explicitly before shutting down the io_uring
instance.

It's perfectly valid to set up a larger buffer, and then sometimes only
use parts of it for an IO. As long as the range is within the originally
mapped region, it will work just fine.

For now, buffers must not be file backed. If file backed buffers are
passed in, the registration will fail with -1/EOPNOTSUPP. This
restriction may be relaxed in the future.

RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
arbitrary 1G per-buffer size limit is also imposed.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 fs/io_uring.c                          | 373 ++++++++++++++++++++++++-
 include/linux/syscalls.h               |   2 +
 include/uapi/asm-generic/unistd.h      |   4 +-
 include/uapi/linux/io_uring.h          |  13 +-
 kernel/sys_ni.c                        |   1 +
 7 files changed, 380 insertions(+), 15 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 481c126259e9..2eefd2a7c1ce 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -400,3 +400,4 @@
 386	i386	rseq			sys_rseq			__ia32_sys_rseq
 425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
 426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
+427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 6a32a430c8e0..65c026185e61 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -345,6 +345,7 @@
 334	common	rseq			__x64_sys_rseq
 425	common	io_uring_setup		__x64_sys_io_uring_setup
 426	common	io_uring_enter		__x64_sys_io_uring_enter
+427	common	io_uring_register	__x64_sys_io_uring_register
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e330252dc5de..0eba20d18f53 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -45,6 +45,7 @@
 #include <linux/slab.h>
 #include <linux/workqueue.h>
 #include <linux/blkdev.h>
+#include <linux/bvec.h>
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
@@ -52,6 +53,8 @@
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
 #include <linux/nospec.h>
+#include <linux/sizes.h>
+#include <linux/hugetlb.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -81,6 +84,13 @@ struct io_cq_ring {
 	struct io_uring_cqe	cqes[];
 };
 
+struct io_mapped_ubuf {
+	u64		ubuf;
+	size_t		len;
+	struct		bio_vec *bvec;
+	unsigned int	nr_bvecs;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -113,6 +123,10 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/* if used, fixed mapped user buffers */
+	unsigned		nr_user_bufs;
+	struct io_mapped_ubuf	*user_bufs;
+
 	struct user_struct	*user;
 
 	struct completion	ctx_done;
@@ -732,6 +746,46 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
 	}
 }
 
+static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iov_iter *iter)
+{
+	size_t len = READ_ONCE(sqe->len);
+	struct io_mapped_ubuf *imu;
+	unsigned index, buf_index;
+	size_t offset;
+	u64 buf_addr;
+
+	/* attempt to use fixed buffers without having provided iovecs */
+	if (unlikely(!ctx->user_bufs))
+		return -EFAULT;
+
+	buf_index = READ_ONCE(sqe->buf_index);
+	if (unlikely(buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+
+	index = array_index_nospec(buf_index, ctx->nr_user_bufs);
+	imu = &ctx->user_bufs[index];
+	buf_addr = READ_ONCE(sqe->addr);
+
+	/* overflow */
+	if (buf_addr + len < buf_addr)
+		return -EFAULT;
+	/* not inside the mapped region */
+	if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len)
+		return -EFAULT;
+
+	/*
+	 * May not be a start of buffer, set size appropriately
+	 * and advance us to the beginning.
+	 */
+	offset = buf_addr - imu->ubuf;
+	iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len);
+	if (offset)
+		iov_iter_advance(iter, offset);
+	return 0;
+}
+
 static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 			   const struct sqe_submit *s, struct iovec **iovec,
 			   struct iov_iter *iter)
@@ -739,6 +793,23 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	const struct io_uring_sqe *sqe = s->sqe;
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
+	u8 opcode;
+
+	/*
+	 * We're reading ->opcode for the second time, but the first read
+	 * doesn't care whether it's _FIXED or not, so it doesn't matter
+	 * whether ->opcode changes concurrently. The first read does care
+	 * about whether it is a READ or a WRITE, so we don't trust this read
+	 * for that purpose and instead let the caller pass in the read/write
+	 * flag.
+	 */
+	opcode = READ_ONCE(sqe->opcode);
+	if (opcode == IORING_OP_READ_FIXED ||
+	    opcode == IORING_OP_WRITE_FIXED) {
+		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
+		*iovec = NULL;
+		return ret;
+	}
 
 	if (!s->has_user)
 		return -EFAULT;
@@ -886,7 +957,7 @@ static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (unlikely(sqe->addr || sqe->ioprio))
+	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
@@ -945,9 +1016,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		ret = io_nop(req, req->user_data);
 		break;
 	case IORING_OP_READV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
 		ret = io_read(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_WRITEV:
+		if (unlikely(s->sqe->buf_index))
+			return -EINVAL;
+		ret = io_write(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_READ_FIXED:
+		ret = io_read(req, s, force_nonblock, state);
+		break;
+	case IORING_OP_WRITE_FIXED:
 		ret = io_write(req, s, force_nonblock, state);
 		break;
 	case IORING_OP_FSYNC:
@@ -976,28 +1057,46 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+{
+	u8 opcode = READ_ONCE(sqe->opcode);
+
+	return !(opcode == IORING_OP_READ_FIXED ||
+		 opcode == IORING_OP_WRITE_FIXED);
+}
+
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
 	struct sqe_submit *s = &req->submit;
 	const struct io_uring_sqe *sqe = s->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
-	mm_segment_t old_fs = get_fs();
+	mm_segment_t old_fs;
+	bool needs_user;
 	int ret;
 
 	 /* Ensure we clear previously set forced non-block flag */
 	req->flags &= ~REQ_F_FORCE_NONBLOCK;
 	req->rw.ki_flags &= ~IOCB_NOWAIT;
 
-	if (!mmget_not_zero(ctx->sqo_mm)) {
-		ret = -EFAULT;
-		goto err;
-	}
-
-	use_mm(ctx->sqo_mm);
-	set_fs(USER_DS);
-	s->has_user = true;
 	s->needs_lock = true;
+	s->has_user = false;
+
+	/*
+	 * If we're doing IO to fixed buffers, we don't need to get/set
+	 * user context
+	 */
+	needs_user = io_sqe_needs_user(s->sqe);
+	if (needs_user) {
+		if (!mmget_not_zero(ctx->sqo_mm)) {
+			ret = -EFAULT;
+			goto err;
+		}
+		use_mm(ctx->sqo_mm);
+		old_fs = get_fs();
+		set_fs(USER_DS);
+		s->has_user = true;
+	}
 
 	do {
 		ret = __io_submit_sqe(ctx, req, s, false, NULL);
@@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		cond_resched();
 	} while (1);
 
-	set_fs(old_fs);
-	unuse_mm(ctx->sqo_mm);
-	mmput(ctx->sqo_mm);
+	if (needs_user) {
+		set_fs(old_fs);
+		unuse_mm(ctx->sqo_mm);
+		mmput(ctx->sqo_mm);
+	}
 err:
 	if (ret) {
 		io_cqring_add_event(ctx, sqe->user_data, ret, 0);
@@ -1308,6 +1409,197 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
 	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
 }
 
+static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
+{
+	int i, j;
+
+	if (!ctx->user_bufs)
+		return -ENXIO;
+
+	for (i = 0; i < ctx->nr_user_bufs; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+
+		for (j = 0; j < imu->nr_bvecs; j++)
+			put_page(imu->bvec[j].bv_page);
+
+		if (ctx->account_mem)
+			io_unaccount_mem(ctx->user, imu->nr_bvecs);
+		kfree(imu->bvec);
+		imu->nr_bvecs = 0;
+	}
+
+	kfree(ctx->user_bufs);
+	ctx->user_bufs = NULL;
+	ctx->nr_user_bufs = 0;
+	return 0;
+}
+
+static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
+		       void __user *arg, unsigned index)
+{
+	struct iovec __user *src;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat) {
+		struct compat_iovec __user *ciovs;
+		struct compat_iovec ciov;
+
+		ciovs = (struct compat_iovec __user *) arg;
+		if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov)))
+			return -EFAULT;
+
+		dst->iov_base = (void __user *) (unsigned long) ciov.iov_base;
+		dst->iov_len = ciov.iov_len;
+		return 0;
+	}
+#endif
+	src = (struct iovec __user *) arg;
+	if (copy_from_user(dst, &src[index], sizeof(*dst)))
+		return -EFAULT;
+	return 0;
+}
+
+static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
+				  unsigned nr_args)
+{
+	struct vm_area_struct **vmas = NULL;
+	struct page **pages = NULL;
+	int i, j, got_pages = 0;
+	int ret = -EINVAL;
+
+	if (ctx->user_bufs)
+		return -EBUSY;
+	if (!nr_args || nr_args > UIO_MAXIOV)
+		return -EINVAL;
+
+	ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
+					GFP_KERNEL);
+	if (!ctx->user_bufs)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
+		unsigned long off, start, end, ubuf;
+		int pret, nr_pages;
+		struct iovec iov;
+		size_t size;
+
+		ret = io_copy_iov(ctx, &iov, arg, i);
+		if (ret)
+			break;
+
+		/*
+		 * Don't impose further limits on the size and buffer
+		 * constraints here, we'll -EINVAL later when IO is
+		 * submitted if they are wrong.
+		 */
+		ret = -EFAULT;
+		if (!iov.iov_base || !iov.iov_len)
+			goto err;
+
+		/* arbitrary limit, but we need something */
+		if (iov.iov_len > SZ_1G)
+			goto err;
+
+		ubuf = (unsigned long) iov.iov_base;
+		end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+		start = ubuf >> PAGE_SHIFT;
+		nr_pages = end - start;
+
+		if (ctx->account_mem) {
+			ret = io_account_mem(ctx->user, nr_pages);
+			if (ret)
+				goto err;
+		}
+
+		ret = 0;
+		if (!pages || nr_pages > got_pages) {
+			kfree(vmas);
+			kfree(pages);
+			pages = kmalloc_array(nr_pages, sizeof(struct page *),
+						GFP_KERNEL);
+			vmas = kmalloc_array(nr_pages,
+					sizeof(struct vm_area_struct *),
+					GFP_KERNEL);
+			if (!pages || !vmas) {
+				ret = -ENOMEM;
+				if (ctx->account_mem)
+					io_unaccount_mem(ctx->user, nr_pages);
+				goto err;
+			}
+			got_pages = nr_pages;
+		}
+
+		imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
+						GFP_KERNEL);
+		ret = -ENOMEM;
+		if (!imu->bvec) {
+			if (ctx->account_mem)
+				io_unaccount_mem(ctx->user, nr_pages);
+			goto err;
+		}
+
+		ret = 0;
+		down_read(&current->mm->mmap_sem);
+		pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
+						pages, vmas);
+		if (pret == nr_pages) {
+			/* don't support file backed memory */
+			for (j = 0; j < nr_pages; j++) {
+				struct vm_area_struct *vma = vmas[j];
+
+				if (vma->vm_file &&
+				    !is_file_hugepages(vma->vm_file)) {
+					ret = -EOPNOTSUPP;
+					break;
+				}
+			}
+		} else {
+			ret = pret < 0 ? pret : -EFAULT;
+		}
+		up_read(&current->mm->mmap_sem);
+		if (ret) {
+			/*
+			 * if we did partial map, or found file backed vmas,
+			 * release any pages we did get
+			 */
+			if (pret > 0) {
+				for (j = 0; j < pret; j++)
+					put_page(pages[j]);
+			}
+			if (ctx->account_mem)
+				io_unaccount_mem(ctx->user, nr_pages);
+			goto err;
+		}
+
+		off = ubuf & ~PAGE_MASK;
+		size = iov.iov_len;
+		for (j = 0; j < nr_pages; j++) {
+			size_t vec_len;
+
+			vec_len = min_t(size_t, size, PAGE_SIZE - off);
+			imu->bvec[j].bv_page = pages[j];
+			imu->bvec[j].bv_len = vec_len;
+			imu->bvec[j].bv_offset = off;
+			off = 0;
+			size -= vec_len;
+		}
+		/* store original address for later verification */
+		imu->ubuf = ubuf;
+		imu->len = iov.iov_len;
+		imu->nr_bvecs = nr_pages;
+	}
+	kfree(pages);
+	kfree(vmas);
+	ctx->nr_user_bufs = nr_args;
+	return 0;
+err:
+	kfree(pages);
+	kfree(vmas);
+	io_sqe_buffer_unregister(ctx);
+	return ret;
+}
+
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
 	if (ctx->sqo_wq)
@@ -1316,6 +1608,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		mmdrop(ctx->sqo_mm);
 
 	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
 
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
@@ -1677,6 +1970,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
 	return io_uring_setup(entries, params);
 }
 
+static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
+			       void __user *arg, unsigned nr_args)
+{
+	int ret;
+
+	percpu_ref_kill(&ctx->refs);
+	wait_for_completion(&ctx->ctx_done);
+
+	switch (opcode) {
+	case IORING_REGISTER_BUFFERS:
+		ret = io_sqe_buffer_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_BUFFERS:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_buffer_unregister(ctx);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	/* bring the ctx back to life */
+	reinit_completion(&ctx->ctx_done);
+	percpu_ref_reinit(&ctx->refs);
+	return ret;
+}
+
+SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
+		void __user *, arg, unsigned int, nr_args)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ctx = f.file->private_data;
+
+	mutex_lock(&ctx->uring_lock);
+	ret = __io_uring_register(ctx, opcode, arg, nr_args);
+	mutex_unlock(&ctx->uring_lock);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
 static int __init io_uring_init(void)
 {
 	req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3072dbaa7869..3681c05ac538 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries,
 asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit,
 				u32 min_complete, u32 flags,
 				const sigset_t __user *sig, size_t sigsz);
+asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op,
+				void __user *arg, unsigned int nr_args);
 
 /* fs/xattr.c */
 asmlinkage long sys_setxattr(const char __user *path, const char __user *name,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 87871e7b7ea7..d346229a1eb0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load,     sys_kexec_file_load)
 __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
 #define __NR_io_uring_enter 426
 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
+#define __NR_io_uring_register 427
+__SYSCALL(__NR_io_uring_register, sys_io_uring_register)
 
 #undef __NR_syscalls
-#define __NR_syscalls 427
+#define __NR_syscalls 428
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5c457ea396e6..cf28f7a11f12 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,7 +27,10 @@ struct io_uring_sqe {
 		__u32		fsync_flags;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
-	__u64	__pad2[3];
+	union {
+		__u16	buf_index;	/* index into fixed buffers, if used */
+		__u64	__pad2[3];
+	};
 };
 
 /*
@@ -39,6 +42,8 @@ struct io_uring_sqe {
 #define IORING_OP_READV		1
 #define IORING_OP_WRITEV	2
 #define IORING_OP_FSYNC		3
+#define IORING_OP_READ_FIXED	4
+#define IORING_OP_WRITE_FIXED	5
 
 /*
  * sqe->fsync_flags
@@ -103,4 +108,10 @@ struct io_uring_params {
 	struct io_cqring_offsets cq_off;
 };
 
+/*
+ * io_uring_register(2) opcodes and arguments
+ */
+#define IORING_REGISTER_BUFFERS		0
+#define IORING_UNREGISTER_BUFFERS	1
+
 #endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ee5e523564bb..1bb6604dc19f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
 COND_SYSCALL(io_uring_setup);
 COND_SYSCALL(io_uring_enter);
+COND_SYSCALL(io_uring_register);
 
 /* fs/xattr.c */
 
-- 
2.17.1
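
As a usage note for the fixed-buffer hunks above (illustration only, not
part of this patch): an application registers its iovecs once with
IORING_REGISTER_BUFFERS and then issues IORING_OP_READ_FIXED /
IORING_OP_WRITE_FIXED with sqe->buf_index naming the registered slot and
sqe->addr/len lying inside that slot, which is what io_import_fixed()
verifies. A hedged sketch, assuming the uapi header from this series is
installed as <linux/io_uring.h> and ignoring errors for brevity:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif

static void *fixed_buf;

/* Register a single page-aligned buffer as fixed buffer index 0 */
static int register_one_buffer(int ring_fd, size_t len)
{
	struct iovec iov;

	if (posix_memalign(&fixed_buf, 4096, len))
		return -1;
	iov.iov_base = fixed_buf;
	iov.iov_len = len;
	/* arg is an iovec array, nr_args is the number of iovecs */
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_BUFFERS, &iov, 1);
}

/* Read 'len' bytes (len <= registered size) into the fixed buffer */
static void prep_read_fixed(struct io_uring_sqe *sqe, int fd, size_t len)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = fd;
	sqe->addr = (unsigned long) fixed_buf;	/* inside the registered region */
	sqe->len = len;
	sqe->buf_index = 0;			/* which registered buffer */
}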


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 13/19] net: split out functions related to registering inflight socket files
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We need this functionality for the io_uring file registration, but
we cannot rely on it since CONFIG_UNIX can be modular. Move the helpers
to a separate file that is always built into the kernel whenever
CONFIG_UNIX is enabled (m or y).

No functional changes in this patch, just moving code around.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/net/af_unix.h |   1 +
 net/Makefile          |   2 +-
 net/unix/Kconfig      |   5 ++
 net/unix/Makefile     |   2 +
 net/unix/af_unix.c    |  63 +-----------------
 net/unix/garbage.c    |  71 +-------------------
 net/unix/scm.c        | 151 ++++++++++++++++++++++++++++++++++++++++++
 net/unix/scm.h        |  10 +++
 8 files changed, 174 insertions(+), 131 deletions(-)
 create mode 100644 net/unix/scm.c
 create mode 100644 net/unix/scm.h

diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index ddbba838d048..3426d6dacc45 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -10,6 +10,7 @@
 
 void unix_inflight(struct user_struct *user, struct file *fp);
 void unix_notinflight(struct user_struct *user, struct file *fp);
+void unix_destruct_scm(struct sk_buff *skb);
 void unix_gc(void);
 void wait_for_unix_gc(void);
 struct sock *unix_get_socket(struct file *filp);
diff --git a/net/Makefile b/net/Makefile
index bdaf53925acd..449fc0b221f8 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -18,7 +18,7 @@ obj-$(CONFIG_NETFILTER)		+= netfilter/
 obj-$(CONFIG_INET)		+= ipv4/
 obj-$(CONFIG_TLS)		+= tls/
 obj-$(CONFIG_XFRM)		+= xfrm/
-obj-$(CONFIG_UNIX)		+= unix/
+obj-$(CONFIG_UNIX_SCM)		+= unix/
 obj-$(CONFIG_NET)		+= ipv6/
 obj-$(CONFIG_BPFILTER)		+= bpfilter/
 obj-$(CONFIG_PACKET)		+= packet/
diff --git a/net/unix/Kconfig b/net/unix/Kconfig
index 8b31ab85d050..3b9e450656a4 100644
--- a/net/unix/Kconfig
+++ b/net/unix/Kconfig
@@ -19,6 +19,11 @@ config UNIX
 
 	  Say Y unless you know what you are doing.
 
+config UNIX_SCM
+	bool
+	depends on UNIX
+	default y
+
 config UNIX_DIAG
 	tristate "UNIX: socket monitoring interface"
 	depends on UNIX
diff --git a/net/unix/Makefile b/net/unix/Makefile
index ffd0a275c3a7..54e58cc4f945 100644
--- a/net/unix/Makefile
+++ b/net/unix/Makefile
@@ -10,3 +10,5 @@ unix-$(CONFIG_SYSCTL)	+= sysctl_net_unix.o
 
 obj-$(CONFIG_UNIX_DIAG)	+= unix_diag.o
 unix_diag-y		:= diag.o
+
+obj-$(CONFIG_UNIX_SCM)	+= scm.o
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 74d1eed7cbd4..2ce32dbb2feb 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -119,6 +119,8 @@
 #include <linux/freezer.h>
 #include <linux/file.h>
 
+#include "scm.h"
+
 struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE];
 EXPORT_SYMBOL_GPL(unix_socket_table);
 DEFINE_SPINLOCK(unix_table_lock);
@@ -1486,67 +1488,6 @@ static int unix_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
 	return err;
 }
 
-static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
-{
-	int i;
-
-	scm->fp = UNIXCB(skb).fp;
-	UNIXCB(skb).fp = NULL;
-
-	for (i = scm->fp->count-1; i >= 0; i--)
-		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
-}
-
-static void unix_destruct_scm(struct sk_buff *skb)
-{
-	struct scm_cookie scm;
-	memset(&scm, 0, sizeof(scm));
-	scm.pid  = UNIXCB(skb).pid;
-	if (UNIXCB(skb).fp)
-		unix_detach_fds(&scm, skb);
-
-	/* Alas, it calls VFS */
-	/* So fscking what? fput() had been SMP-safe since the last Summer */
-	scm_destroy(&scm);
-	sock_wfree(skb);
-}
-
-/*
- * The "user->unix_inflight" variable is protected by the garbage
- * collection lock, and we just read it locklessly here. If you go
- * over the limit, there might be a tiny race in actually noticing
- * it across threads. Tough.
- */
-static inline bool too_many_unix_fds(struct task_struct *p)
-{
-	struct user_struct *user = current_user();
-
-	if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
-		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
-	return false;
-}
-
-static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
-{
-	int i;
-
-	if (too_many_unix_fds(current))
-		return -ETOOMANYREFS;
-
-	/*
-	 * Need to duplicate file references for the sake of garbage
-	 * collection.  Otherwise a socket in the fps might become a
-	 * candidate for GC while the skb is not yet queued.
-	 */
-	UNIXCB(skb).fp = scm_fp_dup(scm->fp);
-	if (!UNIXCB(skb).fp)
-		return -ENOMEM;
-
-	for (i = scm->fp->count - 1; i >= 0; i--)
-		unix_inflight(scm->fp->user, scm->fp->fp[i]);
-	return 0;
-}
-
 static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
 {
 	int err = 0;
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index f81854d74c7d..8bbe1b8e4ff7 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -86,80 +86,13 @@
 #include <net/scm.h>
 #include <net/tcp_states.h>
 
+#include "scm.h"
+
 /* Internal data structures and random procedures: */
 
-static LIST_HEAD(gc_inflight_list);
 static LIST_HEAD(gc_candidates);
-static DEFINE_SPINLOCK(unix_gc_lock);
 static DECLARE_WAIT_QUEUE_HEAD(unix_gc_wait);
 
-unsigned int unix_tot_inflight;
-
-struct sock *unix_get_socket(struct file *filp)
-{
-	struct sock *u_sock = NULL;
-	struct inode *inode = file_inode(filp);
-
-	/* Socket ? */
-	if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) {
-		struct socket *sock = SOCKET_I(inode);
-		struct sock *s = sock->sk;
-
-		/* PF_UNIX ? */
-		if (s && sock->ops && sock->ops->family == PF_UNIX)
-			u_sock = s;
-	} else {
-		/* Could be an io_uring instance */
-		u_sock = io_uring_get_socket(filp);
-	}
-	return u_sock;
-}
-
-/* Keep the number of times in flight count for the file
- * descriptor if it is for an AF_UNIX socket.
- */
-
-void unix_inflight(struct user_struct *user, struct file *fp)
-{
-	struct sock *s = unix_get_socket(fp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (s) {
-		struct unix_sock *u = unix_sk(s);
-
-		if (atomic_long_inc_return(&u->inflight) == 1) {
-			BUG_ON(!list_empty(&u->link));
-			list_add_tail(&u->link, &gc_inflight_list);
-		} else {
-			BUG_ON(list_empty(&u->link));
-		}
-		unix_tot_inflight++;
-	}
-	user->unix_inflight++;
-	spin_unlock(&unix_gc_lock);
-}
-
-void unix_notinflight(struct user_struct *user, struct file *fp)
-{
-	struct sock *s = unix_get_socket(fp);
-
-	spin_lock(&unix_gc_lock);
-
-	if (s) {
-		struct unix_sock *u = unix_sk(s);
-
-		BUG_ON(!atomic_long_read(&u->inflight));
-		BUG_ON(list_empty(&u->link));
-
-		if (atomic_long_dec_and_test(&u->inflight))
-			list_del_init(&u->link);
-		unix_tot_inflight--;
-	}
-	user->unix_inflight--;
-	spin_unlock(&unix_gc_lock);
-}
-
 static void scan_inflight(struct sock *x, void (*func)(struct unix_sock *),
 			  struct sk_buff_head *hitlist)
 {
diff --git a/net/unix/scm.c b/net/unix/scm.c
new file mode 100644
index 000000000000..8c40f2b32392
--- /dev/null
+++ b/net/unix/scm.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/socket.h>
+#include <linux/net.h>
+#include <linux/fs.h>
+#include <net/af_unix.h>
+#include <net/scm.h>
+#include <linux/init.h>
+
+#include "scm.h"
+
+unsigned int unix_tot_inflight;
+EXPORT_SYMBOL(unix_tot_inflight);
+
+LIST_HEAD(gc_inflight_list);
+EXPORT_SYMBOL(gc_inflight_list);
+
+DEFINE_SPINLOCK(unix_gc_lock);
+EXPORT_SYMBOL(unix_gc_lock);
+
+struct sock *unix_get_socket(struct file *filp)
+{
+	struct sock *u_sock = NULL;
+	struct inode *inode = file_inode(filp);
+
+	/* Socket ? */
+	if (S_ISSOCK(inode->i_mode) && !(filp->f_mode & FMODE_PATH)) {
+		struct socket *sock = SOCKET_I(inode);
+		struct sock *s = sock->sk;
+
+		/* PF_UNIX ? */
+		if (s && sock->ops && sock->ops->family == PF_UNIX)
+			u_sock = s;
+	} else {
+		/* Could be an io_uring instance */
+		u_sock = io_uring_get_socket(filp);
+	}
+	return u_sock;
+}
+EXPORT_SYMBOL(unix_get_socket);
+
+/* Keep the number of times in flight count for the file
+ * descriptor if it is for an AF_UNIX socket.
+ */
+void unix_inflight(struct user_struct *user, struct file *fp)
+{
+	struct sock *s = unix_get_socket(fp);
+
+	spin_lock(&unix_gc_lock);
+
+	if (s) {
+		struct unix_sock *u = unix_sk(s);
+
+		if (atomic_long_inc_return(&u->inflight) == 1) {
+			BUG_ON(!list_empty(&u->link));
+			list_add_tail(&u->link, &gc_inflight_list);
+		} else {
+			BUG_ON(list_empty(&u->link));
+		}
+		unix_tot_inflight++;
+	}
+	user->unix_inflight++;
+	spin_unlock(&unix_gc_lock);
+}
+
+void unix_notinflight(struct user_struct *user, struct file *fp)
+{
+	struct sock *s = unix_get_socket(fp);
+
+	spin_lock(&unix_gc_lock);
+
+	if (s) {
+		struct unix_sock *u = unix_sk(s);
+
+		BUG_ON(!atomic_long_read(&u->inflight));
+		BUG_ON(list_empty(&u->link));
+
+		if (atomic_long_dec_and_test(&u->inflight))
+			list_del_init(&u->link);
+		unix_tot_inflight--;
+	}
+	user->unix_inflight--;
+	spin_unlock(&unix_gc_lock);
+}
+
+/*
+ * The "user->unix_inflight" variable is protected by the garbage
+ * collection lock, and we just read it locklessly here. If you go
+ * over the limit, there might be a tiny race in actually noticing
+ * it across threads. Tough.
+ */
+static inline bool too_many_unix_fds(struct task_struct *p)
+{
+	struct user_struct *user = current_user();
+
+	if (unlikely(user->unix_inflight > task_rlimit(p, RLIMIT_NOFILE)))
+		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
+	return false;
+}
+
+int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+	int i;
+
+	if (too_many_unix_fds(current))
+		return -ETOOMANYREFS;
+
+	/*
+	 * Need to duplicate file references for the sake of garbage
+	 * collection.  Otherwise a socket in the fps might become a
+	 * candidate for GC while the skb is not yet queued.
+	 */
+	UNIXCB(skb).fp = scm_fp_dup(scm->fp);
+	if (!UNIXCB(skb).fp)
+		return -ENOMEM;
+
+	for (i = scm->fp->count - 1; i >= 0; i--)
+		unix_inflight(scm->fp->user, scm->fp->fp[i]);
+	return 0;
+}
+EXPORT_SYMBOL(unix_attach_fds);
+
+void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
+{
+	int i;
+
+	scm->fp = UNIXCB(skb).fp;
+	UNIXCB(skb).fp = NULL;
+
+	for (i = scm->fp->count-1; i >= 0; i--)
+		unix_notinflight(scm->fp->user, scm->fp->fp[i]);
+}
+EXPORT_SYMBOL(unix_detach_fds);
+
+void unix_destruct_scm(struct sk_buff *skb)
+{
+	struct scm_cookie scm;
+
+	memset(&scm, 0, sizeof(scm));
+	scm.pid  = UNIXCB(skb).pid;
+	if (UNIXCB(skb).fp)
+		unix_detach_fds(&scm, skb);
+
+	/* Alas, it calls VFS */
+	/* So fscking what? fput() had been SMP-safe since the last Summer */
+	scm_destroy(&scm);
+	sock_wfree(skb);
+}
+EXPORT_SYMBOL(unix_destruct_scm);
diff --git a/net/unix/scm.h b/net/unix/scm.h
new file mode 100644
index 000000000000..5a255a477f16
--- /dev/null
+++ b/net/unix/scm.h
@@ -0,0 +1,10 @@
+#ifndef NET_UNIX_SCM_H
+#define NET_UNIX_SCM_H
+
+extern struct list_head gc_inflight_list;
+extern spinlock_t unix_gc_lock;
+
+int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb);
+void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb);
+
+#endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 14/19] io_uring: add file set registration
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The argument passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.
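
As an illustration only (not part of this patch), a minimal userspace
sketch of the flow described above might look like the following. It
assumes the uapi header from this series is installed as
<linux/io_uring.h>, uses a raw syscall wrapper with the syscall number
added earlier in the series, and omits all error handling:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/uio.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif

/* Pin an array of __s32 file descriptors for the lifetime of the ring */
static int register_files(int ring_fd, const __s32 *fds, unsigned int nr_fds)
{
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_FILES, fds, nr_fds);
}

/* Prep a readv against registered file 'file_index' instead of a real fd */
static void prep_fixed_readv(struct io_uring_sqe *sqe, int file_index,
			     const struct iovec *iov, unsigned int nr_vecs)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;	/* fd below is an array index */
	sqe->fd = file_index;		/* index into the registered file set */
	sqe->addr = (unsigned long) iov;
	sqe->len = nr_vecs;
}

On the kernel side, the sqe->fd index is resolved through
ctx->user_files[] in io_prep_rw() and io_prep_fsync(), as the hunks
below show.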

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 267 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |   9 +-
 2 files changed, 246 insertions(+), 30 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0eba20d18f53..167c7f96666f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -49,6 +49,7 @@
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
+#include <net/scm.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -61,6 +62,7 @@
 #include "internal.h"
 
 #define IORING_MAX_ENTRIES	4096
+#define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
@@ -123,6 +125,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -170,6 +180,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -404,15 +415,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -544,13 +557,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -666,19 +685,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
 	/* For -EAGAIN retry, everything is already prepped */
 	if (kiocb->ki_filp)
 		return 0;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
-	if (unlikely(!kiocb->ki_filp))
-		return -EBADF;
-	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
-		force_nonblock = false;
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+		if (unlikely(!kiocb->ki_filp))
+			return -EBADF;
+		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+			force_nonblock = false;
+	}
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -718,10 +747,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	/* in case of error, we didn't use this file reference. drop it. */
-	if (state)
-		state->used_refs--;
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE)) {
+		/*
+		 * in case of error, we didn't use this file reference. drop it.
+		 */
+		if (state)
+			state->used_refs--;
+		io_file_put(state, kiocb->ki_filp);
+	}
 	return ret;
 }
 
@@ -863,7 +896,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 out_fput:
 	/* Hold on to the file for -EAGAIN */
 	if (unlikely(ret && ret != -EAGAIN))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -917,7 +950,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -940,7 +973,7 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	 */
 	if (req->rw.ki_filp) {
 		err = -EBADF;
-		fput(req->rw.ki_filp);
+		io_fput(req);
 	}
 	io_cqring_add_event(ctx, user_data, err, 0);
 	io_free_req(req);
@@ -949,21 +982,32 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 
 static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned flags;
 	int fd;
 
 	/* Prep already done */
 	if (req->rw.ki_filp)
 		return 0;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	req->rw.ki_filp = fget(fd);
-	if (unlikely(!req->rw.ki_filp))
-		return -EBADF;
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		req->rw.ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		req->rw.ki_filp = fget(fd);
+		if (unlikely(!req->rw.ki_filp))
+			return -EBADF;
+	}
 
 	return 0;
 }
@@ -993,7 +1037,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 				end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(req->rw.ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1132,7 +1176,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
 }
 
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock) {
+		struct sock *sock = ctx->ring_sock->sk;
+		struct sk_buff *skb;
+
+		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
+			kfree_skb(skb);
+	}
+#else
+	int i;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+#endif
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	__io_sqe_files_unregister(ctx);
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	return 0;
+}
+
+#if defined(CONFIG_UNIX)
+/*
+ * Ensure the UNIX gc is aware of our file set, so we are certain that
+ * the io_uring can be safely unregistered on process exit, even if we have
+ * loops in the file referencing.
+ */
+static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
+{
+	struct sock *sk = ctx->ring_sock->sk;
+	struct scm_fp_list *fpl;
+	struct sk_buff *skb;
+	int i;
+
+	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
+	if (!fpl)
+		return -ENOMEM;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb) {
+		kfree(fpl);
+		return -ENOMEM;
+	}
+
+	skb->sk = sk;
+	skb->destructor = unix_destruct_scm;
+
+	fpl->user = get_uid(ctx->user);
+	for (i = 0; i < nr; i++) {
+		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
+		unix_inflight(fpl->user, fpl->fp[i]);
+	}
+
+	fpl->max = fpl->count = nr;
+	UNIXCB(skb).fp = fpl;
+	refcount_add(skb->truesize, &sk->sk_wmem_alloc);
+	skb_queue_head(&sk->sk_receive_queue, skb);
+
+	for (i = 0; i < nr; i++)
+		fput(fpl->fp[i]);
+
+	return 0;
+}
+
+/*
+ * If UNIX sockets are enabled, fd passing can cause a reference cycle which
+ * causes regular reference counting to break down. We rely on the UNIX
+ * garbage collection to take care of this problem for us.
+ */
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	unsigned left, total;
+	int ret = 0;
+
+	total = 0;
+	left = ctx->nr_user_files;
+	while (left) {
+		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
+		int ret;
+
+		ret = __io_sqe_files_scm(ctx, this_files, total);
+		if (ret)
+			break;
+		left -= this_files;
+		total += this_files;
+	}
+
+	return ret;
+}
+#else
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	return 0;
+}
+#endif
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_FIXED_FILES)
+		return -EMFILE;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		/*
+		 * Don't allow io_uring instances to be registered. If UNIX
+		 * isn't enabled, then this causes a reference cycle and this
+		 * instance can never get freed. If UNIX is enabled we'll
+		 * handle it just fine, but there's still no point in allowing
+		 * a ring fd as it doesn't support regular read/write anyway.
+		 */
+		if (ctx->user_files[i]->f_op == &io_uring_fops) {
+			fput(ctx->user_files[i]);
+			break;
+		}
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (!ret)
+		ret = io_sqe_files_scm(ctx);
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1609,6 +1808,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 
 	io_iopoll_reap_events(ctx);
 	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
 
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
@@ -1988,6 +2188,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cf28f7a11f12..6257478d55e9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -113,5 +118,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 14/19] io_uring: add file set registration
@ 2019-02-11 19:00   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The argument passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 267 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |   9 +-
 2 files changed, 246 insertions(+), 30 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0eba20d18f53..167c7f96666f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -49,6 +49,7 @@
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
+#include <net/scm.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -61,6 +62,7 @@
 #include "internal.h"
 
 #define IORING_MAX_ENTRIES	4096
+#define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
@@ -123,6 +125,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -170,6 +180,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -404,15 +415,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -544,13 +557,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -666,19 +685,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
 	/* For -EAGAIN retry, everything is already prepped */
 	if (kiocb->ki_filp)
 		return 0;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
-	if (unlikely(!kiocb->ki_filp))
-		return -EBADF;
-	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
-		force_nonblock = false;
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+		if (unlikely(!kiocb->ki_filp))
+			return -EBADF;
+		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+			force_nonblock = false;
+	}
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -718,10 +747,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	/* in case of error, we didn't use this file reference. drop it. */
-	if (state)
-		state->used_refs--;
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE)) {
+		/*
+		 * in case of error, we didn't use this file reference. drop it.
+		 */
+		if (state)
+			state->used_refs--;
+		io_file_put(state, kiocb->ki_filp);
+	}
 	return ret;
 }
 
@@ -863,7 +896,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 out_fput:
 	/* Hold on to the file for -EAGAIN */
 	if (unlikely(ret && ret != -EAGAIN))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -917,7 +950,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -940,7 +973,7 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	 */
 	if (req->rw.ki_filp) {
 		err = -EBADF;
-		fput(req->rw.ki_filp);
+		io_fput(req);
 	}
 	io_cqring_add_event(ctx, user_data, err, 0);
 	io_free_req(req);
@@ -949,21 +982,32 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 
 static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned flags;
 	int fd;
 
 	/* Prep already done */
 	if (req->rw.ki_filp)
 		return 0;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	req->rw.ki_filp = fget(fd);
-	if (unlikely(!req->rw.ki_filp))
-		return -EBADF;
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		req->rw.ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		req->rw.ki_filp = fget(fd);
+		if (unlikely(!req->rw.ki_filp))
+			return -EBADF;
+	}
 
 	return 0;
 }
@@ -993,7 +1037,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 				end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(req->rw.ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1132,7 +1176,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
 }
 
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock) {
+		struct sock *sock = ctx->ring_sock->sk;
+		struct sk_buff *skb;
+
+		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
+			kfree_skb(skb);
+	}
+#else
+	int i;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+#endif
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	__io_sqe_files_unregister(ctx);
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	return 0;
+}
+
+#if defined(CONFIG_UNIX)
+/*
+ * Ensure the UNIX gc is aware of our file set, so we are certain that
+ * the io_uring can be safely unregistered on process exit, even if we have
+ * loops in the file referencing.
+ */
+static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
+{
+	struct sock *sk = ctx->ring_sock->sk;
+	struct scm_fp_list *fpl;
+	struct sk_buff *skb;
+	int i;
+
+	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
+	if (!fpl)
+		return -ENOMEM;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb) {
+		kfree(fpl);
+		return -ENOMEM;
+	}
+
+	skb->sk = sk;
+	skb->destructor = unix_destruct_scm;
+
+	fpl->user = get_uid(ctx->user);
+	for (i = 0; i < nr; i++) {
+		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
+		unix_inflight(fpl->user, fpl->fp[i]);
+	}
+
+	fpl->max = fpl->count = nr;
+	UNIXCB(skb).fp = fpl;
+	refcount_add(skb->truesize, &sk->sk_wmem_alloc);
+	skb_queue_head(&sk->sk_receive_queue, skb);
+
+	for (i = 0; i < nr; i++)
+		fput(fpl->fp[i]);
+
+	return 0;
+}
+
+/*
+ * If UNIX sockets are enabled, fd passing can cause a reference cycle which
+ * causes regular reference counting to break down. We rely on the UNIX
+ * garbage collection to take care of this problem for us.
+ */
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	unsigned left, total;
+	int ret = 0;
+
+	total = 0;
+	left = ctx->nr_user_files;
+	while (left) {
+		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
+		int ret;
+
+		ret = __io_sqe_files_scm(ctx, this_files, total);
+		if (ret)
+			break;
+		left -= this_files;
+		total += this_files;
+	}
+
+	return ret;
+}
+#else
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	return 0;
+}
+#endif
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_FIXED_FILES)
+		return -EMFILE;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		/*
+		 * Don't allow io_uring instances to be registered. If UNIX
+		 * isn't enabled, then this causes a reference cycle and this
+		 * instance can never get freed. If UNIX is enabled we'll
+		 * handle it just fine, but there's still no point in allowing
+		 * a ring fd as it doesn't support regular read/write anyway.
+		 */
+		if (ctx->user_files[i]->f_op == &io_uring_fops) {
+			fput(ctx->user_files[i]);
+			break;
+		}
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (!ret)
+		ret = io_sqe_files_scm(ctx);
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1609,6 +1808,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 
 	io_iopoll_reap_events(ctx);
 	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
 
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
@@ -1988,6 +2188,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cf28f7a11f12..6257478d55e9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -113,5 +118,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1
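
As a rough userspace illustration of the interface added above (not part of
the patch): the io_uring_register() wrapper below is an assumed thin shim
around the new syscall, and get-an-sqe plumbing is left out. Only the
IORING_REGISTER_FILES opcode, the IOSQE_FIXED_FILE flag, and the rule that
sqe->fd becomes an index into the registered array come from the patch
itself.

#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/* assumed thin wrapper around the io_uring_register(2) syscall */
extern int io_uring_register(int fd, unsigned int opcode, void *arg,
			     unsigned int nr_args);

static int register_files(int ring_fd, int *fds, unsigned int nr)
{
	/* hand the whole descriptor array to the kernel in one call */
	return io_uring_register(ring_fd, IORING_REGISTER_FILES, fds, nr);
}

static void prep_fixed_readv(struct io_uring_sqe *sqe, unsigned int file_index,
			     const struct iovec *iov, unsigned int nr_vecs,
			     __u64 offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;	/* fd below is an index, not a real fd */
	sqe->fd = file_index;
	sqe->addr = (unsigned long) iov;
	sqe->len = nr_vecs;
	sqe->off = offset;
}

Registered files stay pinned until IORING_UNREGISTER_FILES or ring teardown,
which is what lets the fast path skip the per-request fget()/fput().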


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 15/19] io_uring: add submission polling
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

This enables an application to do IO without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.

By default, we allow 1 second of active spinning. This can be changed
by passing in a different grace period (sq_thread_idle) at io_uring_setup(2) time.
If the thread exceeds this idle time without having any work to do, it
will set:

sq_ring->flags |= IORING_SQ_NEED_WAKEUP.

The application will have to call io_uring_enter() to start things back
up again. If IO is kept busy, that will never be needed. Basically an
application that has this feature enabled will guard its
io_uring_enter(2) call with:

read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
	io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

instead of calling it unconditionally.

It's mandatory to use fixed files with this feature. Failure to do so
will result in the application getting an -EBADF CQ entry when
submitting IO.
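
A slightly fuller sketch of the guard above, assuming the application has
already mmap()ed the SQ ring and that io_uring_setup()/io_uring_enter() are
thin wrappers around the new syscalls; the wrappers and the sq_flags pointer
are application plumbing, not part of this patch:

#include <stdatomic.h>
#include <string.h>
#include <linux/io_uring.h>

/* assumed thin wrappers around the io_uring syscalls */
extern int io_uring_setup(unsigned int entries, struct io_uring_params *p);
extern int io_uring_enter(int fd, unsigned int to_submit,
			  unsigned int min_complete, unsigned int flags);

static int setup_sqpoll_ring(unsigned int entries, struct io_uring_params *p)
{
	memset(p, 0, sizeof(*p));
	p->flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
	p->sq_thread_cpu = 0;		/* pin the poll thread to CPU 0 */
	p->sq_thread_idle = 2000;	/* spin up to 2000 msec before sleeping */
	return io_uring_setup(entries, p);
}

/* sq_flags points at the flags word of the mmap()ed SQ ring */
static void kick_sq_thread_if_idle(int ring_fd, volatile unsigned int *sq_flags)
{
	/* read barrier, pairs with the kernel side setting NEED_WAKEUP */
	atomic_thread_fence(memory_order_acquire);

	if (*sq_flags & IORING_SQ_NEED_WAKEUP)
		io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
}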

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 248 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |  12 +-
 2 files changed, 252 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 167c7f96666f..24c280076e81 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -44,6 +44,7 @@
 #include <linux/percpu.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
+#include <linux/kthread.h>
 #include <linux/blkdev.h>
 #include <linux/bvec.h>
 #include <linux/net.h>
@@ -108,12 +109,16 @@ struct io_ring_ctx {
 		unsigned		cached_sq_head;
 		unsigned		sq_entries;
 		unsigned		sq_mask;
+		unsigned		sq_thread_idle;
 		struct io_uring_sqe	*sq_sqes;
 	} ____cacheline_aligned_in_smp;
 
 	/* IO offload */
 	struct workqueue_struct	*sqo_wq;
+	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
+	wait_queue_head_t	sqo_wait;
+	unsigned		sqo_stop;
 
 	struct {
 		/* CQ ring */
@@ -168,6 +173,7 @@ struct sqe_submit {
 	unsigned short			index;
 	bool				has_user;
 	bool				needs_lock;
+	bool				needs_fixed_file;
 };
 
 struct io_kiocb {
@@ -327,6 +333,8 @@ static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
 
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
+	if (waitqueue_active(&ctx->sqo_wait))
+		wake_up(&ctx->sqo_wait);
 }
 
 static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
@@ -680,9 +688,10 @@ static bool io_file_supports_async(struct file *file)
 	return false;
 }
 
-static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 		      bool force_nonblock, struct io_submit_state *state)
 {
+	const struct io_uring_sqe *sqe = s->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
 	unsigned ioprio, flags;
@@ -702,6 +711,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		kiocb->ki_filp = ctx->user_files[fd];
 		req->flags |= REQ_F_FIXED_FILE;
 	} else {
+		if (s->needs_fixed_file)
+			return -EBADF;
 		kiocb->ki_filp = io_file_get(state, fd);
 		if (unlikely(!kiocb->ki_filp))
 			return -EBADF;
@@ -865,7 +876,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	struct file *file;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, s->sqe, force_nonblock, state);
+	ret = io_prep_rw(req, s, force_nonblock, state);
 	if (ret)
 		return ret;
 	file = kiocb->ki_filp;
@@ -909,7 +920,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	struct file *file;
 	ssize_t ret;
 
-	ret = io_prep_rw(req, s->sqe, force_nonblock, state);
+	ret = io_prep_rw(req, s, force_nonblock, state);
 	if (ret)
 		return ret;
 	/* Hold on to the file for -EAGAIN */
@@ -1295,6 +1306,170 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	return false;
 }
 
+static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
+			  unsigned int nr, bool has_user, bool mm_fault)
+{
+	struct io_submit_state state, *statep = NULL;
+	int ret, i, submitted = 0;
+
+	if (nr > IO_PLUG_THRESHOLD) {
+		io_submit_state_start(&state, ctx, nr);
+		statep = &state;
+	}
+
+	for (i = 0; i < nr; i++) {
+		if (unlikely(mm_fault)) {
+			ret = -EFAULT;
+		} else {
+			sqes[i].has_user = has_user;
+			sqes[i].needs_lock = true;
+			sqes[i].needs_fixed_file = true;
+			ret = io_submit_sqe(ctx, &sqes[i], statep);
+		}
+		if (!ret) {
+			submitted++;
+			continue;
+		}
+
+		io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0);
+	}
+
+	if (statep)
+		io_submit_state_end(&state);
+
+	return submitted;
+}
+
+static int io_sq_thread(void *data)
+{
+	struct sqe_submit sqes[IO_IOPOLL_BATCH];
+	struct io_ring_ctx *ctx = data;
+	struct mm_struct *cur_mm = NULL;
+	mm_segment_t old_fs;
+	DEFINE_WAIT(wait);
+	unsigned inflight;
+	unsigned long timeout;
+
+	old_fs = get_fs();
+	set_fs(USER_DS);
+
+	timeout = inflight = 0;
+	while (!kthread_should_stop() && !ctx->sqo_stop) {
+		bool all_fixed, mm_fault = false;
+		int i;
+
+		if (inflight) {
+			unsigned nr_events = 0;
+
+			if (ctx->flags & IORING_SETUP_IOPOLL) {
+				/*
+				 * We disallow the app entering submit/complete
+				 * with polling, but we still need to lock the
+				 * ring to prevent racing with polled issue
+				 * that got punted to a workqueue.
+				 */
+				mutex_lock(&ctx->uring_lock);
+				io_iopoll_check(ctx, &nr_events, 0);
+				mutex_unlock(&ctx->uring_lock);
+			} else {
+				/*
+				 * Normal IO, just pretend everything completed.
+				 * We don't have to poll completions for that.
+				 */
+				nr_events = inflight;
+			}
+
+			inflight -= nr_events;
+			if (!inflight)
+				timeout = jiffies + ctx->sq_thread_idle;
+		}
+
+		if (!io_get_sqring(ctx, &sqes[0])) {
+			/*
+			 * We're polling. If we're within the defined idle
+			 * period, then let us spin without work before going
+			 * to sleep.
+			 */
+			if (inflight || !time_after(jiffies, timeout)) {
+				cpu_relax();
+				continue;
+			}
+
+			/*
+			 * Drop cur_mm before scheduling, we can't hold it for
+			 * long periods (or over schedule()). Do this before
+			 * adding ourselves to the waitqueue, as the unuse/drop
+			 * may sleep.
+			 */
+			if (cur_mm) {
+				unuse_mm(cur_mm);
+				mmput(cur_mm);
+				cur_mm = NULL;
+			}
+
+			prepare_to_wait(&ctx->sqo_wait, &wait,
+						TASK_INTERRUPTIBLE);
+
+			/* Tell userspace we may need a wakeup call */
+			ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+
+			if (!io_get_sqring(ctx, &sqes[0])) {
+				if (kthread_should_stop()) {
+					finish_wait(&ctx->sqo_wait, &wait);
+					break;
+				}
+				if (signal_pending(current))
+					flush_signals(current);
+				schedule();
+				finish_wait(&ctx->sqo_wait, &wait);
+
+				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+				smp_wmb();
+				continue;
+			}
+			finish_wait(&ctx->sqo_wait, &wait);
+
+			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+		}
+
+		i = 0;
+		all_fixed = true;
+		do {
+			if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
+				all_fixed = false;
+
+			i++;
+			if (i == ARRAY_SIZE(sqes))
+				break;
+		} while (io_get_sqring(ctx, &sqes[i]));
+
+		io_commit_sqring(ctx);
+
+		/* Unless all new commands are FIXED regions, grab mm */
+		if (!all_fixed && !cur_mm) {
+			mm_fault = !mmget_not_zero(ctx->sqo_mm);
+			if (!mm_fault) {
+				use_mm(ctx->sqo_mm);
+				cur_mm = ctx->sqo_mm;
+			}
+		}
+
+		inflight += io_submit_sqes(ctx, sqes, i, cur_mm != NULL,
+						mm_fault);
+	}
+
+	io_iopoll_reap_events(ctx);
+
+	set_fs(old_fs);
+	if (cur_mm) {
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
+	}
+	return 0;
+}
+
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
@@ -1313,6 +1488,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 
 		s.has_user = true;
 		s.needs_lock = false;
+		s.needs_fixed_file = false;
 
 		ret = io_submit_sqe(ctx, &s, statep);
 		if (ret) {
@@ -1534,13 +1710,47 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-static int io_sq_offload_start(struct io_ring_ctx *ctx)
+static int io_sq_offload_start(struct io_ring_ctx *ctx,
+			       struct io_uring_params *p)
 {
 	int ret;
 
+	init_waitqueue_head(&ctx->sqo_wait);
 	mmgrab(current->mm);
 	ctx->sqo_mm = current->mm;
 
+	ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle);
+	if (!ctx->sq_thread_idle)
+		ctx->sq_thread_idle = HZ;
+
+	ret = -EINVAL;
+	if (!cpu_possible(p->sq_thread_cpu))
+		goto err;
+
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (p->flags & IORING_SETUP_SQ_AFF) {
+			int cpu;
+
+			cpu = array_index_nospec(p->sq_thread_cpu, NR_CPUS);
+			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
+							ctx, cpu,
+							"io_uring-sq");
+		} else {
+			ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
+							"io_uring-sq");
+		}
+		if (IS_ERR(ctx->sqo_thread)) {
+			ret = PTR_ERR(ctx->sqo_thread);
+			ctx->sqo_thread = NULL;
+			goto err;
+		}
+		wake_up_process(ctx->sqo_thread);
+	} else if (p->flags & IORING_SETUP_SQ_AFF) {
+		/* Can't have SQ_AFF without SQPOLL */
+		ret = -EINVAL;
+		goto err;
+	}
+
 	/* Do QD, or 2 * CPUS, whatever is smallest */
 	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
 			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
@@ -1551,6 +1761,12 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 
 	return 0;
 err:
+	if (ctx->sqo_thread) {
+		ctx->sqo_stop = 1;
+		mb();
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	mmdrop(ctx->sqo_mm);
 	ctx->sqo_mm = NULL;
 	return ret;
@@ -1801,6 +2017,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
+	if (ctx->sqo_thread) {
+		ctx->sqo_stop = 1;
+		mb();
+		kthread_stop(ctx->sqo_thread);
+	}
 	if (ctx->sqo_wq)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
@@ -1910,7 +2131,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	int submitted = 0;
 	struct fd f;
 
-	if (flags & ~IORING_ENTER_GETEVENTS)
+	if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP))
 		return -EINVAL;
 
 	f = fdget(fd);
@@ -1926,6 +2147,18 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (!percpu_ref_tryget(&ctx->refs))
 		goto out_fput;
 
+	/*
+	 * For SQ polling, the thread will do all submissions and completions.
+	 * Just return the requested submit count, and wake the thread if
+	 * we were asked to.
+	 */
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (flags & IORING_ENTER_SQ_WAKEUP)
+			wake_up(&ctx->sqo_wait);
+		submitted = to_submit;
+		goto out_ctx;
+	}
+
 	if (to_submit) {
 		to_submit = min(to_submit, ctx->sq_entries);
 
@@ -2103,7 +2336,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
 	if (ret)
 		goto err;
 
-	ret = io_sq_offload_start(ctx);
+	ret = io_sq_offload_start(ctx, p);
 	if (ret)
 		goto err;
 
@@ -2151,7 +2384,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			return -EINVAL;
 	}
 
-	if (p.flags & ~IORING_SETUP_IOPOLL)
+	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
+			IORING_SETUP_SQ_AFF))
 		return -EINVAL;
 
 	ret = io_uring_create(entries, &p);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6257478d55e9..0ec74bab8dbe 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -42,6 +42,8 @@ struct io_uring_sqe {
  * io_uring_setup() flags
  */
 #define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
+#define IORING_SETUP_SQPOLL	(1U << 1)	/* SQ poll thread */
+#define IORING_SETUP_SQ_AFF	(1U << 2)	/* sq_thread_cpu is valid */
 
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
@@ -86,6 +88,11 @@ struct io_sqring_offsets {
 	__u64 resv2;
 };
 
+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
+
 struct io_cqring_offsets {
 	__u32 head;
 	__u32 tail;
@@ -100,6 +107,7 @@ struct io_cqring_offsets {
  * io_uring_enter(2) flags
  */
 #define IORING_ENTER_GETEVENTS	(1U << 0)
+#define IORING_ENTER_SQ_WAKEUP	(1U << 1)
 
 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
@@ -108,7 +116,9 @@ struct io_uring_params {
 	__u32 sq_entries;
 	__u32 cq_entries;
 	__u32 flags;
-	__u32 resv[7];
+	__u32 sq_thread_cpu;
+	__u32 sq_thread_idle;
+	__u32 resv[5];
 	struct io_sqring_offsets sq_off;
 	struct io_cqring_offsets cq_off;
 };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 16/19] io_uring: add io_kiocb ref count
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We'll use this for the POLL implementation. Regular requests will
NOT be using references, so initialize it to 0. Any real use of
the io_kiocb ref will initialize it to at least 2.
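
A tiny userspace model of that policy, using plain C11 atomics as a stand-in
for refcount_t; everything here is illustrative, none of it is kernel code:

#include <stdatomic.h>
#include <stdlib.h>

struct req {
	atomic_uint refs;	/* 0 means refcounting is unused */
};

static void req_free(struct req *req)
{
	/* free if refs were never taken, or if this was the last put */
	if (atomic_load(&req->refs) == 0 ||
	    atomic_fetch_sub(&req->refs, 1) == 1)
		free(req);
}

int main(void)
{
	struct req *r = calloc(1, sizeof(*r));

	if (!r)
		return 1;
	atomic_init(&r->refs, 2);	/* "real use": two owners of the request */
	req_free(r);			/* first put: still alive */
	req_free(r);			/* second put: freed here */
	return 0;
}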

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 24c280076e81..33b6c6167595 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -184,6 +184,7 @@ struct io_kiocb {
 	struct io_ring_ctx	*ctx;
 	struct list_head	list;
 	unsigned int		flags;
+	refcount_t		refs;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
@@ -377,6 +378,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
 
 	req->ctx = ctx;
 	req->flags = 0;
+	refcount_set(&req->refs, 0);
 	return req;
 out:
 	io_ring_drop_ctx_refs(ctx, 1);
@@ -394,8 +396,10 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
 
 static void io_free_req(struct io_kiocb *req)
 {
-	io_ring_drop_ctx_refs(req->ctx, 1);
-	kmem_cache_free(req_cachep, req);
+	if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) {
+		io_ring_drop_ctx_refs(req->ctx, 1);
+		kmem_cache_free(req_cachep, req);
+	}
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 17/19] io_uring: add support for IORING_OP_POLL
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

This is basically a direct port of bfe4037e722e, which implements a
one-shot poll command through aio. Description below is based on that
commit as well. However, instead of adding a POLL command and relying
on io_cancel(2) to remove it, we mimic the epoll(2) interface of
having a command to add a poll notification, IORING_OP_POLL_ADD,
and one to remove it again, IORING_OP_POLL_REMOVE.

To poll a file descriptor, the application should submit an sqe of
type IORING_OP_POLL_ADD. It will poll the fd for the events specified
in the poll_events field.

Unlike poll or epoll without EPOLLONESHOT, this interface always works
in one-shot mode: once the sqe is completed, it will have to be
resubmitted.
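
For illustration, filling in the two new sqe types from userspace might look
like the sketch below. The helper functions are assumed application code;
only the opcodes and the field usage (poll_events for POLL_ADD, addr carrying
the original user_data for POLL_REMOVE) come from this patch.

#include <poll.h>
#include <string.h>
#include <linux/io_uring.h>

static void prep_poll_add(struct io_uring_sqe *sqe, int fd, unsigned long tag)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_POLL_ADD;
	sqe->fd = fd;
	sqe->poll_events = POLLIN;	/* events to wait for */
	sqe->user_data = tag;		/* echoed back in the cqe */
	/* one-shot: re-arm by submitting a new POLL_ADD after the cqe */
}

static void prep_poll_remove(struct io_uring_sqe *sqe, unsigned long tag)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_POLL_REMOVE;
	sqe->addr = tag;		/* user_data of the poll to cancel */
}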

Reviewed-by: Hannes Reinecke <hare@suse.com>
Based-on-code-from: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 261 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   3 +
 2 files changed, 263 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 33b6c6167595..a0513d4bc35d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -161,6 +161,7 @@ struct io_ring_ctx {
 		 * manipulate the list, hence no extra locking is needed there.
 		 */
 		struct list_head	poll_list;
+		struct list_head	cancel_list;
 	} ____cacheline_aligned_in_smp;
 
 #if defined(CONFIG_UNIX)
@@ -176,8 +177,20 @@ struct sqe_submit {
 	bool				needs_fixed_file;
 };
 
+struct io_poll_iocb {
+	struct file			*file;
+	struct wait_queue_head		*head;
+	__poll_t			events;
+	bool				woken;
+	bool				canceled;
+	struct wait_queue_entry		wait;
+};
+
 struct io_kiocb {
-	struct kiocb		rw;
+	union {
+		struct kiocb		rw;
+		struct io_poll_iocb	poll;
+	};
 
 	struct sqe_submit	submit;
 
@@ -261,6 +274,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
+	INIT_LIST_HEAD(&ctx->cancel_list);
 	return ctx;
 }
 
@@ -1058,6 +1072,244 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }
 
+static void io_poll_remove_one(struct io_kiocb *req)
+{
+	struct io_poll_iocb *poll = &req->poll;
+
+	spin_lock(&poll->head->lock);
+	WRITE_ONCE(poll->canceled, true);
+	if (!list_empty(&poll->wait.entry)) {
+		list_del_init(&poll->wait.entry);
+		queue_work(req->ctx->sqo_wq, &req->work);
+	}
+	spin_unlock(&poll->head->lock);
+
+	list_del_init(&req->list);
+}
+
+static void io_poll_remove_all(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	spin_lock_irq(&ctx->completion_lock);
+	while (!list_empty(&ctx->cancel_list)) {
+		req = list_first_entry(&ctx->cancel_list, struct io_kiocb,list);
+		io_poll_remove_one(req);
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+}
+
+/*
+ * Find a running poll command that matches one specified in sqe->addr,
+ * and remove it if found.
+ */
+static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *poll_req, *next;
+	int ret = -ENOENT;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index ||
+	    sqe->poll_events)
+		return -EINVAL;
+
+	spin_lock_irq(&ctx->completion_lock);
+	list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) {
+		if (READ_ONCE(sqe->addr) == poll_req->user_data) {
+			io_poll_remove_one(poll_req);
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static void io_poll_complete(struct io_kiocb *req, __poll_t mask)
+{
+	io_cqring_add_event(req->ctx, req->user_data, mangle_poll(mask), 0);
+	io_fput(req);
+	io_free_req(req);
+}
+
+static void io_poll_complete_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct io_poll_iocb *poll = &req->poll;
+	struct poll_table_struct pt = { ._key = poll->events };
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = 0;
+
+	if (!READ_ONCE(poll->canceled))
+		mask = vfs_poll(poll->file, &pt) & poll->events;
+
+	/*
+	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
+	 * calling ->ki_cancel.  We need the ctx_lock roundtrip here to
+	 * synchronize with them.  In the cancellation case the list_del_init
+	 * itself is not actually needed, but harmless so we keep it in to
+	 * avoid further branches in the fast path.
+	 */
+	spin_lock_irq(&ctx->completion_lock);
+	if (!mask && !READ_ONCE(poll->canceled)) {
+		add_wait_queue(poll->head, &poll->wait);
+		spin_unlock_irq(&ctx->completion_lock);
+		return;
+	}
+	list_del_init(&req->list);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_poll_complete(req, mask);
+}
+
+static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb,
+							wait);
+	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = key_to_poll(key);
+
+	poll->woken = true;
+
+	/* for instances that support it check for an event match first: */
+	if (mask) {
+		if (!(mask & poll->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&ctx->completion_lock)) {
+			list_del(&req->list);
+			spin_unlock(&ctx->completion_lock);
+
+			list_del_init(&poll->wait.entry);
+			io_poll_complete(req, mask);
+			return 1;
+		}
+	}
+
+	list_del_init(&poll->wait.entry);
+	queue_work(ctx->sqo_wq, &req->work);
+	return 1;
+}
+
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	if (unlikely(pt->req->poll.head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	pt->req->poll.head = head;
+	add_wait_queue(head, &pt->req->poll.wait);
+}
+
+static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_poll_iocb *poll = &req->poll;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_poll_table ipt;
+	unsigned flags;
+	__poll_t mask;
+	u16 events;
+	int fd;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index)
+		return -EINVAL;
+
+	INIT_WORK(&req->work, io_poll_complete_work);
+	events = READ_ONCE(sqe->poll_events);
+	poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
+
+	flags = READ_ONCE(sqe->flags);
+	fd = READ_ONCE(sqe->fd);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		poll->file = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		poll->file = fget(fd);
+	}
+	if (unlikely(!poll->file))
+		return -EBADF;
+
+	poll->head = NULL;
+	poll->woken = false;
+	poll->canceled = false;
+
+	ipt.pt._qproc = io_poll_queue_proc;
+	ipt.pt._key = poll->events;
+	ipt.req = req;
+	ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
+
+	/* initialized the list so that we can do list_empty checks */
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, io_poll_wake);
+
+	/* one for removal from waitqueue, one for this function */
+	refcount_set(&req->refs, 2);
+
+	mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
+	if (unlikely(!poll->head)) {
+		/* we did not manage to set up a waitqueue, done */
+		goto out;
+	}
+
+	spin_lock_irq(&ctx->completion_lock);
+	spin_lock(&poll->head->lock);
+	if (poll->woken) {
+		/* wake_up context handles the rest */
+		mask = 0;
+		ipt.error = 0;
+	} else if (mask || ipt.error) {
+		/* if we get an error or a mask we are done */
+		WARN_ON_ONCE(list_empty(&poll->wait.entry));
+		list_del_init(&poll->wait.entry);
+	} else {
+		/* actually waiting for an event */
+		list_add_tail(&req->list, &ctx->cancel_list);
+	}
+	spin_unlock(&poll->head->lock);
+	spin_unlock_irq(&ctx->completion_lock);
+
+out:
+	if (unlikely(ipt.error)) {
+		if (!(flags & IOSQE_FIXED_FILE))
+			fput(poll->file);
+		/*
+		 * Drop one of our refs to this req, __io_submit_sqe() will
+		 * drop the other one since we're returning an error.
+		 */
+		io_free_req(req);
+		return ipt.error;
+	}
+
+	if (mask)
+		io_poll_complete(req, mask);
+	io_free_req(req);
+	return 0;
+}
+
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   const struct sqe_submit *s, bool force_nonblock,
 			   struct io_submit_state *state)
@@ -1093,6 +1345,12 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_FSYNC:
 		ret = io_fsync(req, s->sqe, force_nonblock);
 		break;
+	case IORING_OP_POLL_ADD:
+		ret = io_poll_add(req, s->sqe);
+		break;
+	case IORING_OP_POLL_REMOVE:
+		ret = io_poll_remove(req, s->sqe);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -2081,6 +2339,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);
 
+	io_poll_remove_all(ctx);
 	io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
 	io_ring_ctx_free(ctx);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0ec74bab8dbe..e23408692118 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -25,6 +25,7 @@ struct io_uring_sqe {
 	union {
 		__kernel_rwf_t	rw_flags;
 		__u32		fsync_flags;
+		__u16		poll_events;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -51,6 +52,8 @@ struct io_uring_sqe {
 #define IORING_OP_FSYNC		3
 #define IORING_OP_READ_FIXED	4
 #define IORING_OP_WRITE_FIXED	5
+#define IORING_OP_POLL_ADD	6
+#define IORING_OP_POLL_REMOVE	7
 
 /*
  * sqe->fsync_flags
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 18/19] io_uring: allow workqueue item to handle multiple buffered requests
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Right now we punt any buffered request that ends up triggering an
-EAGAIN to an async workqueue. This works fine in terms of providing
async execution, but it can also create quite a lot of work
queue items. For sequential buffered IO, it's advantageous to
serialize their issue. For reads, the first one will trigger a
read-ahead, and subsequent requests merely end up waiting on later pages
to complete. For writes, devices usually respond better to streamed
sequential writes.

Add state to track the last buffered request we punted to a work queue,
and if the next one is sequential to the previous, attempt to get the
previous work item to handle it. We limit the number of sequential
add-ons to a multiple (8) of the max read-ahead size of the file. This
should be a good number for both reads and writes, as it defines the
max IO size the device can do directly.

This drastically cuts down on the number of context switches we need to
handle buffered sequential IO, and a basic test case of copying a big
file with io_uring sees a 5x speedup.
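
As a point of reference, the kind of copy workload that exercises this path
looks roughly like the sketch below. This is not the author's test case; it
uses the liburing helpers mentioned in the cover letter, and the helper
names/signatures are assumed from that library:

/*
 * Rough sketch: queue a batch of sequential buffered reads, the pattern
 * that previously produced one workqueue item per -EAGAIN punt.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/uio.h>
#include "liburing.h"

#define BS	(64 * 1024)
#define NR	32

static int read_batch(const char *path)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct iovec iov[NR];
	int fd, i, ret;

	fd = open(path, O_RDONLY);	/* buffered, not O_DIRECT */
	if (fd < 0 || io_uring_queue_init(NR, &ring, 0) < 0)
		return -1;

	for (i = 0; i < NR; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		iov[i].iov_base = malloc(BS);
		iov[i].iov_len = BS;
		/* sequential offsets: candidates for the same async_list */
		io_uring_prep_readv(sqe, fd, &iov[i], 1, (off_t) i * BS);
		sqe->user_data = i;
	}

	ret = io_uring_submit(&ring);
	for (i = 0; i < ret; i++) {
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	for (i = 0; i < NR; i++)
		free(iov[i].iov_base);
	return 0;
}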

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c | 281 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 229 insertions(+), 52 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a0513d4bc35d..ce446f59f092 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -94,6 +94,16 @@ struct io_mapped_ubuf {
 	unsigned int	nr_bvecs;
 };
 
+struct async_list {
+	spinlock_t		lock;
+	atomic_t		cnt;
+	struct list_head	list;
+
+	struct file		*file;
+	off_t			io_end;
+	size_t			io_pages;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -164,6 +174,8 @@ struct io_ring_ctx {
 		struct list_head	cancel_list;
 	} ____cacheline_aligned_in_smp;
 
+	struct async_list	pending_async[2];
+
 #if defined(CONFIG_UNIX)
 	struct socket		*ring_sock;
 #endif
@@ -201,6 +213,7 @@ struct io_kiocb {
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
+#define REQ_F_SEQ_PREV		8	/* sequential with previous */
 	u64			user_data;
 	u64			error;
 
@@ -257,6 +270,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
 static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 {
 	struct io_ring_ctx *ctx;
+	int i;
 
 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
@@ -272,6 +286,11 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_completion(&ctx->ctx_done);
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
+	for (i = 0; i < ARRAY_SIZE(ctx->pending_async); i++) {
+		spin_lock_init(&ctx->pending_async[i].lock);
+		INIT_LIST_HEAD(&ctx->pending_async[i].list);
+		atomic_set(&ctx->pending_async[i].cnt, 0);
+	}
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->cancel_list);
@@ -885,6 +904,47 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
 }
 
+/*
+ * Make a note of the last file/offset/direction we punted to async
+ * context. We'll use this information to see if we can piggy back a
+ * sequential request onto the previous one, if it still hasn't been
+ * completed by the async worker.
+ */
+static void io_async_list_note(int rw, struct io_kiocb *req, size_t len)
+{
+	struct async_list *async_list = &req->ctx->pending_async[rw];
+	struct kiocb *kiocb = &req->rw;
+	struct file *filp = kiocb->ki_filp;
+	off_t io_end = kiocb->ki_pos + len;
+
+	if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) {
+		unsigned long max_pages;
+
+		/* Use 8x RA size as a decent limiter for both reads/writes */
+		max_pages = filp->f_ra.ra_pages;
+		if (!max_pages)
+			max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10);
+		max_pages *= 8;
+
+		/* If max pages are exceeded, reset the state */
+		len >>= PAGE_SHIFT;
+		if (async_list->io_pages + len <= max_pages) {
+			req->flags |= REQ_F_SEQ_PREV;
+			async_list->io_pages += len;
+		} else {
+			io_end = 0;
+			async_list->io_pages = 0;
+		}
+	}
+
+	/* New file? Reset state. */
+	if (async_list->file != filp) {
+		async_list->io_pages = 0;
+		async_list->file = filp;
+	}
+	async_list->io_end = io_end;
+}
+
 static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 		       bool force_nonblock, struct io_submit_state *state)
 {
@@ -892,6 +952,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	struct kiocb *kiocb = &req->rw;
 	struct iov_iter iter;
 	struct file *file;
+	size_t iov_count;
 	ssize_t ret;
 
 	ret = io_prep_rw(req, s, force_nonblock, state);
@@ -910,16 +971,24 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	if (ret)
 		goto out_fput;
 
-	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	iov_count = iov_iter_count(&iter);
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
 		ssize_t ret2;
 
 		/* Catch -EAGAIN return for forced non-blocking submission */
 		ret2 = call_read_iter(file, kiocb, &iter);
-		if (!force_nonblock || ret2 != -EAGAIN)
+		if (!force_nonblock || ret2 != -EAGAIN) {
 			io_rw_done(kiocb, ret2);
-		else
+		} else {
+			/*
+			 * If ->needs_lock is true, we're already in async
+			 * context.
+			 */
+			if (!s->needs_lock)
+				io_async_list_note(READ, req, iov_count);
 			ret = -EAGAIN;
+		}
 	}
 	kfree(iovec);
 out_fput:
@@ -936,14 +1005,12 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	struct kiocb *kiocb = &req->rw;
 	struct iov_iter iter;
 	struct file *file;
+	size_t iov_count;
 	ssize_t ret;
 
 	ret = io_prep_rw(req, s, force_nonblock, state);
 	if (ret)
 		return ret;
-	/* Hold on to the file for -EAGAIN */
-	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
-		return -EAGAIN;
 
 	ret = -EBADF;
 	file = kiocb->ki_filp;
@@ -957,8 +1024,17 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	if (ret)
 		goto out_fput;
 
-	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
-				iov_iter_count(&iter));
+	iov_count = iov_iter_count(&iter);
+
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) {
+		/* If ->needs_lock is true, we're already in async context. */
+		if (!s->needs_lock)
+			io_async_list_note(WRITE, req, iov_count);
+		goto out_free;
+	}
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
 		/*
 		 * Open-code file_start_write here to grab freeze protection,
@@ -976,9 +1052,11 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 		kiocb->ki_flags |= IOCB_WRITE;
 		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
 	}
+out_free:
 	kfree(iovec);
 out_fput:
-	if (unlikely(ret))
+	/* Hold on to the file for -EAGAIN */
+	if (unlikely(ret && ret != -EAGAIN))
 		io_fput(req);
 	return ret;
 }
@@ -1374,6 +1452,21 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
+static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx,
+						 const struct io_uring_sqe *sqe)
+{
+	switch (sqe->opcode) {
+	case IORING_OP_READV:
+	case IORING_OP_READ_FIXED:
+		return &ctx->pending_async[READ];
+	case IORING_OP_WRITEV:
+	case IORING_OP_WRITE_FIXED:
+		return &ctx->pending_async[WRITE];
+	default:
+		return NULL;
+	}
+}
+
 static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 {
 	u8 opcode = READ_ONCE(sqe->opcode);
@@ -1385,61 +1478,138 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct sqe_submit *s = &req->submit;
-	const struct io_uring_sqe *sqe = s->sqe;
 	struct io_ring_ctx *ctx = req->ctx;
+	struct mm_struct *cur_mm = NULL;
+	struct async_list *async_list;
+	LIST_HEAD(req_list);
 	mm_segment_t old_fs;
-	bool needs_user;
 	int ret;
 
-	 /* Ensure we clear previously set forced non-block flag */
-	req->flags &= ~REQ_F_FORCE_NONBLOCK;
-	req->rw.ki_flags &= ~IOCB_NOWAIT;
+	async_list = io_async_list_from_sqe(ctx, req->submit.sqe);
+restart:
+	do {
+		struct sqe_submit *s = &req->submit;
+		const struct io_uring_sqe *sqe = s->sqe;
+
+		/* Ensure we clear previously set forced non-block flag */
+		req->flags &= ~REQ_F_FORCE_NONBLOCK;
+		req->rw.ki_flags &= ~IOCB_NOWAIT;
 
-	s->needs_lock = true;
-	s->has_user = false;
+		ret = 0;
+		if (io_sqe_needs_user(sqe) && !cur_mm) {
+			if (!mmget_not_zero(ctx->sqo_mm)) {
+				ret = -EFAULT;
+			} else {
+				cur_mm = ctx->sqo_mm;
+				use_mm(cur_mm);
+				old_fs = get_fs();
+				set_fs(USER_DS);
+			}
+		}
+
+		if (!ret) {
+			s->has_user = cur_mm != NULL;
+			s->needs_lock = true;
+			do {
+				ret = __io_submit_sqe(ctx, req, s, false, NULL);
+				/*
+				 * We can get EAGAIN for polled IO even though
+				 * we're forcing a sync submission from here,
+				 * since we can't wait for request slots on the
+				 * block side.
+				 */
+				if (ret != -EAGAIN)
+					break;
+				cond_resched();
+			} while (1);
+		}
+		if (ret) {
+			io_cqring_add_event(ctx, sqe->user_data, ret, 0);
+			io_free_req(req);
+		}
+
+		/* async context always use a copy of the sqe */
+		kfree(sqe);
+
+		if (!async_list)
+			break;
+		if (!list_empty(&req_list)) {
+			req = list_first_entry(&req_list, struct io_kiocb,
+						list);
+			list_del(&req->list);
+			continue;
+		}
+		if (list_empty(&async_list->list))
+			break;
+
+		req = NULL;
+		spin_lock(&async_list->lock);
+		if (list_empty(&async_list->list)) {
+			spin_unlock(&async_list->lock);
+			break;
+		}
+		list_splice_init(&async_list->list, &req_list);
+		spin_unlock(&async_list->lock);
+
+		req = list_first_entry(&req_list, struct io_kiocb, list);
+		list_del(&req->list);
+	} while (req);
 
 	/*
-	 * If we're doing IO to fixed buffers, we don't need to get/set
-	 * user context
+	 * Rare case of racing with a submitter. If we find the count has
+	 * dropped to zero AND we have pending work items, then restart
+	 * the processing. This is a tiny race window.
 	 */
-	needs_user = io_sqe_needs_user(s->sqe);
-	if (needs_user) {
-		if (!mmget_not_zero(ctx->sqo_mm)) {
-			ret = -EFAULT;
-			goto err;
+	if (async_list) {
+		ret = atomic_dec_return(&async_list->cnt);
+		while (!ret && !list_empty(&async_list->list)) {
+			spin_lock(&async_list->lock);
+			atomic_inc(&async_list->cnt);
+			list_splice_init(&async_list->list, &req_list);
+			spin_unlock(&async_list->lock);
+
+			if (!list_empty(&req_list)) {
+				req = list_first_entry(&req_list,
+							struct io_kiocb, list);
+				list_del(&req->list);
+				goto restart;
+			}
+			ret = atomic_dec_return(&async_list->cnt);
 		}
-		use_mm(ctx->sqo_mm);
-		old_fs = get_fs();
-		set_fs(USER_DS);
-		s->has_user = true;
 	}
 
-	do {
-		ret = __io_submit_sqe(ctx, req, s, false, NULL);
-		/*
-		 * We can get EAGAIN for polled IO even though we're forcing
-		 * a sync submission from here, since we can't wait for
-		 * request slots on the block side.
-		 */
-		if (ret != -EAGAIN)
-			break;
-		cond_resched();
-	} while (1);
-
-	if (needs_user) {
+	if (cur_mm) {
 		set_fs(old_fs);
-		unuse_mm(ctx->sqo_mm);
-		mmput(ctx->sqo_mm);
-	}
-err:
-	if (ret) {
-		io_cqring_add_event(ctx, sqe->user_data, ret, 0);
-		io_free_req(req);
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
 	}
+}
 
-	/* async context always use a copy of the sqe */
-	kfree(sqe);
+/*
+ * See if we can piggy back onto previously submitted work, that is still
+ * running. We currently only allow this if the new request is sequential
+ * to the previous one we punted.
+ */
+static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req)
+{
+	bool ret = false;
+
+	if (!list)
+		return false;
+	if (!(req->flags & REQ_F_SEQ_PREV))
+		return false;
+	if (!atomic_read(&list->cnt))
+		return false;
+
+	ret = true;
+	spin_lock(&list->lock);
+	list_add_tail(&req->list, &list->list);
+	if (!atomic_read(&list->cnt)) {
+		list_del_init(&req->list);
+		ret = false;
+	}
+	spin_unlock(&list->lock);
+	return ret;
 }
 
 static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
@@ -1464,12 +1634,19 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 
 		sqe_copy = kmalloc(sizeof(*sqe_copy), GFP_KERNEL);
 		if (sqe_copy) {
+			struct async_list *list;
+
 			memcpy(sqe_copy, s->sqe, sizeof(*sqe_copy));
 			s->sqe = sqe_copy;
 
 			memcpy(&req->submit, s, sizeof(*s));
-			INIT_WORK(&req->work, io_sq_wq_submit_work);
-			queue_work(ctx->sqo_wq, &req->work);
+			list = io_async_list_from_sqe(ctx, s->sqe);
+			if (!io_add_to_prev_work(list, req)) {
+				if (list)
+					atomic_inc(&list->cnt);
+				INIT_WORK(&req->work, io_sq_wq_submit_work);
+				queue_work(ctx->sqo_wq, &req->work);
+			}
 			ret = 0;
 		}
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 19/19] io_uring: add io_uring_event cache hit information
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-11 19:00   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-11 19:00 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

Add a hint on whether a read was served out of the page cache, or if it
hit media. This is useful for buffered async IO; O_DIRECT reads would
never have this set (for obvious reasons).

If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.
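
On the completion side, a consumer that cares about the hint only needs to
test the new flag bit in the CQE. A minimal sketch, assuming the uapi header
from this series is available as "io_uring.h":

#include "io_uring.h"

/* Count page-cache hits vs. reads that went to the device. */
static void account_read_cqe(const struct io_uring_cqe *cqe,
			     unsigned long *cache_hits,
			     unsigned long *media_reads)
{
	if (cqe->res <= 0)
		return;			/* error or zero-length read */
	if (cqe->flags & IOCQE_FLAG_CACHEHIT)
		(*cache_hits)++;
	else
		(*media_reads)++;
}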

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 7 ++++++-
 include/uapi/linux/io_uring.h | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ce446f59f092..a4973af1c272 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -611,11 +611,16 @@ static void io_fput(struct io_kiocb *req)
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+	unsigned ev_flags = 0;
 
 	kiocb_end_write(kiocb);
 
 	io_fput(req);
-	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+
+	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
+		ev_flags = IOCQE_FLAG_CACHEHIT;
+
+	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
 	io_free_req(req);
 }
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e23408692118..24906e99fdc7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -69,6 +69,11 @@ struct io_uring_cqe {
 	__u32	flags;
 };
 
+/*
+ * io_uring_event->flags
+ */
+#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* IO did not hit media */
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-11 19:00   ` Jens Axboe
@ 2019-02-19 16:12     ` Jann Horn
  -1 siblings, 0 replies; 128+ messages in thread
From: Jann Horn @ 2019-02-19 16:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> We normally have to fget/fput for each IO we do on a file. Even with
> the batching we do, the cost of the atomic inc/dec of the file usage
> count adds up.
>
> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> for the io_uring_register(2) system call. The arguments passed in must
> be an array of __s32 holding file descriptors, and nr_args should hold
> the number of file descriptors the application wishes to pin for the
> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
> called).
>
> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> to the index in the array passed in to IORING_REGISTER_FILES.
>
> Files are automatically unregistered when the io_uring instance is torn
> down. An application need only unregister if it wishes to register a new
> set of fds.
>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
[...]
> @@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>         return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>  }
>
> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +#if defined(CONFIG_UNIX)
> +       if (ctx->ring_sock) {
> +               struct sock *sock = ctx->ring_sock->sk;
> +               struct sk_buff *skb;
> +
> +               while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
> +                       kfree_skb(skb);
> +       }
> +#else
> +       int i;
> +
> +       for (i = 0; i < ctx->nr_user_files; i++)
> +               fput(ctx->user_files[i]);
> +#endif
> +}
> +
> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +       if (!ctx->user_files)
> +               return -ENXIO;
> +
> +       __io_sqe_files_unregister(ctx);
> +       kfree(ctx->user_files);
> +       ctx->user_files = NULL;
> +       return 0;
> +}
> +
> +#if defined(CONFIG_UNIX)
> +/*
> + * Ensure the UNIX gc is aware of our file set, so we are certain that
> + * the io_uring can be safely unregistered on process exit, even if we have
> + * loops in the file referencing.
> + */

I still don't get how this is supposed to work. Quoting from an
earlier version of the patch:

|> I think the overall concept here is still broken: You're giving the
|> user_files to the GC, and I think the GC can drop their refcounts, but
|> I don't see you actually getting feedback from the GC anywhere that
|> would let the GC break your references? E.g. in io_prep_rw() you grab
|> file pointers from ctx->user_files after simply checking
|> ctx->nr_user_files, and there is no path from the GC that touches
|> those fields. As far as I can tell, the GC is just going to go through
|> unix_destruct_scm() and drop references on your files, causing
|> use-after-free.
|>
|> But the unix GC is complicated, and maybe I'm just missing something...
|
| Only when the skb is released, which is either done when the io_uring
| is torn down (and then definitely safe), or if the socket is released,
| which is again also at a safe time.

I'll try to add inline comments on my understanding of the code, maybe
you can point out where exactly we're understanding it differently...

> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
> +{
> +       struct sock *sk = ctx->ring_sock->sk;
> +       struct scm_fp_list *fpl;
> +       struct sk_buff *skb;
> +       int i;
> +
> +       fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
> +       if (!fpl)
> +               return -ENOMEM;
> +
            // here we allocate a new `skb` with ->users==1
> +       skb = alloc_skb(0, GFP_KERNEL);
> +       if (!skb) {
> +               kfree(fpl);
> +               return -ENOMEM;
> +       }
> +
> +       skb->sk = sk;
            // set the skb's destructor, invoked when ->users drops to 0;
            // destructor drops file refcounts
> +       skb->destructor = unix_destruct_scm;
> +
> +       fpl->user = get_uid(ctx->user);
> +       for (i = 0; i < nr; i++) {
                    // grab a reference to each file for the skb
> +               fpl->fp[i] = get_file(ctx->user_files[i + offset]);
> +               unix_inflight(fpl->user, fpl->fp[i]);
> +       }
> +
> +       fpl->max = fpl->count = nr;
> +       UNIXCB(skb).fp = fpl;
> +       refcount_add(skb->truesize, &sk->sk_wmem_alloc);
            // put the skb in the sk_receive_queue, still with a refcount of 1.
> +       skb_queue_head(&sk->sk_receive_queue, skb);
> +
            // drop a reference from each file; after this, only the skb
            // owns references to files; the ctx->user_files entries
            // borrow their lifetime from the skb
> +       for (i = 0; i < nr; i++)
> +               fput(fpl->fp[i]);
> +
> +       return 0;
> +}

So let's say you have a cyclic dependency where an io_uring points to
a unix domain socket, and the unix domain socket points back at the
uring. The last reference from outside the loop goes away when the
user closes the uring's fd, but the uring's busypolling kernel thread
is still running and busypolling for new submission queue entries.

The GC can then come along and run scan_inflight(), detect that
ctx->ring_sock->sk->sk_receive_queue contains a reference to a unix
domain socket, and steal the skb (unlinking it from the ring_sock and
linking it into the hitlist):

__skb_unlink(skb, &x->sk_receive_queue);
__skb_queue_tail(hitlist, skb);

And then the hitlist will be processed by __skb_queue_purge(),
dropping the refcount of the skb from 1 to 0. At that point, the unix
domain socket can be freed, and you still have a pointer to it in
ctx->user_files.
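
For concreteness, a minimal userspace construction of the cycle being
described might look like the sketch below (assuming the syscall numbers
this series adds to the x86-64 table, 425 for io_uring_setup and 427 for
io_uring_register, and the uapi header as "io_uring.h"):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include "io_uring.h"

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup	425
#define __NR_io_uring_register	427
#endif

static void make_cycle(void)
{
	struct io_uring_params p = { };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u = { };
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	int sv[2], ring_fd;

	socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
	ring_fd = syscall(__NR_io_uring_setup, 4, &p);

	/* uring -> socket: the registered file set pins sv[0] */
	syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_FILES, sv, 1);

	/* socket -> uring: park the ring fd in sv[0]'s receive queue */
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &ring_fd, sizeof(int));
	sendmsg(sv[1], &msg, 0);

	/* drop all direct references; only the unix GC can break the loop */
	close(sv[0]);
	close(sv[1]);
	close(ring_fd);
}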

> +
> +/*
> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
> + * causes regular reference counting to break down. We rely on the UNIX
> + * garbage collection to take care of this problem for us.
> + */
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +       unsigned left, total;
> +       int ret = 0;
> +
> +       total = 0;
> +       left = ctx->nr_user_files;
> +       while (left) {
> +               unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
> +               int ret;
> +
> +               ret = __io_sqe_files_scm(ctx, this_files, total);
> +               if (ret)
> +                       break;

If we bail out in the middle of translating the ->user_files here, we
have to make sure that we both destroy the already-created SKBs and
drop our references on the files we haven't dealt with yet.

> +               left -= this_files;
> +               total += this_files;
> +       }
> +
> +       return ret;
> +}
> +#else
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +       return 0;
> +}
> +#endif
> +
> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
> +                                unsigned nr_args)
> +{
> +       __s32 __user *fds = (__s32 __user *) arg;
> +       int fd, ret = 0;
> +       unsigned i;
> +
> +       if (ctx->user_files)
> +               return -EBUSY;
> +       if (!nr_args)
> +               return -EINVAL;
> +       if (nr_args > IORING_MAX_FIXED_FILES)
> +               return -EMFILE;
> +
> +       ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
> +       if (!ctx->user_files)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nr_args; i++) {
> +               ret = -EFAULT;
> +               if (copy_from_user(&fd, &fds[i], sizeof(fd)))
> +                       break;
> +
> +               ctx->user_files[i] = fget(fd);
> +
> +               ret = -EBADF;
> +               if (!ctx->user_files[i])
> +                       break;

Let's say we hit this error condition after N successful loop
iterations, on a kernel with CONFIG_UNIX. At that point, we've filled
N file pointers into ctx->user_files[], and we've incremented
ctx->nr_user_files up to N. Now we jump to the `if (ret)` branch,
which goes into io_sqe_files_unregister(); but that's going to attempt
to dequeue inflight files from ctx->ring_sock, so that's not going to
work.

> +               /*
> +                * Don't allow io_uring instances to be registered. If UNIX
> +                * isn't enabled, then this causes a reference cycle and this
> +                * instance can never get freed. If UNIX is enabled we'll
> +                * handle it just fine, but there's still no point in allowing
> +                * a ring fd as it doesn't support regular read/write anyway.
> +                */
> +               if (ctx->user_files[i]->f_op == &io_uring_fops) {
> +                       fput(ctx->user_files[i]);
> +                       break;
> +               }
> +               ctx->nr_user_files++;

I don't see anything that can set ctx->nr_user_files back down to
zero; as far as I can tell, if you repeatedly register and unregister
a set of files, ctx->nr_user_files will just grow, and since it's used
as an upper bound for array accesses, that's bad.

> +               ret = 0;
> +       }
> +
> +       if (!ret)
> +               ret = io_sqe_files_scm(ctx);
> +       if (ret)
> +               io_sqe_files_unregister(ctx);
> +
> +       return ret;
> +}

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 12/19] io_uring: add support for pre-mapped user IO buffers
  2019-02-11 19:00   ` Jens Axboe
@ 2019-02-19 19:08     ` Jann Horn
  -1 siblings, 0 replies; 128+ messages in thread
From: Jann Horn @ 2019-02-19 19:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> If we have fixed user buffers, we can map them into the kernel when we
> setup the io_uring. That avoids the need to do get_user_pages() for
> each and every IO.
>
> To utilize this feature, the application must call io_uring_register()
> after having setup an io_uring instance, passing in
> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
> an iovec array, and the nr_args should contain how many iovecs the
> application wishes to map.
>
> If successful, these buffers are now mapped into the kernel, eligible
> for IO. To use these fixed buffers, the application must use the
> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
> must point to somewhere inside the indexed buffer.
>
> The application may register buffers throughout the lifetime of the
> io_uring instance. It can call io_uring_register() with
> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
> buffers, and then register a new set. The application need not
> unregister buffers explicitly before shutting down the io_uring
> instance.
>
> It's perfectly valid to setup a larger buffer, and then sometimes only
> use parts of it for an IO. As long as the range is within the originally
> mapped region, it will work just fine.
>
> For now, buffers must not be file backed. If file backed buffers are
> passed in, the registration will fail with -1/EOPNOTSUPP. This
> restriction may be relaxed in the future.
>
> RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
> arbitrary 1G per buffer size is also imposed.
>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
[...]
>  static void io_sq_wq_submit_work(struct work_struct *work)
>  {
>         struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>         struct sqe_submit *s = &req->submit;
>         const struct io_uring_sqe *sqe = s->sqe;
>         struct io_ring_ctx *ctx = req->ctx;
> -       mm_segment_t old_fs = get_fs();
> +       mm_segment_t old_fs;
> +       bool needs_user;
>         int ret;
>
>          /* Ensure we clear previously set forced non-block flag */
>         req->flags &= ~REQ_F_FORCE_NONBLOCK;
>         req->rw.ki_flags &= ~IOCB_NOWAIT;
>
> -       if (!mmget_not_zero(ctx->sqo_mm)) {
> -               ret = -EFAULT;
> -               goto err;
> -       }
> -
> -       use_mm(ctx->sqo_mm);
> -       set_fs(USER_DS);
> -       s->has_user = true;
>         s->needs_lock = true;
> +       s->has_user = false;
> +
> +       /*
> +        * If we're doing IO to fixed buffers, we don't need to get/set
> +        * user context
> +        */
> +       needs_user = io_sqe_needs_user(s->sqe);
> +       if (needs_user) {
> +               if (!mmget_not_zero(ctx->sqo_mm)) {
> +                       ret = -EFAULT;
> +                       goto err;
> +               }
> +               use_mm(ctx->sqo_mm);
> +               old_fs = get_fs();
> +               set_fs(USER_DS);
> +               s->has_user = true;
> +       }
>
>         do {
>                 ret = __io_submit_sqe(ctx, req, s, false, NULL);
> @@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>                 cond_resched();
>         } while (1);
>
> -       set_fs(old_fs);
> -       unuse_mm(ctx->sqo_mm);
> -       mmput(ctx->sqo_mm);
> +       if (needs_user) {
> +               set_fs(old_fs);
> +               unuse_mm(ctx->sqo_mm);
> +               mmput(ctx->sqo_mm);
> +       }
>  err:
>         if (ret) {
>                 io_cqring_add_event(ctx, sqe->user_data, ret, 0);
> @@ -1308,6 +1409,197 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
>         return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
>  }
>
> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
> +{
> +       int i, j;
> +
> +       if (!ctx->user_bufs)
> +               return -ENXIO;
> +
> +       for (i = 0; i < ctx->nr_user_bufs; i++) {
> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> +
> +               for (j = 0; j < imu->nr_bvecs; j++)
> +                       put_page(imu->bvec[j].bv_page);
> +
> +               if (ctx->account_mem)
> +                       io_unaccount_mem(ctx->user, imu->nr_bvecs);
> +               kfree(imu->bvec);
> +               imu->nr_bvecs = 0;
> +       }
> +
> +       kfree(ctx->user_bufs);
> +       ctx->user_bufs = NULL;
> +       ctx->nr_user_bufs = 0;
> +       return 0;
> +}
[...]
> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> +                                 unsigned nr_args)
> +{
> +       struct vm_area_struct **vmas = NULL;
> +       struct page **pages = NULL;
> +       int i, j, got_pages = 0;
> +       int ret = -EINVAL;
> +
> +       if (ctx->user_bufs)
> +               return -EBUSY;
> +       if (!nr_args || nr_args > UIO_MAXIOV)
> +               return -EINVAL;
> +
> +       ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
> +                                       GFP_KERNEL);
> +       if (!ctx->user_bufs)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nr_args; i++) {
> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
> +               unsigned long off, start, end, ubuf;
> +               int pret, nr_pages;
> +               struct iovec iov;
> +               size_t size;
> +
> +               ret = io_copy_iov(ctx, &iov, arg, i);
> +               if (ret)
> +                       break;
> +
> +               /*
> +                * Don't impose further limits on the size and buffer
> +                * constraints here, we'll -EINVAL later when IO is
> +                * submitted if they are wrong.
> +                */
> +               ret = -EFAULT;
> +               if (!iov.iov_base || !iov.iov_len)
> +                       goto err;
> +
> +               /* arbitrary limit, but we need something */
> +               if (iov.iov_len > SZ_1G)
> +                       goto err;
> +
> +               ubuf = (unsigned long) iov.iov_base;
> +               end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +               start = ubuf >> PAGE_SHIFT;
> +               nr_pages = end - start;
> +
> +               if (ctx->account_mem) {
> +                       ret = io_account_mem(ctx->user, nr_pages);
> +                       if (ret)
> +                               goto err;
> +               }
> +
> +               ret = 0;
> +               if (!pages || nr_pages > got_pages) {

Nit: No need to check for `!pages` as long as `pages` and `got_pages`
are synchronized (which guarantees that `!pages` implies
`got_pages==0`).

> +                       kfree(vmas);
> +                       kfree(pages);
> +                       pages = kmalloc_array(nr_pages, sizeof(struct page *),
> +                                               GFP_KERNEL);
> +                       vmas = kmalloc_array(nr_pages,
> +                                       sizeof(struct vma_area_struct *),

typo: s/vma_area_struct/vm_area_struct/

> +                                       GFP_KERNEL);
> +                       if (!pages || !vmas) {
> +                               ret = -ENOMEM;
> +                               if (ctx->account_mem)
> +                                       io_unaccount_mem(ctx->user, nr_pages);
> +                               goto err;
> +                       }
> +                       got_pages = nr_pages;
> +               }
> +
> +               imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
> +                                               GFP_KERNEL);
> +               ret = -ENOMEM;
> +               if (!imu->bvec) {
> +                       if (ctx->account_mem)
> +                               io_unaccount_mem(ctx->user, nr_pages);
> +                       goto err;
> +               }
> +
> +               ret = 0;
> +               down_read(&current->mm->mmap_sem);
> +               pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
> +                                               pages, vmas);
> +               if (pret == nr_pages) {
> +                       /* don't support file backed memory */
> +                       for (j = 0; j < nr_pages; j++) {
> +                               struct vm_area_struct *vma = vmas[j];
> +
> +                               if (vma->vm_file &&
> +                                   !is_file_hugepages(vma->vm_file)) {
> +                                       ret = -EOPNOTSUPP;
> +                                       break;
> +                               }
> +                       }
> +               } else {
> +                       ret = pret < 0 ? pret : -EFAULT;
> +               }
> +               up_read(&current->mm->mmap_sem);
> +               if (ret) {
> +                       /*
> +                        * if we did partial map, or found file backed vmas,
> +                        * release any pages we did get
> +                        */
> +                       if (pret > 0) {
> +                               for (j = 0; j < pret; j++)
> +                                       put_page(pages[j]);
> +                       }
> +                       if (ctx->account_mem)
> +                               io_unaccount_mem(ctx->user, nr_pages);
> +                       goto err;
> +               }
> +
> +               off = ubuf & ~PAGE_MASK;
> +               size = iov.iov_len;
> +               for (j = 0; j < nr_pages; j++) {
> +                       size_t vec_len;
> +
> +                       vec_len = min_t(size_t, size, PAGE_SIZE - off);
> +                       imu->bvec[j].bv_page = pages[j];
> +                       imu->bvec[j].bv_len = vec_len;
> +                       imu->bvec[j].bv_offset = off;
> +                       off = 0;
> +                       size -= vec_len;
> +               }
> +               /* store original address for later verification */
> +               imu->ubuf = ubuf;
> +               imu->len = iov.iov_len;
> +               imu->nr_bvecs = nr_pages;
> +       }
> +       kfree(pages);
> +       kfree(vmas);
> +       ctx->nr_user_bufs = nr_args;
> +       return 0;
> +err:
> +       kfree(pages);
> +       kfree(vmas);
> +       io_sqe_buffer_unregister(ctx);

io_sqe_buffer_unregister() gets rid of elements up to
ctx->nr_user_bufs, but as far as I can tell, ctx->nr_user_bufs is
always zero here. I think that's going to cause a reference leak.

> +       return ret;
> +}
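
One way to address the leak noted above, sketched here on the
assumption that the loop index i (declared at the top of the function)
is still live in the error path, is to tell the unwind code how many
buffers were fully set up before the failure. This is only an
illustration of the idea, not necessarily the fix that was applied:

err:
	kfree(pages);
	kfree(vmas);
	/* buffers 0..i-1 are fully registered; let the unregister path
	 * drop their page references and accounting instead of skipping
	 * them because nr_user_bufs is still zero */
	ctx->nr_user_bufs = i;
	io_sqe_buffer_unregister(ctx);
	return ret;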

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-11 19:00   ` Jens Axboe
@ 2019-02-20 22:58     ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-20 22:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, linux-api, hch, jmoyer, avi, jannh, viro

On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> For an ITER_BVEC, we can just iterate the iov and add the pages
> to the bio directly. This requires that the caller doesn't release
> the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that.
> 
> The current two callers of bio_iov_iter_get_pages() are updated to
> check if they need to release pages on completion. This makes them
> work with bvecs that contain kernel mapped pages already.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>  fs/block_dev.c            |  5 ++--
>  fs/iomap.c                |  5 ++--
>  include/linux/blk_types.h |  1 +
>  4 files changed, 56 insertions(+), 14 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 4db1008309ed..330df572cfb8 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>  }
>  EXPORT_SYMBOL(bio_add_page);
>  
> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> +{
> +	const struct bio_vec *bv = iter->bvec;
> +	unsigned int len;
> +	size_t size;
> +
> +	len = min_t(size_t, bv->bv_len, iter->count);
> +	size = bio_add_page(bio, bv->bv_page, len,
> +				bv->bv_offset + iter->iov_offset);

iter->iov_offset needs to be subtracted from 'len'; it looks like the
following delta change[1] is required, otherwise memory corruption
can be observed when running xfstests over loop/dio.

Another interesting thing is that bio_add_page() is actually capable
of adding multiple contiguous pages; in particular, loop uses
ITER_BVEC to pass multi-page bvecs. Even though pages in loop's
ITER_BVEC may belong to user space, it looks like it is still safe
not to grab the page ref, given that the fs has already done so.

[1]
diff --git a/block/bio.c b/block/bio.c
index 3b49963676fc..df99bb3816a1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -842,7 +842,10 @@ static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned int len;
 	size_t size;
 
-	len = min_t(size_t, bv->bv_len, iter->count);
+	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
+		return -EINVAL;
+
+	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
 	size = bio_add_page(bio, bv->bv_page, len,
 				bv->bv_offset + iter->iov_offset);
 	if (size == len) {

Thanks,
Ming
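
To make the arithmetic above concrete, here is a small standalone
illustration with made-up numbers (they are not taken from the xfstests
report): with a 4096-byte bvec, a 512-byte iov_offset and 4096 bytes
left in the iterator, the unfixed expression asks bio_add_page() to add
4096 bytes starting 512 bytes into the bvec, i.e. 512 bytes past its
end, while the fixed expression stays inside it.

#include <stdio.h>

/* Made-up example values, for illustration only. */
#define BV_LEN		4096u	/* bv->bv_len */
#define IOV_OFFSET	 512u	/* iter->iov_offset */
#define ITER_COUNT	4096u	/* iter->count */

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int len_before = min_u(BV_LEN, ITER_COUNT);
	unsigned int len_after  = min_u(BV_LEN - IOV_OFFSET, ITER_COUNT);

	/* 4096 plus the 512-byte offset overruns the bvec; 3584 does not */
	printf("len before fix: %u, after fix: %u\n", len_before, len_after);
	return 0;
}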

^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCHSET v15] io_uring IO interface
  2019-02-11 19:00 ` Jens Axboe
@ 2019-02-21 12:10   ` Marek Majkowski
  -1 siblings, 0 replies; 128+ messages in thread
From: Marek Majkowski @ 2019-02-21 12:10 UTC (permalink / raw)
  To: axboe; +Cc: avi, hch, jannh, jmoyer, linux-aio, linux-api, linux-block, viro

> From: Jens Axboe <axboe@kernel.dk>
> Subject: [PATCHSET v15] io_uring IO interface
> Message-ID: <20190211190049.7888-1-axboe@kernel.dk> (raw)
>
> Some final tweaks, mostly cosmetic, but also two important fixes:
> 
> 1) Ensure that we account the skb appropriately against the socket.
>    Some network config options apparently return is an skb with
>    ->truesize != 0 when allocated with a size of 0, ensure we add
>    those as references against sock->sk_wmem_alloc. Reported by
>    Matt Mullins.

Jens,

I tried using io_uring with network sockets. It seems to be doing the
right thing. One bit is missing though: "flags" as in recv(2).

In a perfect world I would like to specify at least:
 - MSG_DONTWAIT
 - MSG_WAITALL
 - MSG_NOSIGNAL

Right now, unless I'm missing something, io_uring_sqe doesn't have a
place where we could store these. "flags" is needed for any
non-trivial network I/O.

A separate discussion is about io_uring supporting more complex
network stuff in future like 'struct msghdr', MSG_ERRQUEUE or CMSG.

Cheers,
   Marek

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-20 22:58     ` Ming Lei
@ 2019-02-21 17:45       ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-21 17:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-aio, linux-block, linux-api, hch, jmoyer, avi, jannh, viro

On 2/20/19 3:58 PM, Ming Lei wrote:
> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>> For an ITER_BVEC, we can just iterate the iov and add the pages
>> to the bio directly. This requires that the caller doesn't release
>> the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that.
>>
>> The current two callers of bio_iov_iter_get_pages() are updated to
>> check if they need to release pages on completion. This makes them
>> work with bvecs that contain kernel mapped pages already.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>  fs/block_dev.c            |  5 ++--
>>  fs/iomap.c                |  5 ++--
>>  include/linux/blk_types.h |  1 +
>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>
>> diff --git a/block/bio.c b/block/bio.c
>> index 4db1008309ed..330df572cfb8 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>  }
>>  EXPORT_SYMBOL(bio_add_page);
>>  
>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>> +{
>> +	const struct bio_vec *bv = iter->bvec;
>> +	unsigned int len;
>> +	size_t size;
>> +
>> +	len = min_t(size_t, bv->bv_len, iter->count);
>> +	size = bio_add_page(bio, bv->bv_page, len,
>> +				bv->bv_offset + iter->iov_offset);
> 
> iter->iov_offset needs to be subtracted from 'len'; it looks like the
> following delta change[1] is required, otherwise memory corruption
> can be observed when running xfstests over loop/dio.

Thanks, I folded this in.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCHSET v15] io_uring IO interface
  2019-02-21 12:10   ` Marek Majkowski
@ 2019-02-21 17:48     ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-21 17:48 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: avi, hch, jannh, jmoyer, linux-aio, linux-api, linux-block, viro

On 2/21/19 5:10 AM, Marek Majkowski wrote:
>> From: Jens Axboe <axboe@kernel.dk>
>> Subject: [PATCHSET v15] io_uring IO interface
>> Message-ID: <20190211190049.7888-1-axboe@kernel.dk> (raw)
>>
>> Some final tweaks, mostly cosmetic, but also two important fixes:
>>
>> 1) Ensure that we account the skb appropriately against the socket.
>>    Some network config options apparently return is an skb with
>>    ->truesize != 0 when allocated with a size of 0, ensure we add
>>    those as references against sock->sk_wmem_alloc. Reported by
>>    Matt Mullins.
> 
> Jens,
> 
> I tried using io_uring with network sockets. It seems to be doing the
> right thing. One bit is missing though: "flags" as in recv(2).
> 
> In a perfect world I would like to specify at least:
>  - MSG_DONTWAIT
>  - MSG_WAITALL
>  - MSG_NOSIGNAL
> 
> Right now, unless I'm missing something, io_uring_sqe doesn't have a
> place where we could store these. "flags" is needed for any
> non-trivial network I/O.

We have flags for sqes, depending on the type. You can add to the
union that already holds rw_flags/fsync_flags/poll_events? There's
also a (smaller) flags field that applies for all types, which
currently only holds the fixed file flag.

If you're talking about a per-syscall type of flag, io_uring_enter(2)
does take a flags member.
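
For illustration, a minimal sketch of what that suggestion might look
like; the member types shown are assumptions based on the description
above rather than a copy of the actual header, and msg_flags is a
hypothetical name:

	union {
		__kernel_rwf_t	rw_flags;
		__u32		fsync_flags;
		__u16		poll_events;
		__u32		msg_flags;	/* hypothetical: MSG_* flags for send/recv */
	};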

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCHSET v15] io_uring IO interface
  2019-02-21 17:48     ` Jens Axboe
@ 2019-02-22 15:01       ` Marek Majkowski
  -1 siblings, 0 replies; 128+ messages in thread
From: Marek Majkowski @ 2019-02-22 15:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Avi Kivity, hch, Jann Horn, jmoyer, linux-aio, linux-api,
	linux-block, viro

On Thu, Feb 21, 2019 at 6:48 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/21/19 5:10 AM, Marek Majkowski wrote:
> >> From: Jens Axboe <axboe@kernel.dk>
> >> Subject: [PATCHSET v15] io_uring IO interface
> >> Message-ID: <20190211190049.7888-1-axboe@kernel.dk> (raw)
> >>
> >> Some final tweaks, mostly cosmetic, but also two important fixes:
> >>
> >> 1) Ensure that we account the skb appropriately against the socket.
> >>    Some network config options apparently return is an skb with
> >>    ->truesize != 0 when allocated with a size of 0, ensure we add
> >>    those as references against sock->sk_wmem_alloc. Reported by
> >>    Matt Mullins.
> >
> > Jens,
> >
> > I tried using io_uring with network sockets. It seems to be doing the
> > right thing. One bit is missing though: "flags" as in recv(2).
> >
> > In a perfect world I would like to specify at least:
> >  - MSG_DONTWAIT
> >  - MSG_WAITALL
> >  - MSG_NOSIGNAL
> >
> > Right now, unless I'm missing something, io_uring_sqe doesn't have a
> > place where we could store these. "flags" is needed for any
> > non-trivial network I/O.
>
> We have flags for sqes, depending on the type. You can add to the
> union that already holds rw_flags/fsync_flags/poll_events? There's
> also a (smaller) flags field that applies for all types, which
> currently only holds the fixed file flag.

The "sqe->flags" right now is used by the IOSQE_FIXED_FILE which has
the same value as MSG_OOB.

Sticking recv/send flags into the "rw_flags" union perhaps could work,
barring the discussion about naming. The obvious names don't make
sense. recv_flags, send_flags or socket_flags don't sound right.

If we tried to add networking stuff to io_uring (for batching and async), then:
 - send()/recv() could work, only needs the "flags" field
 - sendmsg()/recvmsg() likewise
 - sendto()/recvfrom() require two more pointers: (struct sockaddr
*dest_addr, socklen_t addrlen)
 - sendmmsg() / recvmmsg() are perhaps irrelevant

Non-blocking stuff like socket(), setsockopt(), bind() perhaps don't
need to be considered, although they could benefit from batching.

Not sure what to think about connect() and accept(). In the
prehistoric days there seems to have been an attempt to add socket
things to libaio struct iocb. See:

https://code.woboq.org/linux/include/libaio.h.html#iocb::(anonymous)::saddr

struct iocb {
    ...
    union {
        ...
        struct io_iocb_sockaddr    saddr;
    } u;
};

Are there chances of reserving space for two pointers in io_uring_sqe,
which could be used for sendto/recvfrom/accept if we decided to add
more network support?
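
Purely as a strawman for that question (the field names below are
invented for this note and do not come from any posted patch),
reserving the two extra arguments could look something like:

struct io_uring_sqe {
    ...
    __u64    addr2;      /* e.g. userspace struct sockaddr * */
    __u64    addr2_len;  /* socklen_t value, or pointer to one for recvfrom/accept */
    ...
};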

Cheers,
    Marek

> If you're talking about a per-syscall type of flag, io_uring_enter(2)
> does take a flags member.
>
> --
> Jens Axboe
>

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 12/19] io_uring: add support for pre-mapped user IO buffers
  2019-02-19 19:08     ` Jann Horn
@ 2019-02-22 22:29       ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-22 22:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On 2/19/19 12:08 PM, Jann Horn wrote:
> On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> If we have fixed user buffers, we can map them into the kernel when we
>> setup the io_uring. That avoids the need to do get_user_pages() for
>> each and every IO.
>>
>> To utilize this feature, the application must call io_uring_register()
>> after having setup an io_uring instance, passing in
>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
>> an iovec array, and the nr_args should contain how many iovecs the
>> application wishes to map.
>>
>> If successful, these buffers are now mapped into the kernel, eligible
>> for IO. To use these fixed buffers, the application must use the
>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
>> must point to somewhere inside the indexed buffer.
>>
>> The application may register buffers throughout the lifetime of the
>> io_uring instance. It can call io_uring_register() with
>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
>> buffers, and then register a new set. The application need not
>> unregister buffers explicitly before shutting down the io_uring
>> instance.
>>
>> It's perfectly valid to setup a larger buffer, and then sometimes only
>> use parts of it for an IO. As long as the range is within the originally
>> mapped region, it will work just fine.
>>
>> For now, buffers must not be file backed. If file backed buffers are
>> passed in, the registration will fail with -1/EOPNOTSUPP. This
>> restriction may be relaxed in the future.
>>
>> RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
>> arbitrary 1G per buffer size is also imposed.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
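
As a reader aid, a minimal user-space sketch of the registration flow
the quoted commit message describes; it assumes ring_fd is an already
set up io_uring instance, buf/buf_len name an application-owned buffer,
and io_uring_register() is reachable as a syscall wrapper (for example
via liburing), with error handling elided:

	struct iovec iov = {
		.iov_base = buf,	/* not file backed */
		.iov_len  = buf_len,	/* at most 1G per the patch */
	};

	if (io_uring_register(ring_fd, IORING_REGISTER_BUFFERS, &iov, 1) < 0)
		perror("io_uring_register");

	/* IO then uses IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED with
	 * sqe->index set to the buffer index (0 here) and sqe->addr/len
	 * inside the registered range. */
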
> [...]
>>  static void io_sq_wq_submit_work(struct work_struct *work)
>>  {
>>         struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>         struct sqe_submit *s = &req->submit;
>>         const struct io_uring_sqe *sqe = s->sqe;
>>         struct io_ring_ctx *ctx = req->ctx;
>> -       mm_segment_t old_fs = get_fs();
>> +       mm_segment_t old_fs;
>> +       bool needs_user;
>>         int ret;
>>
>>          /* Ensure we clear previously set forced non-block flag */
>>         req->flags &= ~REQ_F_FORCE_NONBLOCK;
>>         req->rw.ki_flags &= ~IOCB_NOWAIT;
>>
>> -       if (!mmget_not_zero(ctx->sqo_mm)) {
>> -               ret = -EFAULT;
>> -               goto err;
>> -       }
>> -
>> -       use_mm(ctx->sqo_mm);
>> -       set_fs(USER_DS);
>> -       s->has_user = true;
>>         s->needs_lock = true;
>> +       s->has_user = false;
>> +
>> +       /*
>> +        * If we're doing IO to fixed buffers, we don't need to get/set
>> +        * user context
>> +        */
>> +       needs_user = io_sqe_needs_user(s->sqe);
>> +       if (needs_user) {
>> +               if (!mmget_not_zero(ctx->sqo_mm)) {
>> +                       ret = -EFAULT;
>> +                       goto err;
>> +               }
>> +               use_mm(ctx->sqo_mm);
>> +               old_fs = get_fs();
>> +               set_fs(USER_DS);
>> +               s->has_user = true;
>> +       }
>>
>>         do {
>>                 ret = __io_submit_sqe(ctx, req, s, false, NULL);
>> @@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                 cond_resched();
>>         } while (1);
>>
>> -       set_fs(old_fs);
>> -       unuse_mm(ctx->sqo_mm);
>> -       mmput(ctx->sqo_mm);
>> +       if (needs_user) {
>> +               set_fs(old_fs);
>> +               unuse_mm(ctx->sqo_mm);
>> +               mmput(ctx->sqo_mm);
>> +       }
>>  err:
>>         if (ret) {
>>                 io_cqring_add_event(ctx, sqe->user_data, ret, 0);
>> @@ -1308,6 +1409,197 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
>>         return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
>>  }
>>
>> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
>> +{
>> +       int i, j;
>> +
>> +       if (!ctx->user_bufs)
>> +               return -ENXIO;
>> +
>> +       for (i = 0; i < ctx->nr_user_bufs; i++) {
>> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +
>> +               for (j = 0; j < imu->nr_bvecs; j++)
>> +                       put_page(imu->bvec[j].bv_page);
>> +
>> +               if (ctx->account_mem)
>> +                       io_unaccount_mem(ctx->user, imu->nr_bvecs);
>> +               kfree(imu->bvec);
>> +               imu->nr_bvecs = 0;
>> +       }
>> +
>> +       kfree(ctx->user_bufs);
>> +       ctx->user_bufs = NULL;
>> +       ctx->nr_user_bufs = 0;
>> +       return 0;
>> +}
> [...]
>> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
>> +                                 unsigned nr_args)
>> +{
>> +       struct vm_area_struct **vmas = NULL;
>> +       struct page **pages = NULL;
>> +       int i, j, got_pages = 0;
>> +       int ret = -EINVAL;
>> +
>> +       if (ctx->user_bufs)
>> +               return -EBUSY;
>> +       if (!nr_args || nr_args > UIO_MAXIOV)
>> +               return -EINVAL;
>> +
>> +       ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
>> +                                       GFP_KERNEL);
>> +       if (!ctx->user_bufs)
>> +               return -ENOMEM;
>> +
>> +       for (i = 0; i < nr_args; i++) {
>> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +               unsigned long off, start, end, ubuf;
>> +               int pret, nr_pages;
>> +               struct iovec iov;
>> +               size_t size;
>> +
>> +               ret = io_copy_iov(ctx, &iov, arg, i);
>> +               if (ret)
>> +                       break;
>> +
>> +               /*
>> +                * Don't impose further limits on the size and buffer
>> +                * constraints here, we'll -EINVAL later when IO is
>> +                * submitted if they are wrong.
>> +                */
>> +               ret = -EFAULT;
>> +               if (!iov.iov_base || !iov.iov_len)
>> +                       goto err;
>> +
>> +               /* arbitrary limit, but we need something */
>> +               if (iov.iov_len > SZ_1G)
>> +                       goto err;
>> +
>> +               ubuf = (unsigned long) iov.iov_base;
>> +               end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> +               start = ubuf >> PAGE_SHIFT;
>> +               nr_pages = end - start;
>> +
>> +               if (ctx->account_mem) {
>> +                       ret = io_account_mem(ctx->user, nr_pages);
>> +                       if (ret)
>> +                               goto err;
>> +               }
>> +
>> +               ret = 0;
>> +               if (!pages || nr_pages > got_pages) {
> 
> Nit: No need to check for `!pages` as long as `pages` and `got_pages`
> are synchronized (which guarantees that `!pages` implies
> `got_pages==0`).

I just prefer it that way, less confusion, and past history shows this
always confuses the compiler and then we have to deal with a bogus warning.

>> +                       kfree(vmas);
>> +                       kfree(pages);
>> +                       pages = kmalloc_array(nr_pages, sizeof(struct page *),
>> +                                               GFP_KERNEL);
>> +                       vmas = kmalloc_array(nr_pages,
>> +                                       sizeof(struct vma_area_struct *),
> 
> typo: s/vma_area_struct/vm_area_struct/

Fixed, thanks.

>> +                                       GFP_KERNEL);
>> +                       if (!pages || !vmas) {
>> +                               ret = -ENOMEM;
>> +                               if (ctx->account_mem)
>> +                                       io_unaccount_mem(ctx->user, nr_pages);
>> +                               goto err;
>> +                       }
>> +                       got_pages = nr_pages;
>> +               }
>> +
>> +               imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
>> +                                               GFP_KERNEL);
>> +               ret = -ENOMEM;
>> +               if (!imu->bvec) {
>> +                       if (ctx->account_mem)
>> +                               io_unaccount_mem(ctx->user, nr_pages);
>> +                       goto err;
>> +               }
>> +
>> +               ret = 0;
>> +               down_read(&current->mm->mmap_sem);
>> +               pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
>> +                                               pages, vmas);
>> +               if (pret == nr_pages) {
>> +                       /* don't support file backed memory */
>> +                       for (j = 0; j < nr_pages; j++) {
>> +                               struct vm_area_struct *vma = vmas[j];
>> +
>> +                               if (vma->vm_file &&
>> +                                   !is_file_hugepages(vma->vm_file)) {
>> +                                       ret = -EOPNOTSUPP;
>> +                                       break;
>> +                               }
>> +                       }
>> +               } else {
>> +                       ret = pret < 0 ? pret : -EFAULT;
>> +               }
>> +               up_read(&current->mm->mmap_sem);
>> +               if (ret) {
>> +                       /*
>> +                        * if we did partial map, or found file backed vmas,
>> +                        * release any pages we did get
>> +                        */
>> +                       if (pret > 0) {
>> +                               for (j = 0; j < pret; j++)
>> +                                       put_page(pages[j]);
>> +                       }
>> +                       if (ctx->account_mem)
>> +                               io_unaccount_mem(ctx->user, nr_pages);
>> +                       goto err;
>> +               }
>> +
>> +               off = ubuf & ~PAGE_MASK;
>> +               size = iov.iov_len;
>> +               for (j = 0; j < nr_pages; j++) {
>> +                       size_t vec_len;
>> +
>> +                       vec_len = min_t(size_t, size, PAGE_SIZE - off);
>> +                       imu->bvec[j].bv_page = pages[j];
>> +                       imu->bvec[j].bv_len = vec_len;
>> +                       imu->bvec[j].bv_offset = off;
>> +                       off = 0;
>> +                       size -= vec_len;
>> +               }
>> +               /* store original address for later verification */
>> +               imu->ubuf = ubuf;
>> +               imu->len = iov.iov_len;
>> +               imu->nr_bvecs = nr_pages;
>> +       }
>> +       kfree(pages);
>> +       kfree(vmas);
>> +       ctx->nr_user_bufs = nr_args;
>> +       return 0;
>> +err:
>> +       kfree(pages);
>> +       kfree(vmas);
>> +       io_sqe_buffer_unregister(ctx);
> 
> io_sqe_buffer_unregister() gets rid of elements up to
> ctx->nr_user_bufs, but as far as I can tell, ctx->nr_user_bufs is
> always zero here. I think that's going to cause a reference leak.

Fixed, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 12/19] io_uring: add support for pre-mapped user IO buffers
@ 2019-02-22 22:29       ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-22 22:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On 2/19/19 12:08 PM, Jann Horn wrote:
> On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> If we have fixed user buffers, we can map them into the kernel when we
>> setup the io_uring. That avoids the need to do get_user_pages() for
>> each and every IO.
>>
>> To utilize this feature, the application must call io_uring_register()
>> after having setup an io_uring instance, passing in
>> IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to
>> an iovec array, and the nr_args should contain how many iovecs the
>> application wishes to map.
>>
>> If successful, these buffers are now mapped into the kernel, eligible
>> for IO. To use these fixed buffers, the application must use the
>> IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then
>> set sqe->index to the desired buffer index. sqe->addr..sqe->addr+sqe->len
>> must point to somewhere inside the indexed buffer.
>>
>> The application may register buffers throughout the lifetime of the
>> io_uring instance. It can call io_uring_register() with
>> IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of
>> buffers, and then register a new set. The application need not
>> unregister buffers explicitly before shutting down the io_uring
>> instance.
>>
>> It's perfectly valid to setup a larger buffer, and then sometimes only
>> use parts of it for an IO. As long as the range is within the originally
>> mapped region, it will work just fine.
>>
>> For now, buffers must not be file backed. If file backed buffers are
>> passed in, the registration will fail with -1/EOPNOTSUPP. This
>> restriction may be relaxed in the future.
>>
>> RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat
>> arbitrary 1G per buffer size is also imposed.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
> [...]
>>  static void io_sq_wq_submit_work(struct work_struct *work)
>>  {
>>         struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>         struct sqe_submit *s = &req->submit;
>>         const struct io_uring_sqe *sqe = s->sqe;
>>         struct io_ring_ctx *ctx = req->ctx;
>> -       mm_segment_t old_fs = get_fs();
>> +       mm_segment_t old_fs;
>> +       bool needs_user;
>>         int ret;
>>
>>          /* Ensure we clear previously set forced non-block flag */
>>         req->flags &= ~REQ_F_FORCE_NONBLOCK;
>>         req->rw.ki_flags &= ~IOCB_NOWAIT;
>>
>> -       if (!mmget_not_zero(ctx->sqo_mm)) {
>> -               ret = -EFAULT;
>> -               goto err;
>> -       }
>> -
>> -       use_mm(ctx->sqo_mm);
>> -       set_fs(USER_DS);
>> -       s->has_user = true;
>>         s->needs_lock = true;
>> +       s->has_user = false;
>> +
>> +       /*
>> +        * If we're doing IO to fixed buffers, we don't need to get/set
>> +        * user context
>> +        */
>> +       needs_user = io_sqe_needs_user(s->sqe);
>> +       if (needs_user) {
>> +               if (!mmget_not_zero(ctx->sqo_mm)) {
>> +                       ret = -EFAULT;
>> +                       goto err;
>> +               }
>> +               use_mm(ctx->sqo_mm);
>> +               old_fs = get_fs();
>> +               set_fs(USER_DS);
>> +               s->has_user = true;
>> +       }
>>
>>         do {
>>                 ret = __io_submit_sqe(ctx, req, s, false, NULL);
>> @@ -1011,9 +1110,11 @@ static void io_sq_wq_submit_work(struct work_struct *work)
>>                 cond_resched();
>>         } while (1);
>>
>> -       set_fs(old_fs);
>> -       unuse_mm(ctx->sqo_mm);
>> -       mmput(ctx->sqo_mm);
>> +       if (needs_user) {
>> +               set_fs(old_fs);
>> +               unuse_mm(ctx->sqo_mm);
>> +               mmput(ctx->sqo_mm);
>> +       }
>>  err:
>>         if (ret) {
>>                 io_cqring_add_event(ctx, sqe->user_data, ret, 0);
>> @@ -1308,6 +1409,197 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
>>         return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
>>  }
>>
>> +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
>> +{
>> +       int i, j;
>> +
>> +       if (!ctx->user_bufs)
>> +               return -ENXIO;
>> +
>> +       for (i = 0; i < ctx->nr_user_bufs; i++) {
>> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +
>> +               for (j = 0; j < imu->nr_bvecs; j++)
>> +                       put_page(imu->bvec[j].bv_page);
>> +
>> +               if (ctx->account_mem)
>> +                       io_unaccount_mem(ctx->user, imu->nr_bvecs);
>> +               kfree(imu->bvec);
>> +               imu->nr_bvecs = 0;
>> +       }
>> +
>> +       kfree(ctx->user_bufs);
>> +       ctx->user_bufs = NULL;
>> +       ctx->nr_user_bufs = 0;
>> +       return 0;
>> +}
> [...]
>> +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
>> +                                 unsigned nr_args)
>> +{
>> +       struct vm_area_struct **vmas = NULL;
>> +       struct page **pages = NULL;
>> +       int i, j, got_pages = 0;
>> +       int ret = -EINVAL;
>> +
>> +       if (ctx->user_bufs)
>> +               return -EBUSY;
>> +       if (!nr_args || nr_args > UIO_MAXIOV)
>> +               return -EINVAL;
>> +
>> +       ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf),
>> +                                       GFP_KERNEL);
>> +       if (!ctx->user_bufs)
>> +               return -ENOMEM;
>> +
>> +       for (i = 0; i < nr_args; i++) {
>> +               struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
>> +               unsigned long off, start, end, ubuf;
>> +               int pret, nr_pages;
>> +               struct iovec iov;
>> +               size_t size;
>> +
>> +               ret = io_copy_iov(ctx, &iov, arg, i);
>> +               if (ret)
>> +                       break;
>> +
>> +               /*
>> +                * Don't impose further limits on the size and buffer
>> +                * constraints here, we'll -EINVAL later when IO is
>> +                * submitted if they are wrong.
>> +                */
>> +               ret = -EFAULT;
>> +               if (!iov.iov_base || !iov.iov_len)
>> +                       goto err;
>> +
>> +               /* arbitrary limit, but we need something */
>> +               if (iov.iov_len > SZ_1G)
>> +                       goto err;
>> +
>> +               ubuf = (unsigned long) iov.iov_base;
>> +               end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT;
>> +               start = ubuf >> PAGE_SHIFT;
>> +               nr_pages = end - start;
>> +
>> +               if (ctx->account_mem) {
>> +                       ret = io_account_mem(ctx->user, nr_pages);
>> +                       if (ret)
>> +                               goto err;
>> +               }
>> +
>> +               ret = 0;
>> +               if (!pages || nr_pages > got_pages) {
> 
> Nit: No need to check for `!pages` as long as `pages` and `got_pages`
> are synchronized (which guarantees that `!pages` implies
> `got_pages==0`).

I just prefer it that way, it's less confusing. Past history shows this
pattern tends to confuse the compiler, and then we have to deal with a
bogus warning.

>> +                       kfree(vmas);
>> +                       kfree(pages);
>> +                       pages = kmalloc_array(nr_pages, sizeof(struct page *),
>> +                                               GFP_KERNEL);
>> +                       vmas = kmalloc_array(nr_pages,
>> +                                       sizeof(struct vma_area_struct *),
> 
> typo: s/vma_area_struct/vm_area_struct/

Fixed, thanks.

>> +                                       GFP_KERNEL);
>> +                       if (!pages || !vmas) {
>> +                               ret = -ENOMEM;
>> +                               if (ctx->account_mem)
>> +                                       io_unaccount_mem(ctx->user, nr_pages);
>> +                               goto err;
>> +                       }
>> +                       got_pages = nr_pages;
>> +               }
>> +
>> +               imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec),
>> +                                               GFP_KERNEL);
>> +               ret = -ENOMEM;
>> +               if (!imu->bvec) {
>> +                       if (ctx->account_mem)
>> +                               io_unaccount_mem(ctx->user, nr_pages);
>> +                       goto err;
>> +               }
>> +
>> +               ret = 0;
>> +               down_read(&current->mm->mmap_sem);
>> +               pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE,
>> +                                               pages, vmas);
>> +               if (pret == nr_pages) {
>> +                       /* don't support file backed memory */
>> +                       for (j = 0; j < nr_pages; j++) {
>> +                               struct vm_area_struct *vma = vmas[j];
>> +
>> +                               if (vma->vm_file &&
>> +                                   !is_file_hugepages(vma->vm_file)) {
>> +                                       ret = -EOPNOTSUPP;
>> +                                       break;
>> +                               }
>> +                       }
>> +               } else {
>> +                       ret = pret < 0 ? pret : -EFAULT;
>> +               }
>> +               up_read(&current->mm->mmap_sem);
>> +               if (ret) {
>> +                       /*
>> +                        * if we did partial map, or found file backed vmas,
>> +                        * release any pages we did get
>> +                        */
>> +                       if (pret > 0) {
>> +                               for (j = 0; j < pret; j++)
>> +                                       put_page(pages[j]);
>> +                       }
>> +                       if (ctx->account_mem)
>> +                               io_unaccount_mem(ctx->user, nr_pages);
>> +                       goto err;
>> +               }
>> +
>> +               off = ubuf & ~PAGE_MASK;
>> +               size = iov.iov_len;
>> +               for (j = 0; j < nr_pages; j++) {
>> +                       size_t vec_len;
>> +
>> +                       vec_len = min_t(size_t, size, PAGE_SIZE - off);
>> +                       imu->bvec[j].bv_page = pages[j];
>> +                       imu->bvec[j].bv_len = vec_len;
>> +                       imu->bvec[j].bv_offset = off;
>> +                       off = 0;
>> +                       size -= vec_len;
>> +               }
>> +               /* store original address for later verification */
>> +               imu->ubuf = ubuf;
>> +               imu->len = iov.iov_len;
>> +               imu->nr_bvecs = nr_pages;
>> +       }
>> +       kfree(pages);
>> +       kfree(vmas);
>> +       ctx->nr_user_bufs = nr_args;
>> +       return 0;
>> +err:
>> +       kfree(pages);
>> +       kfree(vmas);
>> +       io_sqe_buffer_unregister(ctx);
> 
> io_sqe_buffer_unregister() gets rid of elements up to
> ctx->nr_user_bufs, but as far as I can tell, ctx->nr_user_bufs is
> always zero here. I think that's going to cause a reference leak.

Fixed, thanks.
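
For reference, the userspace side of what the code above implements boils
down to handing io_uring_register(2) an array of iovecs. Below is a minimal
sketch of that, not taken from this series: the IORING_REGISTER_BUFFERS
opcode value and the syscall number are assumptions used as placeholders.

/*
 * Minimal sketch: register a set of fixed buffers from userspace.
 * IORING_REGISTER_BUFFERS and __NR_io_uring_register are assumed values,
 * used here only as placeholders.
 */
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register  427     /* assumption */
#endif
#define IORING_REGISTER_BUFFERS 0       /* assumption */

static int register_fixed_buffers(int ring_fd, void **bufs, size_t *lens,
                                  unsigned int nr)
{
        struct iovec *iovs;
        unsigned int i;
        int ret;

        iovs = calloc(nr, sizeof(*iovs));
        if (!iovs)
                return -1;
        for (i = 0; i < nr; i++) {
                iovs[i].iov_base = bufs[i];     /* must be non-NULL */
                iovs[i].iov_len = lens[i];      /* capped at 1G per buffer */
        }
        /* pins the pages until the ring dies or the buffers are unregistered */
        ret = syscall(__NR_io_uring_register, ring_fd,
                      IORING_REGISTER_BUFFERS, iovs, nr);
        free(iovs);
        return ret;
}

Each iovec then becomes an io_mapped_ubuf on the kernel side, with its pages
pinned via get_user_pages_longterm() as in the code being reviewed above.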

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-19 16:12     ` Jann Horn
@ 2019-02-22 22:29       ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-22 22:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On 2/19/19 9:12 AM, Jann Horn wrote:
> On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> We normally have to fget/fput for each IO we do on a file. Even with
>> the batching we do, the cost of the atomic inc/dec of the file usage
>> count adds up.
>>
>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>> for the io_uring_register(2) system call. The arguments passed in must
>> be an array of __s32 holding file descriptors, and nr_args should hold
>> the number of file descriptors the application wishes to pin for the
>> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
>> called).
>>
>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>> to the index in the array passed in to IORING_REGISTER_FILES.
>>
>> Files are automatically unregistered when the io_uring instance is torn
>> down. An application need only unregister if it wishes to register a new
>> set of fds.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
> [...]
>> @@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>         return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>>  }
>>
>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +#if defined(CONFIG_UNIX)
>> +       if (ctx->ring_sock) {
>> +               struct sock *sock = ctx->ring_sock->sk;
>> +               struct sk_buff *skb;
>> +
>> +               while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
>> +                       kfree_skb(skb);
>> +       }
>> +#else
>> +       int i;
>> +
>> +       for (i = 0; i < ctx->nr_user_files; i++)
>> +               fput(ctx->user_files[i]);
>> +#endif
>> +}
>> +
>> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +       if (!ctx->user_files)
>> +               return -ENXIO;
>> +
>> +       __io_sqe_files_unregister(ctx);
>> +       kfree(ctx->user_files);
>> +       ctx->user_files = NULL;
>> +       return 0;
>> +}
>> +
>> +#if defined(CONFIG_UNIX)
>> +/*
>> + * Ensure the UNIX gc is aware of our file set, so we are certain that
>> + * the io_uring can be safely unregistered on process exit, even if we have
>> + * loops in the file referencing.
>> + */
> 
> I still don't get how this is supposed to work. Quoting from an
> earlier version of the patch:
> 
> |> I think the overall concept here is still broken: You're giving the
> |> user_files to the GC, and I think the GC can drop their refcounts, but
> |> I don't see you actually getting feedback from the GC anywhere that
> |> would let the GC break your references? E.g. in io_prep_rw() you grab
> |> file pointers from ctx->user_files after simply checking
> |> ctx->nr_user_files, and there is no path from the GC that touches
> |> those fields. As far as I can tell, the GC is just going to go through
> |> unix_destruct_scm() and drop references on your files, causing
> |> use-after-free.
> |>
> |> But the unix GC is complicated, and maybe I'm just missing something...
> |
> | Only when the skb is released, which is either done when the io_uring
> | is torn down (and then definitely safe), or if the socket is released,
> | which is again also at a safe time.
> 
> I'll try to add inline comments on my understanding of the code, maybe
> you can point out where exactly we're understanding it differently...
> 
>> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
>> +{
>> +       struct sock *sk = ctx->ring_sock->sk;
>> +       struct scm_fp_list *fpl;
>> +       struct sk_buff *skb;
>> +       int i;
>> +
>> +       fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
>> +       if (!fpl)
>> +               return -ENOMEM;
>> +
>             // here we allocate a new `skb` with ->users==1
>> +       skb = alloc_skb(0, GFP_KERNEL);
>> +       if (!skb) {
>> +               kfree(fpl);
>> +               return -ENOMEM;
>> +       }
>> +
>> +       skb->sk = sk;
>             // set the skb's destructor, invoked when ->users drops to 0;
>             // destructor drops file refcounts
>> +       skb->destructor = unix_destruct_scm;
>> +
>> +       fpl->user = get_uid(ctx->user);
>> +       for (i = 0; i < nr; i++) {
>                     // grab a reference to each file for the skb
>> +               fpl->fp[i] = get_file(ctx->user_files[i + offset]);
>> +               unix_inflight(fpl->user, fpl->fp[i]);
>> +       }
>> +
>> +       fpl->max = fpl->count = nr;
>> +       UNIXCB(skb).fp = fpl;
>> +       refcount_add(skb->truesize, &sk->sk_wmem_alloc);
>             // put the skb in the sk_receive_queue, still with a refcount of 1.
>> +       skb_queue_head(&sk->sk_receive_queue, skb);
>> +
>             // drop a reference from each file; after this, only the
> skb owns references to files;
>             // the ctx->user_files entries borrow their lifetime from the skb
>> +       for (i = 0; i < nr; i++)
>> +               fput(fpl->fp[i]);
>> +
>> +       return 0;
>> +}
> 
> So let's say you have a cyclic dependency where an io_uring points to
> a unix domain socket, and the unix domain socket points back at the
> uring. The last reference from outside the loop goes away when the
> user closes the uring's fd, but the uring's busypolling kernel thread
> is still running and busypolling for new submission queue entries.
> 
> The GC can then come along and run scan_inflight(), detect that
> ctx->ring_sock->sk->sk_receive_queue contains a reference to a unix
> domain socket, and steal the skb (unlinking it from the ring_sock and
> linking it into the hitlist):
> 
> __skb_unlink(skb, &x->sk_receive_queue);
> __skb_queue_tail(hitlist, skb);
> 
> And then the hitlist will be processed by __skb_queue_purge(),
> dropping the refcount of the skb from 1 to 0. At that point, the unix
> domain socket can be freed, and you still have a pointer to it in
> ctx->user_files.

I've fixed this for the sq offload case.
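
For the record, the cycle in question is easy to build from userspace: the
ring pins the sockets through the registered-file skb, and a socket's receive
queue pins the ring through an in-flight SCM_RIGHTS message. A rough sketch
follows; the IORING_REGISTER_FILES value and the syscall number are assumed
placeholders, not taken from this series.

#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register  427     /* assumption */
#endif
#define IORING_REGISTER_FILES   2       /* assumption */

static void make_cycle(int ring_fd)
{
        int sv[2];
        char dummy = 0;
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;
        } u;
        struct msghdr msg;
        struct cmsghdr *cmsg;

        socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);

        /* the ring now pins both sockets via the skb on ring_sock */
        syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_FILES, sv, 2);

        /* ...and the socket pair pins the ring via an in-flight fd */
        memset(&msg, 0, sizeof(msg));
        memset(u.buf, 0, sizeof(u.buf));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = u.buf;
        msg.msg_controllen = sizeof(u.buf);
        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &ring_fd, sizeof(int));
        sendmsg(sv[0], &msg, 0);

        /* drop all direct references; only the cycle keeps the objects alive */
        close(sv[0]);
        close(sv[1]);
        close(ring_fd);
}

Once the last direct reference is gone, only the unix GC can break the loop,
which is why an skb stolen by scan_inflight() must not leave ctx->user_files
pointing at freed files.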

>> +
>> +/*
>> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
>> + * causes regular reference counting to break down. We rely on the UNIX
>> + * garbage collection to take care of this problem for us.
>> + */
>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>> +{
>> +       unsigned left, total;
>> +       int ret = 0;
>> +
>> +       total = 0;
>> +       left = ctx->nr_user_files;
>> +       while (left) {
>> +               unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
>> +               int ret;
>> +
>> +               ret = __io_sqe_files_scm(ctx, this_files, total);
>> +               if (ret)
>> +                       break;
> 
> If we bail out in the middle of translating the ->user_files here, we
> have to make sure that we both destroy the already-created SKBs and
> drop our references on the files we haven't dealt with yet.

Good catch, fixed.

>> +               left -= this_files;
>> +               total += this_files;
>> +       }
>> +
>> +       return ret;
>> +}
>> +#else
>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>> +{
>> +       return 0;
>> +}
>> +#endif
>> +
>> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>> +                                unsigned nr_args)
>> +{
>> +       __s32 __user *fds = (__s32 __user *) arg;
>> +       int fd, ret = 0;
>> +       unsigned i;
>> +
>> +       if (ctx->user_files)
>> +               return -EBUSY;
>> +       if (!nr_args)
>> +               return -EINVAL;
>> +       if (nr_args > IORING_MAX_FIXED_FILES)
>> +               return -EMFILE;
>> +
>> +       ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
>> +       if (!ctx->user_files)
>> +               return -ENOMEM;
>> +
>> +       for (i = 0; i < nr_args; i++) {
>> +               ret = -EFAULT;
>> +               if (copy_from_user(&fd, &fds[i], sizeof(fd)))
>> +                       break;
>> +
>> +               ctx->user_files[i] = fget(fd);
>> +
>> +               ret = -EBADF;
>> +               if (!ctx->user_files[i])
>> +                       break;
> 
> Let's say we hit this error condition after N successful loop
> iterations, on a kernel with CONFIG_UNIX. At that point, we've filled
> N file pointers into ctx->user_files[], and we've incremented
> ctx->nr_user_files up to N. Now we jump to the `if (ret)` branch,
> which goes into io_sqe_files_unregister(); but that's going to attempt
> to dequeue inflight files from ctx->ring_sock, so that's not going to
> work.

Fixed, thanks.

>> +               /*
>> +                * Don't allow io_uring instances to be registered. If UNIX
>> +                * isn't enabled, then this causes a reference cycle and this
>> +                * instance can never get freed. If UNIX is enabled we'll
>> +                * handle it just fine, but there's still no point in allowing
>> +                * a ring fd as it doesn't support regular read/write anyway.
>> +                */
>> +               if (ctx->user_files[i]->f_op == &io_uring_fops) {
>> +                       fput(ctx->user_files[i]);
>> +                       break;
>> +               }
>> +               ctx->nr_user_files++;
> 
> I don't see anything that can set ctx->nr_user_files back down to
> zero; as far as I can tell, if you repeatedly register and unregister
> a set of files, ctx->nr_user_files will just grow, and since it's used
> as an upper bound for array accesses, that's bad.

Fixed that one earlier.
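
Putting the pieces above together, the application-side flow is: register the
fd array once, then reference entries by index with IOSQE_FIXED_FILE set in
sqe->flags. A minimal sketch, with the register opcode values and the syscall
number treated as placeholder assumptions:

#include <linux/types.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register  427     /* assumption */
#endif
#define IORING_REGISTER_FILES   2       /* assumption */
#define IORING_UNREGISTER_FILES 3       /* assumption */

/* Pin a set of fds for the lifetime of the ring (or until unregister). */
static int register_files(int ring_fd, const __s32 *fds, unsigned int nr)
{
        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_REGISTER_FILES, fds, nr);
}

/* Drop the registered set, e.g. before registering a new one. */
static int unregister_files(int ring_fd)
{
        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_UNREGISTER_FILES, NULL, 0);
}

Submissions against the registered set then put the array index in sqe->fd
and set IOSQE_FIXED_FILE in sqe->flags instead of passing a real descriptor.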

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCHSET v15] io_uring IO interface
  2019-02-22 15:01       ` Marek Majkowski
@ 2019-02-22 22:32         ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-22 22:32 UTC (permalink / raw)
  To: Marek Majkowski
  Cc: Avi Kivity, hch, Jann Horn, jmoyer, linux-aio, linux-api,
	linux-block, viro

On 2/22/19 8:01 AM, Marek Majkowski wrote:
> On Thu, Feb 21, 2019 at 6:48 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/21/19 5:10 AM, Marek Majkowski wrote:
>>>> From: Jens Axboe <axboe@kernel.dk>
>>>> Subject: [PATCHSET v15] io_uring IO interface
>>>> Message-ID: <20190211190049.7888-1-axboe@kernel.dk> (raw)
>>>>
>>>> Some final tweaks, mostly cosmetic, but also two important fixes:
>>>>
>>>> 1) Ensure that we account the skb appropriately against the socket.
>>>>    Some network config options apparently return is an skb with
>>>>    ->truesize != 0 when allocated with a size of 0, ensure we add
>>>>    those as references against sock->sk_wmem_alloc. Reported by
>>>>    Matt Mullins.
>>>
>>> Jens,
>>>
>>> I tried using io_uring with network sockets. It seems to be doing the
>>> right thing. One bit is missing though: "flags" as in recv(2).
>>>
>>> In a perfect world I would like to specify at least:
>>>  - MSG_DONTWAIT
>>>  - MSG_WAITALL
>>>  - MSG_NOSIGNAL
>>>
>>> Right now, unless I'm missing something, io_uring_sqe doesn't have a
>>> place where we could store these. "flags" is needed for any
>>> non-trivial network I/O.
>>
>> We have flags for sqes, depending on the type. You can add to the
>> union that already holds rw_flags/fsync_flags/poll_events? There's
>> also a (smaller) flags field that applies for all types, which
>> currently only holds the fixed file flag.
> 
> The "sqe->flags" right now is used by the IOSQE_FIXED_FILE which has
> the same value as MSG_OOB.
> 
> Sticking recv/send flags into the "rw_flags" union perhaps could work,
> barring the discussion about naming. The obvious names don't make
> sense. recv_flags, send_flags or socket_flags don't sound right.
> 
> If we tried to add networking stuff to io_uring (for batching and async), then:
>  - send()/recv() could work, only needs the "flags" field
>  - sendmsg()/recvmsg() likewise
>  - sendto()/recvfrom() require two more pointers: (struct sockaddr
> *dest_addr, socklen_t addrlen)
>  - sendmmsg() / recvmmsg() are perhaps irrelevant
> 
> Non-blocking stuff like socket(), setsockopt(), bind() perhaps doesn't
> need to be considered, although it could benefit from batching.

If we just do separate opcodes for them, then there's 32 bits of flag
space for each one. That should be more than adequate.

> Not sure what to think about connect() and accept(). In the
> prehistoric days there seems to have been an attempt to add socket
> things to libaio struct iocb. See:
> 
> https://code.woboq.org/linux/include/libaio.h.html#iocb::(anonymous)::saddr
> 
> struct iocb {
>     ...
>     union {
>         ...
>         struct io_iocb_sockaddr    saddr;
>     } u;
> };
> 
> Is there any chance of reserving space for two pointers in io_uring_sqe,
> which could be used for sendto/recvfrom/accept if we decided to add
> more network support?

There is already space for that. We have 3 x 64 bits at the end of the
sqe; 16 of those are used for the fixed buffers, which networking
probably wants to support as well. But that still leaves 176 bits of
space for opcode-specific data.
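
To make that concrete, here is a purely hypothetical sketch of what a
send/recv-style sqe could look like if networking opcodes were added; none of
these opcode or field names exist in this series, and the 32-bit per-opcode
flags slot is the union mentioned earlier.

#include <linux/types.h>

/* Hypothetical layout sketch only; field and opcode names are illustrative. */
struct io_uring_sqe_net_sketch {
        __u8    opcode;         /* e.g. a hypothetical IORING_OP_SENDMSG */
        __u8    flags;          /* IOSQE_* flags, e.g. IOSQE_FIXED_FILE */
        __u16   ioprio;
        __s32   fd;             /* socket fd, or index into the fixed file set */
        __u64   off;            /* unused for sockets */
        __u64   addr;           /* pointer to a struct msghdr */
        __u32   len;            /* number of msghdr entries, typically 1 */
        __u32   msg_flags;      /* MSG_DONTWAIT, MSG_WAITALL, MSG_NOSIGNAL, ... */
        __u64   user_data;
        __u64   __pad2[3];      /* remaining opcode-specific space */
};

An accept() or recvfrom() variant could borrow two of those trailing 64-bit
words for the sockaddr pointer and length, which is the reservation being
asked about above.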

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-21 17:45       ` Jens Axboe
@ 2019-02-26  3:46         ` Eric Biggers
  -1 siblings, 0 replies; 128+ messages in thread
From: Eric Biggers @ 2019-02-26  3:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, linux-aio, linux-block, linux-api, hch, jmoyer, avi,
	jannh, viro

Hi Jens,

On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> On 2/20/19 3:58 PM, Ming Lei wrote:
> > On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >> For an ITER_BVEC, we can just iterate the iov and add the pages
> >> to the bio directly. This requires that the caller doesn't releases
> >> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>
> >> The current two callers of bio_iov_iter_get_pages() are updated to
> >> check if they need to release pages on completion. This makes them
> >> work with bvecs that contain kernel mapped pages already.
> >>
> >> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >> ---
> >>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>  fs/block_dev.c            |  5 ++--
> >>  fs/iomap.c                |  5 ++--
> >>  include/linux/blk_types.h |  1 +
> >>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/block/bio.c b/block/bio.c
> >> index 4db1008309ed..330df572cfb8 100644
> >> --- a/block/bio.c
> >> +++ b/block/bio.c
> >> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>  }
> >>  EXPORT_SYMBOL(bio_add_page);
> >>  
> >> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >> +{
> >> +	const struct bio_vec *bv = iter->bvec;
> >> +	unsigned int len;
> >> +	size_t size;
> >> +
> >> +	len = min_t(size_t, bv->bv_len, iter->count);
> >> +	size = bio_add_page(bio, bv->bv_page, len,
> >> +				bv->bv_offset + iter->iov_offset);
> > 
> > iter->iov_offset needs to be subtracted from 'len', looks
> > the following delta change[1] is required, otherwise memory corruption
> > can be observed when running xfstests over loop/dio.
> 
> Thanks, I folded this in.
> 
> -- 
> Jens Axboe
> 

syzkaller started hitting a crash on linux-next starting with this commit, and
it still occurs even with your latest version that has Ming's fix folded in.
Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
Sun Feb 24 08:20:53 2019 -0700.

Reproducer:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/loop.h>
#include <sys/ioctl.h>
#include <sys/sendfile.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        int memfd, loopfd;

        memfd = syscall(__NR_memfd_create, "foo", 0);

        pwrite(memfd, "\xa8", 1, 4096);

        loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);

        ioctl(loopfd, LOOP_SET_FD, memfd);

        sendfile(loopfd, loopfd, NULL, 1000000);
}


Crash:

page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x100000000000000()
raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:546!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 173 Comm: syz_mm Not tainted 5.0.0-rc6-00007-ga566653ab5ab8 #22
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014
RIP: 0010:put_page_testzero include/linux/mm.h:546 [inline]
RIP: 0010:put_page include/linux/mm.h:992 [inline]
RIP: 0010:generic_pipe_buf_release+0x37/0x40 fs/pipe.c:225
Code: 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 74 0f f0 ff 4f 34 74 02 5d c3 e8 c7 1b fa ff 5d c3 48 c7 c6 60 aa b1 81 e8 59 25 fc ff <0f> 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 e8 a0
RSP: 0018:ffffc90000783cb0 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffff88807c358800 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88807fc95420
RBP: ffffc90000783cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000001000 R14: 0000000000000000 R15: ffff88807c0b6e00
FS:  00007fd858adb240(0000) GS:ffff88807fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dc13859000 CR3: 000000007a96b000 CR4: 00000000003406e0
Call Trace:
 pipe_buf_release include/linux/pipe_fs_i.h:136 [inline]
 iter_file_splice_write+0x2df/0x3f0 fs/splice.c:763
 do_splice_from fs/splice.c:851 [inline]
 direct_splice_actor+0x31/0x40 fs/splice.c:1023
 splice_direct_to_actor+0xff/0x240 fs/splice.c:978
 do_splice_direct+0x92/0xc0 fs/splice.c:1066
 do_sendfile+0x1be/0x390 fs/read_write.c:1436
 __do_sys_sendfile64 fs/read_write.c:1497 [inline]
 __se_sys_sendfile64+0xa6/0xc0 fs/read_write.c:1483
 __x64_sys_sendfile64+0x19/0x20 fs/read_write.c:1483
 do_syscall_64+0x4a/0x180 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fd858bd224e
Code: 89 ce 5b e9 b4 fd ff ff 0f 1f 40 00 31 c0 5b c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e2 cb 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fffc517d148 EFLAGS: 00000206 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fd858bd224e
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000004
RBP: 0000000000000003 R08: 00007fd858ca0be0 R09: 00007fffc517d240
R10: 00000000000f4240 R11: 0000000000000206 R12: 000055dc13858100
R13: 00007fffc517d240 R14: 0000000000000000 R15: 0000000000000000
---[ end trace 1d878656972e4a26 ]---
RIP: 0010:put_page_testzero include/linux/mm.h:546 [inline]
RIP: 0010:put_page include/linux/mm.h:992 [inline]
RIP: 0010:generic_pipe_buf_release+0x37/0x40 fs/pipe.c:225
Code: 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 74 0f f0 ff 4f 34 74 02 5d c3 e8 c7 1b fa ff 5d c3 48 c7 c6 60 aa b1 81 e8 59 25 fc ff <0f> 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 56 41 55 41 54 53 e8 a0
RSP: 0018:ffffc90000783cb0 EFLAGS: 00010246
RAX: 000000000000003e RBX: ffff88807c358800 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88807fc95420
RBP: ffffc90000783cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000001000
R13: 0000000000001000 R14: 0000000000000000 R15: ffff88807c0b6e00
FS:  00007fd858adb240(0000) GS:ffff88807fc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dc13859000 CR3: 000000007a96b000 CR4: 00000000003406e0

- Eric

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-26  3:46         ` Eric Biggers
@ 2019-02-26  4:34           ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-26  4:34 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ming Lei, linux-aio, linux-block, linux-api, hch, jmoyer, avi,
	jannh, viro

On 2/25/19 8:46 PM, Eric Biggers wrote:
> Hi Jens,
> 
> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>> to the bio directly. This requires that the caller doesn't releases
>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>
>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>> check if they need to release pages on completion. This makes them
>>>> work with bvecs that contain kernel mapped pages already.
>>>>
>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>  fs/block_dev.c            |  5 ++--
>>>>  fs/iomap.c                |  5 ++--
>>>>  include/linux/blk_types.h |  1 +
>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 4db1008309ed..330df572cfb8 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>  }
>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>  
>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>> +{
>>>> +	const struct bio_vec *bv = iter->bvec;
>>>> +	unsigned int len;
>>>> +	size_t size;
>>>> +
>>>> +	len = min_t(size_t, bv->bv_len, iter->count);
>>>> +	size = bio_add_page(bio, bv->bv_page, len,
>>>> +				bv->bv_offset + iter->iov_offset);
>>>
>>> iter->iov_offset needs to be subtracted from 'len', looks
>>> the following delta change[1] is required, otherwise memory corruption
>>> can be observed when running xfstests over loop/dio.
>>
>> Thanks, I folded this in.
>>
>> -- 
>> Jens Axboe
>>
> 
> syzkaller started hitting a crash on linux-next starting with this commit, and
> it still occurs even with your latest version that has Ming's fix folded in.
> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> Sun Feb 24 08:20:53 2019 -0700.
> 
> Reproducer:
> 
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <linux/loop.h>
> #include <sys/ioctl.h>
> #include <sys/sendfile.h>
> #include <sys/syscall.h>
> #include <unistd.h>
> 
> int main(void)
> {
>         int memfd, loopfd;
> 
>         memfd = syscall(__NR_memfd_create, "foo", 0);
> 
>         pwrite(memfd, "\xa8", 1, 4096);
> 
>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> 
>         ioctl(loopfd, LOOP_SET_FD, memfd);
> 
>         sendfile(loopfd, loopfd, NULL, 1000000);
> }
> 
> 
> Crash:
> 
> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> flags: 0x100000000000000()
> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> raw: 0000000000000000 0000000000000000 00000000ffffffff
> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)

I see what this is; I'll cut a fix for this tomorrow.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-26  4:34           ` Jens Axboe
@ 2019-02-26 15:54             ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-26 15:54 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Ming Lei, linux-aio, linux-block, linux-api, hch, jmoyer, avi,
	jannh, viro

On 2/25/19 9:34 PM, Jens Axboe wrote:
> On 2/25/19 8:46 PM, Eric Biggers wrote:
>> Hi Jens,
>>
>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>
>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>> check if they need to release pages on completion. This makes them
>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>
>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> ---
>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>  fs/block_dev.c            |  5 ++--
>>>>>  fs/iomap.c                |  5 ++--
>>>>>  include/linux/blk_types.h |  1 +
>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>> --- a/block/bio.c
>>>>> +++ b/block/bio.c
>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>  }
>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>  
>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>> +{
>>>>> +	const struct bio_vec *bv = iter->bvec;
>>>>> +	unsigned int len;
>>>>> +	size_t size;
>>>>> +
>>>>> +	len = min_t(size_t, bv->bv_len, iter->count);
>>>>> +	size = bio_add_page(bio, bv->bv_page, len,
>>>>> +				bv->bv_offset + iter->iov_offset);
>>>>
>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>> the following delta change[1] is required, otherwise memory corruption
>>>> can be observed when running xfstests over loop/dio.
>>>
>>> Thanks, I folded this in.
>>>
>>> -- 
>>> Jens Axboe
>>>
>>
>> syzkaller started hitting a crash on linux-next starting with this commit, and
>> it still occurs even with your latest version that has Ming's fix folded in.
>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>> Sun Feb 24 08:20:53 2019 -0700.
>>
>> Reproducer:
>>
>> #define _GNU_SOURCE
>> #include <fcntl.h>
>> #include <linux/loop.h>
>> #include <sys/ioctl.h>
>> #include <sys/sendfile.h>
>> #include <sys/syscall.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>>         int memfd, loopfd;
>>
>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>
>>         pwrite(memfd, "\xa8", 1, 4096);
>>
>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>
>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>
>>         sendfile(loopfd, loopfd, NULL, 1000000);
>> }
>>
>>
>> Crash:
>>
>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>> flags: 0x100000000000000()
>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> 
> I see what this is, I'll cut a fix for this tomorrow.

Folded in a fix for this, it's in my current io_uring branch and my for-next
branch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-26 15:54             ` Jens Axboe
@ 2019-02-27  1:21               ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27  1:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Eric Biggers, Ming Lei, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/25/19 9:34 PM, Jens Axboe wrote:
> > On 2/25/19 8:46 PM, Eric Biggers wrote:
> >> Hi Jens,
> >>
> >> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>
> >>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>> check if they need to release pages on completion. This makes them
> >>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>
> >>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>> ---
> >>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>  fs/block_dev.c            |  5 ++--
> >>>>>  fs/iomap.c                |  5 ++--
> >>>>>  include/linux/blk_types.h |  1 +
> >>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>
> >>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>> --- a/block/bio.c
> >>>>> +++ b/block/bio.c
> >>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>  }
> >>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>
> >>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>> +{
> >>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>> + unsigned int len;
> >>>>> + size_t size;
> >>>>> +
> >>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>
> >>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>> the following delta change[1] is required, otherwise memory corruption
> >>>> can be observed when running xfstests over loop/dio.
> >>>
> >>> Thanks, I folded this in.
> >>>
> >>> --
> >>> Jens Axboe
> >>>
> >>
> >> syzkaller started hitting a crash on linux-next starting with this commit, and
> >> it still occurs even with your latest version that has Ming's fix folded in.
> >> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >> Sun Feb 24 08:20:53 2019 -0700.
> >>
> >> Reproducer:
> >>
> >> #define _GNU_SOURCE
> >> #include <fcntl.h>
> >> #include <linux/loop.h>
> >> #include <sys/ioctl.h>
> >> #include <sys/sendfile.h>
> >> #include <sys/syscall.h>
> >> #include <unistd.h>
> >>
> >> int main(void)
> >> {
> >>         int memfd, loopfd;
> >>
> >>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>
> >>         pwrite(memfd, "\xa8", 1, 4096);
> >>
> >>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>
> >>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>
> >>         sendfile(loopfd, loopfd, NULL, 1000000);
> >> }
> >>
> >>
> >> Crash:
> >>
> >> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >> flags: 0x100000000000000()
> >> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >
> > I see what this is, I'll cut a fix for this tomorrow.
>
> Folded in a fix for this, it's in my current io_uring branch and my for-next
> branch.

Hi Jens,

I saw the following change is added:

+ if (size == len) {
+ /*
+ * For the normal O_DIRECT case, we could skip grabbing this
+ * reference and then not have to put them again when IO
+ * completes. But this breaks some in-kernel users, like
+ * splicing to/from a loop device, where we release the pipe
+ * pages unconditionally. If we can fix that case, we can
+ * get rid of the get here and the need to call
+ * bio_release_pages() at IO completion time.
+ */
+ get_page(bv->bv_page);

Now the 'bv' may point to more than one page, so the following one may be
needed:

int i;
struct bvec_iter_all iter_all;
struct bio_vec *tmp;

mp_bvec_for_each_segment(tmp, bv, i, iter_all)
      get_page(tmp->bv_page);

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 128+ messages in thread
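
A minimal sketch of the loop suggested above, assuming the 5.1-era
multi-page bvec helpers (struct bvec_iter_all, mp_bvec_for_each_segment())
from the then-current series; illustration only, not the fix that was
merged:

#include <linux/bio.h>
#include <linux/bvec.h>
#include <linux/mm.h>

/*
 * Grab a reference on every single-page segment of a (possibly
 * multi-page) bvec, so the later bio_release_pages() at IO completion
 * stays balanced. Hypothetical helper name; the real change would sit
 * inside __bio_iov_bvec_add_pages().
 */
static void bio_get_bvec_pages(const struct bio_vec *bv)
{
        struct bvec_iter_all iter_all;
        struct bio_vec *seg;
        int i = 0;

        mp_bvec_for_each_segment(seg, bv, i, iter_all)
                get_page(seg->bv_page);
}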

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  1:21               ` Ming Lei
@ 2019-02-27  1:47                 ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  1:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Eric Biggers, Ming Lei, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 6:21 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>> Hi Jens,
>>>>
>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>
>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>
>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>> ---
>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>
>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>> --- a/block/bio.c
>>>>>>> +++ b/block/bio.c
>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>  }
>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>
>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>> +{
>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>> + unsigned int len;
>>>>>>> + size_t size;
>>>>>>> +
>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>
>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>> can be observed when running xfstests over loop/dio.
>>>>>
>>>>> Thanks, I folded this in.
>>>>>
>>>>> --
>>>>> Jens Axboe
>>>>>
>>>>
>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>
>>>> Reproducer:
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <fcntl.h>
>>>> #include <linux/loop.h>
>>>> #include <sys/ioctl.h>
>>>> #include <sys/sendfile.h>
>>>> #include <sys/syscall.h>
>>>> #include <unistd.h>
>>>>
>>>> int main(void)
>>>> {
>>>>         int memfd, loopfd;
>>>>
>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>
>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>
>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>
>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>
>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>> }
>>>>
>>>>
>>>> Crash:
>>>>
>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>> flags: 0x100000000000000()
>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>
>>> I see what this is, I'll cut a fix for this tomorrow.
>>
>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>> branch.
> 
> Hi Jens,
> 
> I saw the following change is added:
> 
> + if (size == len) {
> + /*
> + * For the normal O_DIRECT case, we could skip grabbing this
> + * reference and then not have to put them again when IO
> + * completes. But this breaks some in-kernel users, like
> + * splicing to/from a loop device, where we release the pipe
> + * pages unconditionally. If we can fix that case, we can
> + * get rid of the get here and the need to call
> + * bio_release_pages() at IO completion time.
> + */
> + get_page(bv->bv_page);
> 
> Now the 'bv' may point to more than one page, so the following one may be
> needed:
> 
> int i;
> struct bvec_iter_all iter_all;
> struct bio_vec *tmp;
> 
> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>       get_page(tmp->bv_page);

I guess that would be the safest, even if we don't currently have more
than one page in there. I'll fix it up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  1:47                 ` Jens Axboe
@ 2019-02-27  1:53                   ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27  1:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> On 2/26/19 6:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>> Hi Jens,
> >>>>
> >>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>
> >>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>
> >>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>> ---
> >>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>> --- a/block/bio.c
> >>>>>>> +++ b/block/bio.c
> >>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>  }
> >>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>
> >>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>> +{
> >>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>> + unsigned int len;
> >>>>>>> + size_t size;
> >>>>>>> +
> >>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>
> >>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>> can be observed when running xfstests over loop/dio.
> >>>>>
> >>>>> Thanks, I folded this in.
> >>>>>
> >>>>> --
> >>>>> Jens Axboe
> >>>>>
> >>>>
> >>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>
> >>>> Reproducer:
> >>>>
> >>>> #define _GNU_SOURCE
> >>>> #include <fcntl.h>
> >>>> #include <linux/loop.h>
> >>>> #include <sys/ioctl.h>
> >>>> #include <sys/sendfile.h>
> >>>> #include <sys/syscall.h>
> >>>> #include <unistd.h>
> >>>>
> >>>> int main(void)
> >>>> {
> >>>>         int memfd, loopfd;
> >>>>
> >>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>
> >>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>
> >>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>
> >>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>
> >>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>> }
> >>>>
> >>>>
> >>>> Crash:
> >>>>
> >>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>> flags: 0x100000000000000()
> >>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>
> >>> I see what this is, I'll cut a fix for this tomorrow.
> >>
> >> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >> branch.
> > 
> > Hi Jens,
> > 
> > I saw the following change is added:
> > 
> > + if (size == len) {
> > + /*
> > + * For the normal O_DIRECT case, we could skip grabbing this
> > + * reference and then not have to put them again when IO
> > + * completes. But this breaks some in-kernel users, like
> > + * splicing to/from a loop device, where we release the pipe
> > + * pages unconditionally. If we can fix that case, we can
> > + * get rid of the get here and the need to call
> > + * bio_release_pages() at IO completion time.
> > + */
> > + get_page(bv->bv_page);
> > 
> > Now the 'bv' may point to more than one page, so the following one may be
> > needed:
> > 
> > int i;
> > struct bvec_iter_all iter_all;
> > struct bio_vec *tmp;
> > 
> > mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >       get_page(tmp->bv_page);
> 
> I guess that would be the safest, even if we don't currently have more
> than one page in there. I'll fix it up.

It is easy to see multipage bvec from loop, :-)

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  1:53                   ` Ming Lei
@ 2019-02-27  1:57                     ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  1:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 6:53 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>
>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>
>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>> ---
>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>> --- a/block/bio.c
>>>>>>>>> +++ b/block/bio.c
>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>  }
>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>
>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>> +{
>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>> + unsigned int len;
>>>>>>>>> + size_t size;
>>>>>>>>> +
>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>
>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>
>>>>>>> Thanks, I folded this in.
>>>>>>>
>>>>>>> --
>>>>>>> Jens Axboe
>>>>>>>
>>>>>>
>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>
>>>>>> Reproducer:
>>>>>>
>>>>>> #define _GNU_SOURCE
>>>>>> #include <fcntl.h>
>>>>>> #include <linux/loop.h>
>>>>>> #include <sys/ioctl.h>
>>>>>> #include <sys/sendfile.h>
>>>>>> #include <sys/syscall.h>
>>>>>> #include <unistd.h>
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>         int memfd, loopfd;
>>>>>>
>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>
>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>
>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>
>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>
>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Crash:
>>>>>>
>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>> flags: 0x100000000000000()
>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>
>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>
>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>> branch.
>>>
>>> Hi Jens,
>>>
>>> I saw the following change is added:
>>>
>>> + if (size == len) {
>>> + /*
>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>> + * reference and then not have to put them again when IO
>>> + * completes. But this breaks some in-kernel users, like
>>> + * splicing to/from a loop device, where we release the pipe
>>> + * pages unconditionally. If we can fix that case, we can
>>> + * get rid of the get here and the need to call
>>> + * bio_release_pages() at IO completion time.
>>> + */
>>> + get_page(bv->bv_page);
>>>
>>> Now the 'bv' may point to more than one page, so the following one may be
>>> needed:
>>>
>>> int i;
>>> struct bvec_iter_all iter_all;
>>> struct bio_vec *tmp;
>>>
>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>       get_page(tmp->bv_page);
>>
>> I guess that would be the safest, even if we don't currently have more
>> than one page in there. I'll fix it up.
> 
> It is easy to see multipage bvec from loop, :-)

Speaking of this, I took a quick look at why we've now regressed a lot
on IOPS perf with the multipage work. It looks like it's all related to
the (much) fatter setup around iteration, which is related to this very
topic too.

Basically setup of things like bio_for_each_bvec() and indexing through
nth_page() is MUCH slower than before.

We need to do something about this, it's like tossing out months of
optimizations.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread
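
A rough sketch of the two iteration flavours behind the setup-cost point
above, assuming the v5.1-rc bio_for_each_segment()/bio_for_each_bvec()
interfaces; the helper is hypothetical and for illustration only:

#include <linux/bio.h>

/*
 * Count how many steps each iterator takes over the same bio.
 * bio_for_each_segment() yields single-page segments, so a multi-page
 * bvec is carved up on the fly (the path that indexes through
 * nth_page(), per the note above); bio_for_each_bvec() is meant to
 * yield each (possibly multi-page) bvec in a single step.
 */
static void count_iter_steps(struct bio *bio, unsigned int *nr_segs,
                             unsigned int *nr_bvecs)
{
        struct bvec_iter iter;
        struct bio_vec bv;

        *nr_segs = 0;
        bio_for_each_segment(bv, bio, iter)     /* at most one page per step */
                (*nr_segs)++;

        *nr_bvecs = 0;
        bio_for_each_bvec(bv, bio, iter)        /* one full bvec per step */
                (*nr_bvecs)++;
}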

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  1:57                     ` Jens Axboe
  (?)
@ 2019-02-27  2:03                     ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  2:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

[-- Attachment #1: Type: text/plain, Size: 846 bytes --]

On 2/26/19 6:57 PM, Jens Axboe wrote:
> Speaking of this, I took a quick look at why we've now regressed a lot
> on IOPS perf with the multipage work. It looks like it's all related to
> the (much) fatter setup around iteration, which is related to this very
> topic too.
> 
> Basically setup of things like bio_for_each_bvec() and indexing through
> nth_page() is MUCH slower than before.
> 
> We need to do something about this, it's like tossing out months of
> optimizations.

See the attached quick profile. This is doing 4k IOs, btw, so just
single-segment requests. Yet we spend most of the time in blk_rq_map_sg(),
and blk_queue_split() is not far behind. Between just the two of them,
that's more than 8% of the total CPU utilization. As this test case is
running on just one CPU for all of it, that's a LOT of wasted time.

-- 
Jens Axboe


[-- Attachment #2: perf.jpg --]
[-- Type: image/jpeg, Size: 191091 bytes --]

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  1:57                     ` Jens Axboe
@ 2019-02-27  2:21                       ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27  2:21 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> On 2/26/19 6:53 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>> Hi Jens,
> >>>>>>
> >>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>
> >>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>
> >>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>> ---
> >>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>> --- a/block/bio.c
> >>>>>>>>> +++ b/block/bio.c
> >>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>  }
> >>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>
> >>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>> +{
> >>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>> + unsigned int len;
> >>>>>>>>> + size_t size;
> >>>>>>>>> +
> >>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>
> >>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>
> >>>>>>> Thanks, I folded this in.
> >>>>>>>
> >>>>>>> --
> >>>>>>> Jens Axboe
> >>>>>>>
> >>>>>>
> >>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>
> >>>>>> Reproducer:
> >>>>>>
> >>>>>> #define _GNU_SOURCE
> >>>>>> #include <fcntl.h>
> >>>>>> #include <linux/loop.h>
> >>>>>> #include <sys/ioctl.h>
> >>>>>> #include <sys/sendfile.h>
> >>>>>> #include <sys/syscall.h>
> >>>>>> #include <unistd.h>
> >>>>>>
> >>>>>> int main(void)
> >>>>>> {
> >>>>>>         int memfd, loopfd;
> >>>>>>
> >>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>
> >>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>
> >>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>
> >>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>
> >>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>> }
> >>>>>>
> >>>>>>
> >>>>>> Crash:
> >>>>>>
> >>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>> flags: 0x100000000000000()
> >>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>
> >>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>
> >>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>> branch.
> >>>
> >>> Hi Jens,
> >>>
> >>> I saw the following change is added:
> >>>
> >>> + if (size == len) {
> >>> + /*
> >>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>> + * reference and then not have to put them again when IO
> >>> + * completes. But this breaks some in-kernel users, like
> >>> + * splicing to/from a loop device, where we release the pipe
> >>> + * pages unconditionally. If we can fix that case, we can
> >>> + * get rid of the get here and the need to call
> >>> + * bio_release_pages() at IO completion time.
> >>> + */
> >>> + get_page(bv->bv_page);
> >>>
> >>> Now the 'bv' may point to more than one page, so the following one may be
> >>> needed:
> >>>
> >>> int i;
> >>> struct bvec_iter_all iter_all;
> >>> struct bio_vec *tmp;
> >>>
> >>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>       get_page(tmp->bv_page);
> >>
> >> I guess that would be the safest, even if we don't currently have more
> >> than one page in there. I'll fix it up.
> > 
> > It is easy to see multipage bvec from loop, :-)
> 
> Speaking of this, I took a quick look at why we've now regressed a lot
> on IOPS perf with the multipage work. It looks like it's all related to
> the (much) fatter setup around iteration, which is related to this very
> topic too.
> 
> Basically setup of things like bio_for_each_bvec() and indexing through
> nth_page() is MUCH slower than before.

But bio_for_each_bvec() doesn't need nth_page(); only bio_for_each_segment()
needs that. However, bio_for_each_segment() isn't called from
blk_queue_split() or blk_rq_map_sg().

One issue is that bio_for_each_bvec() still advances by page size instead
of bvec->len. I guess that is the problem; I will cook a patch for your
test.

> 
> We need to do something about this, it's like tossing out months of
> optimizations.

Some follow-up optimizations can be done, such as removing
biovec_phys_mergeable() from blk_bio_segment_split().


Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread
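
A sketch of the "advance by bvec->len" point above, using only the
generic bvec_iter fields and bvec_iter_advance(); the helper is
hypothetical, not the patch Ming refers to:

#include <linux/bvec.h>
#include <linux/kernel.h>

/*
 * Walk an array of (possibly multi-page) bvecs one bvec at a time,
 * advancing the iterator by the remaining length of the current bvec
 * in a single step rather than a page at a time.
 */
static void walk_by_bvec(const struct bio_vec *bvecs, struct bvec_iter iter)
{
        while (iter.bi_size) {
                /* remainder of the current bvec, capped to what is left */
                struct bio_vec bv = bvecs[iter.bi_idx];

                bv.bv_offset += iter.bi_bvec_done;
                bv.bv_len -= iter.bi_bvec_done;
                bv.bv_len = min(bv.bv_len, iter.bi_size);

                /* ... process bv ... */

                bvec_iter_advance(bvecs, &iter, bv.bv_len);
        }
}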

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  2:21                       ` Ming Lei
@ 2019-02-27  2:28                         ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  2:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 7:21 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>> Hi Jens,
>>>>>>>>
>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>
>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>
>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>> ---
>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>  }
>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>
>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>> +{
>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>> + size_t size;
>>>>>>>>>>> +
>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>
>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>
>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jens Axboe
>>>>>>>>>
>>>>>>>>
>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>
>>>>>>>> Reproducer:
>>>>>>>>
>>>>>>>> #define _GNU_SOURCE
>>>>>>>> #include <fcntl.h>
>>>>>>>> #include <linux/loop.h>
>>>>>>>> #include <sys/ioctl.h>
>>>>>>>> #include <sys/sendfile.h>
>>>>>>>> #include <sys/syscall.h>
>>>>>>>> #include <unistd.h>
>>>>>>>>
>>>>>>>> int main(void)
>>>>>>>> {
>>>>>>>>         int memfd, loopfd;
>>>>>>>>
>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>
>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>
>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>
>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>
>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Crash:
>>>>>>>>
>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>> flags: 0x100000000000000()
>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>
>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>
>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>> branch.
>>>>>
>>>>> Hi Jens,
>>>>>
>>>>> I saw the following change is added:
>>>>>
>>>>> + if (size == len) {
>>>>> + /*
>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>> + * reference and then not have to put them again when IO
>>>>> + * completes. But this breaks some in-kernel users, like
>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>> + * get rid of the get here and the need to call
>>>>> + * bio_release_pages() at IO completion time.
>>>>> + */
>>>>> + get_page(bv->bv_page);
>>>>>
>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>> needed:
>>>>>
>>>>> int i;
>>>>> struct bvec_iter_all iter_all;
>>>>> struct bio_vec *tmp;
>>>>>
>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>       get_page(tmp->bv_page);
>>>>
>>>> I guess that would be the safest, even if we don't currently have more
>>>> than one page in there. I'll fix it up.
>>>
>>> It is easy to see multipage bvec from loop, :-)
>>
>> Speaking of this, I took a quick look at why we've now regressed a lot
>> on IOPS perf with the multipage work. It looks like it's all related to
>> the (much) fatter setup around iteration, which is related to this very
>> topic too.
>>
>> Basically setup of things like bio_for_each_bvec() and indexing through
>> nth_page() is MUCH slower than before.
> 
> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> needs that. However, bio_for_each_segment() isn't called from
> blk_queue_split() and blk_rq_map_sg().
> 
> One issue is that bio_for_each_bvec() still advances by page size
> instead of bvec->len, I guess that is the problem, will cook a patch
> for your test.

Probably won't make a difference for my test case...

>> We need to do something about this, it's like tossing out months of
>> optimizations.
> 
> Some following optimization can be done, such as removing
> biovec_phys_mergeable() from blk_bio_segment_split().

I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
that it is possible. But iteration startup cost is a problem in a lot of
spots, and a split fast path will only help a bit for that specific
case.
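
For what it's worth, here is a rough standalone sketch of the kind of
check such a fast path could make before falling into the full iteration.
The struct layout and helper name are hypothetical and invented for the
example; this is not a patch against blk_bio_segment_split(), just the
shape of the shortcut: a single-bvec bio that fits in one page and does
not cross a page boundary needs no segment counting or splitting.

#include <stdbool.h>
#include <stdio.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed 4k pages */

struct sketch_bio {
	unsigned int vcnt;	/* number of bvecs in the bio */
	unsigned int first_len;	/* byte length of the first bvec */
	unsigned int first_off;	/* offset of the first bvec into its page */
};

static bool small_io_needs_no_split(const struct sketch_bio *bio)
{
	if (bio->vcnt != 1)
		return false;
	if (bio->first_len > SKETCH_PAGE_SIZE)
		return false;
	/* must not straddle a page boundary */
	return bio->first_off + bio->first_len <= SKETCH_PAGE_SIZE;
}

int main(void)
{
	struct sketch_bio aligned = { .vcnt = 1, .first_len = 4096, .first_off = 0 };
	struct sketch_bio straddle = { .vcnt = 1, .first_len = 4096, .first_off = 512 };

	printf("4k aligned: skip split? %d\n",
	       small_io_needs_no_split(&aligned));	/* 1 */
	printf("4k at offset 512: skip split? %d\n",
	       small_io_needs_no_split(&straddle));	/* 0 */
	return 0;
}

Whether that shortcut can safely bypass the blk_queue_split() and
blk_rq_map_sg() setup is exactly the open question.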

A 5% regression is HUGE. I know I've mentioned this before, I just want
to really stress how big of a deal that is. It's enough to make me
consider just reverting it again, which sucks, but I don't feel great
shipping something that is known to be that much slower.

Suggestions?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  2:28                         ` Jens Axboe
@ 2019-02-27  2:37                           ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27  2:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> On 2/26/19 7:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>> Hi Jens,
> >>>>>>>>
> >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>
> >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>
> >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>  }
> >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>
> >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>> + size_t size;
> >>>>>>>>>>> +
> >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>
> >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>
> >>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Jens Axboe
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>
> >>>>>>>> Reproducer:
> >>>>>>>>
> >>>>>>>> #define _GNU_SOURCE
> >>>>>>>> #include <fcntl.h>
> >>>>>>>> #include <linux/loop.h>
> >>>>>>>> #include <sys/ioctl.h>
> >>>>>>>> #include <sys/sendfile.h>
> >>>>>>>> #include <sys/syscall.h>
> >>>>>>>> #include <unistd.h>
> >>>>>>>>
> >>>>>>>> int main(void)
> >>>>>>>> {
> >>>>>>>>         int memfd, loopfd;
> >>>>>>>>
> >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>
> >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>
> >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>
> >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>
> >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Crash:
> >>>>>>>>
> >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>> flags: 0x100000000000000()
> >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>
> >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>
> >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>> branch.
> >>>>>
> >>>>> Hi Jens,
> >>>>>
> >>>>> I saw the following change is added:
> >>>>>
> >>>>> + if (size == len) {
> >>>>> + /*
> >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>> + * reference and then not have to put them again when IO
> >>>>> + * completes. But this breaks some in-kernel users, like
> >>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>> + * get rid of the get here and the need to call
> >>>>> + * bio_release_pages() at IO completion time.
> >>>>> + */
> >>>>> + get_page(bv->bv_page);
> >>>>>
> >>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>> needed:
> >>>>>
> >>>>> int i;
> >>>>> struct bvec_iter_all iter_all;
> >>>>> struct bio_vec *tmp;
> >>>>>
> >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>       get_page(tmp->bv_page);
> >>>>
> >>>> I guess that would be the safest, even if we don't currently have more
> >>>> than one page in there. I'll fix it up.
> >>>
> >>> It is easy to see multipage bvec from loop, :-)
> >>
> >> Speaking of this, I took a quick look at why we've now regressed a lot
> >> on IOPS perf with the multipage work. It looks like it's all related to
> >> the (much) fatter setup around iteration, which is related to this very
> >> topic too.
> >>
> >> Basically setup of things like bio_for_each_bvec() and indexing through
> >> nth_page() is MUCH slower than before.
> > 
> > But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> > needs that. However, bio_for_each_segment() isn't called from
> > blk_queue_split() and blk_rq_map_sg().
> > 
> > One issue is that bio_for_each_bvec() still advances by page size
> > instead of bvec->len, I guess that is the problem, will cook a patch
> > for your test.
> 
> Probably won't make a difference for my test case...
> 
> >> We need to do something about this, it's like tossing out months of
> >> optimizations.
> > 
> > Some following optimization can be done, such as removing
> > biovec_phys_mergeable() from blk_bio_segment_split().
> 
> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> that it is possible. But iteration startup cost is a problem in a lot of
> spots, and a split fast path will only help a bit for that specific
> case.
> 
> 5% regressions is HUGE. I know I've mentioned this before, I just want
> to really stress how big of a deal that is. It's enough to make me
> consider just reverting it again, which sucks, but I don't feel great
> shipping something that is known that much slower.
> 
> Suggestions?

You mentioned that nth_page() costs a lot in bio_for_each_bvec(), but
bio_for_each_bvec() shouldn't call into nth_page() at all. I will look
into it first.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  2:37                           ` Ming Lei
@ 2019-02-27  2:43                             ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  2:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 7:37 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>> Hi Jens,
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>  }
>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>
>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>
>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>
>>>>>>>>>> Reproducer:
>>>>>>>>>>
>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>
>>>>>>>>>> int main(void)
>>>>>>>>>> {
>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>
>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>
>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>
>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>
>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>
>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Crash:
>>>>>>>>>>
>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>
>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>
>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>> branch.
>>>>>>>
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> I saw the following change is added:
>>>>>>>
>>>>>>> + if (size == len) {
>>>>>>> + /*
>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>> + * reference and then not have to put them again when IO
>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>> + * get rid of the get here and the need to call
>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>> + */
>>>>>>> + get_page(bv->bv_page);
>>>>>>>
>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>> needed:
>>>>>>>
>>>>>>> int i;
>>>>>>> struct bvec_iter_all iter_all;
>>>>>>> struct bio_vec *tmp;
>>>>>>>
>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>       get_page(tmp->bv_page);
>>>>>>
>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>> than one page in there. I'll fix it up.
>>>>>
>>>>> It is easy to see multipage bvec from loop, :-)
>>>>
>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>> the (much) fatter setup around iteration, which is related to this very
>>>> topic too.
>>>>
>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>> nth_page() is MUCH slower than before.
>>>
>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>> needs that. However, bio_for_each_segment() isn't called from
>>> blk_queue_split() and blk_rq_map_sg().
>>>
>>> One issue is that bio_for_each_bvec() still advances by page size
>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>> for your test.
>>
>> Probably won't make a difference for my test case...
>>
>>>> We need to do something about this, it's like tossing out months of
>>>> optimizations.
>>>
>>> Some following optimization can be done, such as removing
>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>
>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>> that it is possible. But iteration startup cost is a problem in a lot of
>> spots, and a split fast path will only help a bit for that specific
>> case.
>>
>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>> to really stress how big of a deal that is. It's enough to make me
>> consider just reverting it again, which sucks, but I don't feel great
>> shipping something that is known that much slower.
>>
>> Suggestions?
> 
> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> shouldn't call into nth_page(). I will look into it first.

I'll check on the test box tomorrow; I lost connectivity to it earlier.
I'll double check in the morning.

I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
consumer.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  2:43                             ` Jens Axboe
@ 2019-02-27  3:09                               ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> On 2/26/19 7:37 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> >> On 2/26/19 7:21 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>
> >>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>>>> Hi Jens,
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>>>  }
> >>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>>>> +{
> >>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>>>> + size_t size;
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>>>
> >>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>>>
> >>>>>>>>>> Reproducer:
> >>>>>>>>>>
> >>>>>>>>>> #define _GNU_SOURCE
> >>>>>>>>>> #include <fcntl.h>
> >>>>>>>>>> #include <linux/loop.h>
> >>>>>>>>>> #include <sys/ioctl.h>
> >>>>>>>>>> #include <sys/sendfile.h>
> >>>>>>>>>> #include <sys/syscall.h>
> >>>>>>>>>> #include <unistd.h>
> >>>>>>>>>>
> >>>>>>>>>> int main(void)
> >>>>>>>>>> {
> >>>>>>>>>>         int memfd, loopfd;
> >>>>>>>>>>
> >>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>>>
> >>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>>>
> >>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>>>
> >>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>>>
> >>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Crash:
> >>>>>>>>>>
> >>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>>>> flags: 0x100000000000000()
> >>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>>>
> >>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>>>
> >>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>>>> branch.
> >>>>>>>
> >>>>>>> Hi Jens,
> >>>>>>>
> >>>>>>> I saw the following change is added:
> >>>>>>>
> >>>>>>> + if (size == len) {
> >>>>>>> + /*
> >>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>>>> + * reference and then not have to put them again when IO
> >>>>>>> + * completes. But this breaks some in-kernel users, like
> >>>>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>>>> + * get rid of the get here and the need to call
> >>>>>>> + * bio_release_pages() at IO completion time.
> >>>>>>> + */
> >>>>>>> + get_page(bv->bv_page);
> >>>>>>>
> >>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>>>> needed:
> >>>>>>>
> >>>>>>> int i;
> >>>>>>> struct bvec_iter_all iter_all;
> >>>>>>> struct bio_vec *tmp;
> >>>>>>>
> >>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>>>       get_page(tmp->bv_page);
> >>>>>>
> >>>>>> I guess that would be the safest, even if we don't currently have more
> >>>>>> than one page in there. I'll fix it up.
> >>>>>
> >>>>> It is easy to see multipage bvec from loop, :-)
> >>>>
> >>>> Speaking of this, I took a quick look at why we've now regressed a lot
> >>>> on IOPS perf with the multipage work. It looks like it's all related to
> >>>> the (much) fatter setup around iteration, which is related to this very
> >>>> topic too.
> >>>>
> >>>> Basically setup of things like bio_for_each_bvec() and indexing through
> >>>> nth_page() is MUCH slower than before.
> >>>
> >>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> >>> needs that. However, bio_for_each_segment() isn't called from
> >>> blk_queue_split() and blk_rq_map_sg().
> >>>
> >>> One issue is that bio_for_each_bvec() still advances by page size
> >>> instead of bvec->len, I guess that is the problem, will cook a patch
> >>> for your test.
> >>
> >> Probably won't make a difference for my test case...
> >>
> >>>> We need to do something about this, it's like tossing out months of
> >>>> optimizations.
> >>>
> >>> Some following optimization can be done, such as removing
> >>> biovec_phys_mergeable() from blk_bio_segment_split().
> >>
> >> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> >> that it is possible. But iteration startup cost is a problem in a lot of
> >> spots, and a split fast path will only help a bit for that specific
> >> case.
> >>
> >> 5% regressions is HUGE. I know I've mentioned this before, I just want
> >> to really stress how big of a deal that is. It's enough to make me
> >> consider just reverting it again, which sucks, but I don't feel great
> >> shipping something that is known that much slower.
> >>
> >> Suggestions?
> > 
> > You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> > shouldn't call into nth_page(). I will look into it first.
> 
> I'll check on the test box tomorrow, I lost connectivity before. I'll
> double check in the morning.
> 
> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> consumer.

Hi Jens,

Could you test the following patch which may improve on the 4k randio
test case?

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 066b66430523..c1ad8abbd9d6 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -447,7 +447,7 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
 	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
-static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+static inline struct scatterlist *blk_next_sg(struct scatterlist **sg,
 		struct scatterlist *sglist)
 {
 	if (!*sg)
@@ -483,7 +483,7 @@ static unsigned blk_bvec_map_sg(struct request_queue *q,
 
 		offset = (total + bvec->bv_offset) % PAGE_SIZE;
 		idx = (total + bvec->bv_offset) / PAGE_SIZE;
-		pg = nth_page(bvec->bv_page, idx);
+		pg = bvec_nth_page(bvec->bv_page, idx);
 
 		sg_set_page(*sg, pg, seg_size, offset);
 
@@ -512,7 +512,12 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 		(*sg)->length += nbytes;
 	} else {
 new_segment:
-		(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
+		if (bvec->bv_offset + bvec->bv_len <= PAGE_SIZE) {
+			*sg = blk_next_sg(sg, sglist);
+			sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
+			(*nsegs) += 1;
+		} else
+			(*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
 	}
 	*bvprv = *bvec;
 }
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 30a57b68d017..4376f683c08a 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -51,6 +51,11 @@ struct bvec_iter_all {
 	unsigned	done;
 };
 
+static inline struct page *bvec_nth_page(struct page *page, int idx)
+{
+	return idx == 0 ? page : nth_page(page, idx);
+}
+
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
@@ -87,8 +92,8 @@ struct bvec_iter_all {
 	      PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
 
 #define bvec_iter_page(bvec, iter)				\
-	nth_page(mp_bvec_iter_page((bvec), (iter)),		\
-		 mp_bvec_iter_page_idx((bvec), (iter)))
+	bvec_nth_page(mp_bvec_iter_page((bvec), (iter)),		\
+		      mp_bvec_iter_page_idx((bvec), (iter)))
 
 #define bvec_iter_bvec(bvec, iter)				\
 ((struct bio_vec) {						\
@@ -171,7 +176,7 @@ static inline void mp_bvec_last_segment(const struct bio_vec *bvec,
 	unsigned total = bvec->bv_offset + bvec->bv_len;
 	unsigned last_page = (total - 1) / PAGE_SIZE;
 
-	seg->bv_page = nth_page(bvec->bv_page, last_page);
+	seg->bv_page = bvec_nth_page(bvec->bv_page, last_page);
 
 	/* the whole segment is inside the last page */
 	if (bvec->bv_offset >= last_page * PAGE_SIZE) {

thanks,
Ming

^ permalink raw reply related	[flat|nested] 128+ messages in thread
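
A note on why the bvec_nth_page() shortcut in the patch above is worth
having: nth_page() is only plain pointer arithmetic on some memory
models. Roughly, per include/linux/mm.h of that era (reproduced here from
memory, so treat it as a sketch rather than the exact upstream
definition):

/* cheap pointer math with VMEMMAP, a pfn round-trip otherwise */
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page, n)	pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page, n)	((page) + (n))
#endif

With single-page bvecs dominating a 4k test, the idx == 0 check turns the
common case into simple reuse of the page pointer instead of a
per-segment pfn conversion.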

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  3:09                               ` Ming Lei
@ 2019-02-27  3:37                                 ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  3:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 8:09 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
>> On 2/26/19 7:37 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>>>
>>>>>>>>>>>> Reproducer:
>>>>>>>>>>>>
>>>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>
>>>>>>>>>>>> int main(void)
>>>>>>>>>>>> {
>>>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>>>
>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>>>
>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>>>
>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>>>
>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>>>
>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Crash:
>>>>>>>>>>>>
>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>>>
>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>>>
>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>>>> branch.
>>>>>>>>>
>>>>>>>>> Hi Jens,
>>>>>>>>>
>>>>>>>>> I saw the following change is added:
>>>>>>>>>
>>>>>>>>> + if (size == len) {
>>>>>>>>> + /*
>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>>>> + * reference and then not have to put them again when IO
>>>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>>>> + * get rid of the get here and the need to call
>>>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>>>> + */
>>>>>>>>> + get_page(bv->bv_page);
>>>>>>>>>
>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>>>> needed:
>>>>>>>>>
>>>>>>>>> int i;
>>>>>>>>> struct bvec_iter_all iter_all;
>>>>>>>>> struct bio_vec *tmp;
>>>>>>>>>
>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>>>       get_page(tmp->bv_page);
>>>>>>>>
>>>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>>>> than one page in there. I'll fix it up.
>>>>>>>
>>>>>>> It is easy to see multipage bvec from loop, :-)
>>>>>>
>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>>>> the (much) fatter setup around iteration, which is related to this very
>>>>>> topic too.
>>>>>>
>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>>>> nth_page() is MUCH slower than before.
>>>>>
>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>>>> needs that. However, bio_for_each_segment() isn't called from
>>>>> blk_queue_split() and blk_rq_map_sg().
>>>>>
>>>>> One issue is that bio_for_each_bvec() still advances by page size
>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>>>> for your test.
>>>>
>>>> Probably won't make a difference for my test case...
>>>>
>>>>>> We need to do something about this, it's like tossing out months of
>>>>>> optimizations.
>>>>>
>>>>> Some following optimization can be done, such as removing
>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>>>
>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>>>> that it is possible. But iteration startup cost is a problem in a lot of
>>>> spots, and a split fast path will only help a bit for that specific
>>>> case.
>>>>
>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>>>> to really stress how big of a deal that is. It's enough to make me
>>>> consider just reverting it again, which sucks, but I don't feel great
>>>> shipping something that is known that much slower.
>>>>
>>>> Suggestions?
>>>
>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
>>> shouldn't call into nth_page(). I will look into it first.
>>
>> I'll check on the test box tomorrow, I lost connectivity before. I'll
>> double check in the morning.
>>
>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
>> consumer.
> 
> Hi Jens,
> 
> Could you test the following patch which may improve on the 4k randio
> test case?

A bit, it's up 1% with this patch. I'm going to try without the
get_page/put_page that we had earlier, to see where we are relative to
the old baseline.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread
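
For context on what "without the get_page/put_page" means above: the
patch at the top of this thread adds a BIO_NO_PAGE_REF flag so that a bio
built from an ITER_BVEC does not release the caller's pages at IO
completion, and the fix folded in later grabs a reference per page at
setup so the loop/splice case stays safe. A minimal sketch of the
completion side, assuming only the flag semantics described in the commit
message (this is not the exact upstream code):

/*
 * Sketch: drop the per-page references taken at submission time, unless
 * the bio was flagged to skip page references entirely.
 */
static void bio_release_pages_sketch(struct bio *bio)
{
	struct bio_vec *bvec = bio->bi_io_vec;
	unsigned int i;

	if (bio_flagged(bio, BIO_NO_PAGE_REF))
		return;

	for (i = 0; i < bio->bi_vcnt; i++)
		put_page(bvec[i].bv_page);
}

The comparison Jens mentions drops both sides of this (the setup-time
get_page() and the put_page() loop above) to see how far that gets back
toward the old baseline.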

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  3:37                                 ` Jens Axboe
@ 2019-02-27  3:43                                   ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  3:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 8:37 PM, Jens Axboe wrote:
> On 2/26/19 8:09 PM, Ming Lei wrote:
>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
>>> On 2/26/19 7:37 PM, Ming Lei wrote:
>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reproducer:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>>
>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>> {
>>>>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>>>>
>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>>>>
>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>>>>
>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>>>>
>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>>>>
>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Crash:
>>>>>>>>>>>>>
>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>>>>
>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>>>>
>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>>>>> branch.
>>>>>>>>>>
>>>>>>>>>> Hi Jens,
>>>>>>>>>>
>>>>>>>>>> I saw the following change is added:
>>>>>>>>>>
>>>>>>>>>> + if (size == len) {
>>>>>>>>>> + /*
>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>>>>> + * reference and then not have to put them again when IO
>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>>>>> + * get rid of the get here and the need to call
>>>>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>>>>> + */
>>>>>>>>>> + get_page(bv->bv_page);
>>>>>>>>>>
>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>>>>> needed:
>>>>>>>>>>
>>>>>>>>>> int i;
>>>>>>>>>> struct bvec_iter_all iter_all;
>>>>>>>>>> struct bio_vec *tmp;
>>>>>>>>>>
>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>>>>       get_page(tmp->bv_page);
>>>>>>>>>
>>>>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>>>>> than one page in there. I'll fix it up.
>>>>>>>>
>>>>>>>> It is easy to see multipage bvec from loop, :-)
>>>>>>>
>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>>>>> the (much) fatter setup around iteration, which is related to this very
>>>>>>> topic too.
>>>>>>>
>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>>>>> nth_page() is MUCH slower than before.
>>>>>>
>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>>>>> needs that. However, bio_for_each_segment() isn't called from
>>>>>> blk_queue_split() and blk_rq_map_sg().
>>>>>>
>>>>>> One issue is that bio_for_each_bvec() still advances by page size
>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>>>>> for your test.
>>>>>
>>>>> Probably won't make a difference for my test case...
>>>>>
>>>>>>> We need to do something about this, it's like tossing out months of
>>>>>>> optimizations.
>>>>>>
>>>>>> Some following optimization can be done, such as removing
>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>>>>
>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>>>>> that it is possible. But iteration startup cost is a problem in a lot of
>>>>> spots, and a split fast path will only help a bit for that specific
>>>>> case.
>>>>>
>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>>>>> to really stress how big of a deal that is. It's enough to make me
>>>>> consider just reverting it again, which sucks, but I don't feel great
>>>>> shipping something that is known that much slower.
>>>>>
>>>>> Suggestions?
>>>>
>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
>>>> shouldn't call into nth_page(). I will look into it first.
>>>
>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
>>> double check in the morning.
>>>
>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
>>> consumer.
>>
>> Hi Jens,
>>
>> Could you test the following patch which may improve on the 4k randio
>> test case?
> 
> A bit, it's up 1% with this patch. I'm going to try without the
> get_page/put_page that we had earlier, to see where we are relative to
> the old baseline.

~1548K now, down from the 1615-1620K baseline, which matches the earlier
numbers. That puts us down roughly 4% instead of the original 5%, with
this recent patch accounting for the reclaimed 1%.

So that's a good start, but still 4% to go.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread
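
For reference, the submission-side half of that get_page()/put_page()
pair, following Ming's point earlier in the thread that a bvec coming
from loop can span several pages, would look roughly like this (a sketch
built around the mp_bvec_for_each_segment() helper quoted above, not the
final upstream code):

/*
 * Sketch: take a reference on every page backing a (possibly multipage)
 * bvec, so the generic put_page() at completion stays balanced even when
 * loop hands us a bvec covering more than one page.
 */
static void bio_get_bvec_pages_sketch(const struct bio_vec *bv)
{
	struct bvec_iter_all iter_all;
	struct bio_vec *tmp;
	int i;

	mp_bvec_for_each_segment(tmp, bv, i, iter_all)
		get_page(tmp->bv_page);
}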

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  3:37                                 ` Jens Axboe
@ 2019-02-27  3:44                                   ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27  3:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> On 2/26/19 8:09 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> >> On 2/26/19 7:37 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>>>>>> + size_t size;
> >>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --
> >>>>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Reproducer:
> >>>>>>>>>>>>
> >>>>>>>>>>>> #define _GNU_SOURCE
> >>>>>>>>>>>> #include <fcntl.h>
> >>>>>>>>>>>> #include <linux/loop.h>
> >>>>>>>>>>>> #include <sys/ioctl.h>
> >>>>>>>>>>>> #include <sys/sendfile.h>
> >>>>>>>>>>>> #include <sys/syscall.h>
> >>>>>>>>>>>> #include <unistd.h>
> >>>>>>>>>>>>
> >>>>>>>>>>>> int main(void)
> >>>>>>>>>>>> {
> >>>>>>>>>>>>         int memfd, loopfd;
> >>>>>>>>>>>>
> >>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>>>>>
> >>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Crash:
> >>>>>>>>>>>>
> >>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>>>>>> flags: 0x100000000000000()
> >>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>>>>>
> >>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>>>>>
> >>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>>>>>> branch.
> >>>>>>>>>
> >>>>>>>>> Hi Jens,
> >>>>>>>>>
> >>>>>>>>> I saw the following change is added:
> >>>>>>>>>
> >>>>>>>>> + if (size == len) {
> >>>>>>>>> + /*
> >>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>>>>>> + * reference and then not have to put them again when IO
> >>>>>>>>> + * completes. But this breaks some in-kernel users, like
> >>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>>>>>> + * get rid of the get here and the need to call
> >>>>>>>>> + * bio_release_pages() at IO completion time.
> >>>>>>>>> + */
> >>>>>>>>> + get_page(bv->bv_page);
> >>>>>>>>>
> >>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>>>>>> needed:
> >>>>>>>>>
> >>>>>>>>> int i;
> >>>>>>>>> struct bvec_iter_all iter_all;
> >>>>>>>>> struct bio_vec *tmp;
> >>>>>>>>>
> >>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>>>>>       get_page(tmp->bv_page);
> >>>>>>>>
> >>>>>>>> I guess that would be the safest, even if we don't currently have more
> >>>>>>>> than one page in there. I'll fix it up.
> >>>>>>>
> >>>>>>> It is easy to see multipage bvec from loop, :-)
> >>>>>>
> >>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> >>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> >>>>>> the (much) fatter setup around iteration, which is related to this very
> >>>>>> topic too.
> >>>>>>
> >>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> >>>>>> nth_page() is MUCH slower than before.
> >>>>>
> >>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> >>>>> needs that. However, bio_for_each_segment() isn't called from
> >>>>> blk_queue_split() and blk_rq_map_sg().
> >>>>>
> >>>>> One issue is that bio_for_each_bvec() still advances by page size
> >>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> >>>>> for your test.
> >>>>
> >>>> Probably won't make a difference for my test case...
> >>>>
> >>>>>> We need to do something about this, it's like tossing out months of
> >>>>>> optimizations.
> >>>>>
> >>>>> Some following optimization can be done, such as removing
> >>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> >>>>
> >>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> >>>> that it is possible. But iteration startup cost is a problem in a lot of
> >>>> spots, and a split fast path will only help a bit for that specific
> >>>> case.
> >>>>
> >>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> >>>> to really stress how big of a deal that is. It's enough to make me
> >>>> consider just reverting it again, which sucks, but I don't feel great
> >>>> shipping something that is known that much slower.
> >>>>
> >>>> Suggestions?
> >>>
> >>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> >>> shouldn't call into nth_page(). I will look into it first.
> >>
> >> I'll check on the test box tomorrow, I lost connectivity before. I'll
> >> double check in the morning.
> >>
> >> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> >> consumer.
> > 
> > Hi Jens,
> > 
> > Could you test the following patch which may improve on the 4k randio
> > test case?
> 
> A bit, it's up 1% with this patch. I'm going to try without the
> get_page/put_page that we had earlier, to see where we are in regards to
> the old baseline.

OK, today I will test io_uring over null_blk on a real machine and see
if anything can be improved.

Thanks,
Ming
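
For illustration, a minimal sketch of the per-segment reference grab
suggested in the quoted exchange above, assuming the 5.1-era
mp_bvec_for_each_segment()/bvec_iter_all helpers; the wrapper name
bvec_get_pages() is hypothetical:

/*
 * Illustrative sketch only: grab a reference on every page backing a
 * (possibly multipage) bvec, rather than only bv->bv_page. The iterator
 * usage follows the snippet quoted above; the helper name is made up.
 */
#include <linux/bvec.h>
#include <linux/mm.h>

static void bvec_get_pages(const struct bio_vec *bv)
{
	struct bvec_iter_all iter_all;
	struct bio_vec *tmp;
	int i;

	mp_bvec_for_each_segment(tmp, bv, i, iter_all)
		get_page(tmp->bv_page);
}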

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  3:44                                   ` Ming Lei
@ 2019-02-27  4:05                                     ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  4:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 8:44 PM, Ming Lei wrote:
> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
>> On 2/26/19 8:09 PM, Ming Lei wrote:
>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Reproducer:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Crash:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>>>>>
>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>>>>>> branch.
>>>>>>>>>>>
>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>
>>>>>>>>>>> I saw the following change is added:
>>>>>>>>>>>
>>>>>>>>>>> + if (size == len) {
>>>>>>>>>>> + /*
>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>>>>>> + * reference and then not have to put them again when IO
>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>>>>>> + * get rid of the get here and the need to call
>>>>>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>>>>>> + */
>>>>>>>>>>> + get_page(bv->bv_page);
>>>>>>>>>>>
>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>>>>>> needed:
>>>>>>>>>>>
>>>>>>>>>>> int i;
>>>>>>>>>>> struct bvec_iter_all iter_all;
>>>>>>>>>>> struct bio_vec *tmp;
>>>>>>>>>>>
>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>>>>>       get_page(tmp->bv_page);
>>>>>>>>>>
>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>>>>>> than one page in there. I'll fix it up.
>>>>>>>>>
>>>>>>>>> It is easy to see multipage bvec from loop, :-)
>>>>>>>>
>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>>>>>> the (much) fatter setup around iteration, which is related to this very
>>>>>>>> topic too.
>>>>>>>>
>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>>>>>> nth_page() is MUCH slower than before.
>>>>>>>
>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>>>>>> needs that. However, bio_for_each_segment() isn't called from
>>>>>>> blk_queue_split() and blk_rq_map_sg().
>>>>>>>
>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>>>>>> for your test.
>>>>>>
>>>>>> Probably won't make a difference for my test case...
>>>>>>
>>>>>>>> We need to do something about this, it's like tossing out months of
>>>>>>>> optimizations.
>>>>>>>
>>>>>>> Some following optimization can be done, such as removing
>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>>>>>
>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
>>>>>> spots, and a split fast path will only help a bit for that specific
>>>>>> case.
>>>>>>
>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>>>>>> to really stress how big of a deal that is. It's enough to make me
>>>>>> consider just reverting it again, which sucks, but I don't feel great
>>>>>> shipping something that is known that much slower.
>>>>>>
>>>>>> Suggestions?
>>>>>
>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
>>>>> shouldn't call into nth_page(). I will look into it first.
>>>>
>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
>>>> double check in the morning.
>>>>
>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
>>>> consumer.
>>>
>>> Hi Jens,
>>>
>>> Could you test the following patch which may improve on the 4k randio
>>> test case?
>>
>> A bit, it's up 1% with this patch. I'm going to try without the
>> get_page/put_page that we had earlier, to see where we are in regards to
>> the old baseline.
> 
> OK, today I will test io_uring over null_blk on one real machine and see
> if something can be improved.

For reference, I'm running the default t/io_uring from fio, which is
QD=128, fixed files/buffers, and polled. Running it on two devices to
max out the CPU core:

sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1

since nvme1n1 alone tops out at 1164K 4k random reads (hardware limit).
I just tried null_blk, since I hadn't done that before, and I get about
1875K from a single device with the same test case. Using two devices
yields the same result, so we're CPU core bound at that point. We don't
get the sg walk with null_blk, but I do see about 4% of the time in
blk_queue_split() there.
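
The blk_queue_split() time ties back to the iteration discussion above.
As a rough sketch of the difference, assuming the 5.1-era bvec_iter
macros and with illustrative helper names: bio_for_each_segment() visits
one bio_vec per PAGE_SIZE chunk, while bio_for_each_bvec() visits whole
multipage bvecs, which is what the split and sg-mapping paths want to
keep cheap.

/*
 * Rough sketch, assuming the 5.1-era bvec_iter macros: count what each
 * iterator visits. bio_for_each_segment() yields one bio_vec per
 * PAGE_SIZE chunk; bio_for_each_bvec() yields one per (multipage) bvec.
 * The helper names are illustrative only.
 */
#include <linux/bio.h>

static unsigned int count_page_segments(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int nr = 0;

	bio_for_each_segment(bv, bio, iter)
		nr++;
	return nr;
}

static unsigned int count_multipage_bvecs(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned int nr = 0;

	bio_for_each_bvec(bv, bio, iter)
		nr++;
	return nr;
}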

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  4:05                                     ` Jens Axboe
@ 2019-02-27  4:06                                       ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-27  4:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On 2/26/19 9:05 PM, Jens Axboe wrote:
> On 2/26/19 8:44 PM, Ming Lei wrote:
>> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
>>> On 2/26/19 8:09 PM, Ming Lei wrote:
>>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
>>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
>>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
>>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
>>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
>>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
>>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
>>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
>>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
>>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
>>>>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
>>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
>>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
>>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
>>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
>>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
>>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
>>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
>>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
>>>>>>>>>>>>>>>>>> --- a/block/bio.c
>>>>>>>>>>>>>>>>>> +++ b/block/bio.c
>>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
>>>>>>>>>>>>>>>>>> + unsigned int len;
>>>>>>>>>>>>>>>>>> + size_t size;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
>>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
>>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
>>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
>>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks, I folded this in.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jens Axboe
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
>>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
>>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
>>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reproducer:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #define _GNU_SOURCE
>>>>>>>>>>>>>>> #include <fcntl.h>
>>>>>>>>>>>>>>> #include <linux/loop.h>
>>>>>>>>>>>>>>> #include <sys/ioctl.h>
>>>>>>>>>>>>>>> #include <sys/sendfile.h>
>>>>>>>>>>>>>>> #include <sys/syscall.h>
>>>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>         int memfd, loopfd;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Crash:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
>>>>>>>>>>>>>>> flags: 0x100000000000000()
>>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
>>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
>>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
>>>>>>>>>>>>> branch.
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Jens,
>>>>>>>>>>>>
>>>>>>>>>>>> I saw the following change is added:
>>>>>>>>>>>>
>>>>>>>>>>>> + if (size == len) {
>>>>>>>>>>>> + /*
>>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
>>>>>>>>>>>> + * reference and then not have to put them again when IO
>>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
>>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
>>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
>>>>>>>>>>>> + * get rid of the get here and the need to call
>>>>>>>>>>>> + * bio_release_pages() at IO completion time.
>>>>>>>>>>>> + */
>>>>>>>>>>>> + get_page(bv->bv_page);
>>>>>>>>>>>>
>>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
>>>>>>>>>>>> needed:
>>>>>>>>>>>>
>>>>>>>>>>>> int i;
>>>>>>>>>>>> struct bvec_iter_all iter_all;
>>>>>>>>>>>> struct bio_vec *tmp;
>>>>>>>>>>>>
>>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
>>>>>>>>>>>>       get_page(tmp->bv_page);
>>>>>>>>>>>
>>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
>>>>>>>>>>> than one page in there. I'll fix it up.
>>>>>>>>>>
>>>>>>>>>> It is easy to see multipage bvec from loop, :-)
>>>>>>>>>
>>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
>>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
>>>>>>>>> the (much) fatter setup around iteration, which is related to this very
>>>>>>>>> topic too.
>>>>>>>>>
>>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
>>>>>>>>> nth_page() is MUCH slower than before.
>>>>>>>>
>>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
>>>>>>>> needs that. However, bio_for_each_segment() isn't called from
>>>>>>>> blk_queue_split() and blk_rq_map_sg().
>>>>>>>>
>>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
>>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
>>>>>>>> for your test.
>>>>>>>
>>>>>>> Probably won't make a difference for my test case...
>>>>>>>
>>>>>>>>> We need to do something about this, it's like tossing out months of
>>>>>>>>> optimizations.
>>>>>>>>
>>>>>>>> Some following optimization can be done, such as removing
>>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
>>>>>>>
>>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
>>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
>>>>>>> spots, and a split fast path will only help a bit for that specific
>>>>>>> case.
>>>>>>>
>>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
>>>>>>> to really stress how big of a deal that is. It's enough to make me
>>>>>>> consider just reverting it again, which sucks, but I don't feel great
>>>>>>> shipping something that is known that much slower.
>>>>>>>
>>>>>>> Suggestions?
>>>>>>
>>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
>>>>>> shouldn't call into nth_page(). I will look into it first.
>>>>>
>>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
>>>>> double check in the morning.
>>>>>
>>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
>>>>> consumer.
>>>>
>>>> Hi Jens,
>>>>
>>>> Could you test the following patch which may improve on the 4k randio
>>>> test case?
>>>
>>> A bit, it's up 1% with this patch. I'm going to try without the
>>> get_page/put_page that we had earlier, to see where we are in regards to
>>> the old baseline.
>>
>> OK, today I will test io_uring over null_blk on one real machine and see
>> if something can be improved.
> 
> For reference, I'm running the default t/io_uring from fio, which is
> QD=128, fixed files/buffers, and polled. Running it on two devices to
> max out the CPU core:
> 
> sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1

Forgot to mention: this is with nvme loaded with 12 poll queues, which
is of course important for good performance in this test case.
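
For context on the polled side of the test, a minimal sketch of creating
an I/O-polled ring with the UAPI from this series follows. It assumes
the uapi header is installed as <linux/io_uring.h> and that
__NR_io_uring_setup is defined for this kernel; error handling is
omitted, and this is not the actual t/io_uring code.

/*
 * Minimal sketch, not the t/io_uring source: create a ring with
 * IORING_SETUP_IOPOLL, the mode the polled runs above rely on (paired
 * with the nvme poll queues). Assumes <linux/io_uring.h> from this
 * series and a __NR_io_uring_setup definition; no error handling.
 */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int setup_polled_ring(unsigned int entries)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_IOPOLL;

	return syscall(__NR_io_uring_setup, entries, &p);
}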

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  4:06                                       ` Jens Axboe
@ 2019-02-27 19:42                                         ` Christoph Hellwig
  -1 siblings, 0 replies; 128+ messages in thread
From: Christoph Hellwig @ 2019-02-27 19:42 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Ming Lei, Eric Biggers, open list:AIO, linux-block,
	linux-api, Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh,
	Al Viro

On Tue, Feb 26, 2019 at 09:06:23PM -0700, Jens Axboe wrote:
> On 2/26/19 9:05 PM, Jens Axboe wrote:
> > On 2/26/19 8:44 PM, Ming Lei wrote:
> >> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> >>> On 2/26/19 8:09 PM, Ming Lei wrote:
> >>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> >>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
> >>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> >>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> >>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>>>>>>>>> +{
> >>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>>>>>>>>> + size_t size;
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>> Jens Axboe
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Reproducer:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> #define _GNU_SOURCE
> >>>>>>>>>>>>>>> #include <fcntl.h>
> >>>>>>>>>>>>>>> #include <linux/loop.h>
> >>>>>>>>>>>>>>> #include <sys/ioctl.h>
> >>>>>>>>>>>>>>> #include <sys/sendfile.h>
> >>>>>>>>>>>>>>> #include <sys/syscall.h>
> >>>>>>>>>>>>>>> #include <unistd.h>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> int main(void)
> >>>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>>         int memfd, loopfd;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Crash:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>>>>>>>>> flags: 0x100000000000000()
> >>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>>>>>>>>> branch.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Jens,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I saw the following change is added:
> >>>>>>>>>>>>
> >>>>>>>>>>>> + if (size == len) {
> >>>>>>>>>>>> + /*
> >>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>>>>>>>>> + * reference and then not have to put them again when IO
> >>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
> >>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>>>>>>>>> + * get rid of the get here and the need to call
> >>>>>>>>>>>> + * bio_release_pages() at IO completion time.
> >>>>>>>>>>>> + */
> >>>>>>>>>>>> + get_page(bv->bv_page);
> >>>>>>>>>>>>
> >>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>>>>>>>>> needed:
> >>>>>>>>>>>>
> >>>>>>>>>>>> int i;
> >>>>>>>>>>>> struct bvec_iter_all iter_all;
> >>>>>>>>>>>> struct bio_vec *tmp;
> >>>>>>>>>>>>
> >>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>>>>>>>>       get_page(tmp->bv_page);
> >>>>>>>>>>>
> >>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
> >>>>>>>>>>> than one page in there. I'll fix it up.
> >>>>>>>>>>
> >>>>>>>>>> It is easy to see multipage bvec from loop, :-)
> >>>>>>>>>
> >>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> >>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> >>>>>>>>> the (much) fatter setup around iteration, which is related to this very
> >>>>>>>>> topic too.
> >>>>>>>>>
> >>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> >>>>>>>>> nth_page() is MUCH slower than before.
> >>>>>>>>
> >>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> >>>>>>>> needs that. However, bio_for_each_segment() isn't called from
> >>>>>>>> blk_queue_split() and blk_rq_map_sg().
> >>>>>>>>
> >>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
> >>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> >>>>>>>> for your test.
> >>>>>>>
> >>>>>>> Probably won't make a difference for my test case...
> >>>>>>>
> >>>>>>>>> We need to do something about this, it's like tossing out months of
> >>>>>>>>> optimizations.
> >>>>>>>>
> >>>>>>>> Some following optimization can be done, such as removing
> >>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> >>>>>>>
> >>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> >>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
> >>>>>>> spots, and a split fast path will only help a bit for that specific
> >>>>>>> case.
> >>>>>>>
> >>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> >>>>>>> to really stress how big of a deal that is. It's enough to make me
> >>>>>>> consider just reverting it again, which sucks, but I don't feel great
> >>>>>>> shipping something that is known that much slower.
> >>>>>>>
> >>>>>>> Suggestions?
> >>>>>>
> >>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> >>>>>> shouldn't call into nth_page(). I will look into it first.
> >>>>>
> >>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
> >>>>> double check in the morning.
> >>>>>
> >>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> >>>>> consumer.
> >>>>
> >>>> Hi Jens,
> >>>>
> >>>> Could you test the following patch which may improve on the 4k randio
> >>>> test case?
> >>>
> >>> A bit, it's up 1% with this patch. I'm going to try without the
> >>> get_page/put_page that we had earlier, to see where we are in regards to
> >>> the old baseline.
> >>
> >> OK, today I will test io_uring over null_blk on one real machine and see
> >> if something can be improved.
> > 
> > For reference, I'm running the default t/io_uring from fio, which is
> > QD=128, fixed files/buffers, and polled. Running it on two devices to
> > max out the CPU core:
> > 
> > sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1
> 
> Forgot to mention, this is loading nvme with 12 poll queues, which is of
> course important to get good performance on this test case.

Btw, is your nvme device SGL capable?  There is some low hanging fruit
in that IFF a device has SGL support we can basically dumb down
blk_mq_map_sg to never split in this case ever because we don't have
any segment size limits.

PRPs only, unfortunately, are a little dumb and could lead to all kinds
of wacky splitting.
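
To make that concrete, something along these lines could work (a sketch
only, not a tested patch: it assumes the 5.0-era driver-internal struct
nvme_ctrl from drivers/nvme/host/nvme.h, whose ->sgls caches the Identify
Controller SGLS field, plus the stock blk_queue_* limit helpers):

#include <linux/blkdev.h>

/*
 * Sketch: relax the per-segment size limit for SGL-capable controllers,
 * so the split path has one less reason to carve up a bio.  Bits 0-1 of
 * ctrl->sgls indicate SGL support in this era's driver.
 */
static void nvme_sketch_relax_seg_limit(struct nvme_ctrl *ctrl,
					struct request_queue *q)
{
	if (ctrl->sgls & 0x3)
		blk_queue_max_segment_size(q, UINT_MAX);
}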

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  2:28                         ` Jens Axboe
@ 2019-02-27 23:35                           ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-27 23:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Eric Biggers, open list:AIO, linux-block, linux-api,
	Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> On 2/26/19 7:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>> Hi Jens,
> >>>>>>>>
> >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>
> >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>
> >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>  }
> >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>
> >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>> + size_t size;
> >>>>>>>>>>> +
> >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>
> >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>
> >>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Jens Axboe
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>
> >>>>>>>> Reproducer:
> >>>>>>>>
> >>>>>>>> #define _GNU_SOURCE
> >>>>>>>> #include <fcntl.h>
> >>>>>>>> #include <linux/loop.h>
> >>>>>>>> #include <sys/ioctl.h>
> >>>>>>>> #include <sys/sendfile.h>
> >>>>>>>> #include <sys/syscall.h>
> >>>>>>>> #include <unistd.h>
> >>>>>>>>
> >>>>>>>> int main(void)
> >>>>>>>> {
> >>>>>>>>         int memfd, loopfd;
> >>>>>>>>
> >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>
> >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>
> >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>
> >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>
> >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Crash:
> >>>>>>>>
> >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>> flags: 0x100000000000000()
> >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>
> >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>
> >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>> branch.
> >>>>>
> >>>>> Hi Jens,
> >>>>>
> >>>>> I saw the following change is added:
> >>>>>
> >>>>> + if (size == len) {
> >>>>> + /*
> >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>> + * reference and then not have to put them again when IO
> >>>>> + * completes. But this breaks some in-kernel users, like
> >>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>> + * get rid of the get here and the need to call
> >>>>> + * bio_release_pages() at IO completion time.
> >>>>> + */
> >>>>> + get_page(bv->bv_page);
> >>>>>
> >>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>> needed:
> >>>>>
> >>>>> int i;
> >>>>> struct bvec_iter_all iter_all;
> >>>>> struct bio_vec *tmp;
> >>>>>
> >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>       get_page(tmp->bv_page);
> >>>>
> >>>> I guess that would be the safest, even if we don't currently have more
> >>>> than one page in there. I'll fix it up.
> >>>
> >>> It is easy to see multipage bvec from loop, :-)
> >>
> >> Speaking of this, I took a quick look at why we've now regressed a lot
> >> on IOPS perf with the multipage work. It looks like it's all related to
> >> the (much) fatter setup around iteration, which is related to this very
> >> topic too.
> >>
> >> Basically setup of things like bio_for_each_bvec() and indexing through
> >> nth_page() is MUCH slower than before.
> > 
> > But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> > needs that. However, bio_for_each_segment() isn't called from
> > blk_queue_split() and blk_rq_map_sg().
> > 
> > One issue is that bio_for_each_bvec() still advances by page size
> > instead of bvec->len, I guess that is the problem, will cook a patch
> > for your test.
> 
> Probably won't make a difference for my test case...

The thing is that bvec_iter_len() has become much slower than before;
I will work up a patch for you soon.
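
Roughly, the difference looks something like the sketch below (a
simplified approximation of the 5.0-era bvec helpers, not the exact
kernel macros): the single-page iterator length was one min(), while the
segment-granular variant also has to clamp to the current page inside a
multipage bvec, and that extra arithmetic runs in hot paths such as
blk_rq_map_sg().

#include <linux/bvec.h>
#include <linux/mm.h>

/* old single-page bvec: one min() */
static inline unsigned int sketch_iter_len_page(const struct bio_vec *bv,
						struct bvec_iter iter)
{
	return min(bv[iter.bi_idx].bv_len - iter.bi_bvec_done, iter.bi_size);
}

/* multipage bvec: extra clamp so a segment never crosses a page boundary */
static inline unsigned int sketch_iter_len_segment(const struct bio_vec *bv,
						   struct bvec_iter iter)
{
	unsigned int off = (bv[iter.bi_idx].bv_offset + iter.bi_bvec_done) &
			   (PAGE_SIZE - 1);
	unsigned int len = min(bv[iter.bi_idx].bv_len - iter.bi_bvec_done,
			       iter.bi_size);

	return min_t(unsigned int, len, PAGE_SIZE - off);
}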

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27 19:42                                         ` Christoph Hellwig
@ 2019-02-28  8:37                                           ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-28  8:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ming Lei, Eric Biggers, open list:AIO, linux-block,
	linux-api, Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh,
	Al Viro

On Wed, Feb 27, 2019 at 11:42:41AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 26, 2019 at 09:06:23PM -0700, Jens Axboe wrote:
> > On 2/26/19 9:05 PM, Jens Axboe wrote:
> > > On 2/26/19 8:44 PM, Ming Lei wrote:
> > >> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> > >>> On 2/26/19 8:09 PM, Ming Lei wrote:
> > >>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> > >>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
> > >>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> > >>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> > >>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> > >>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> > >>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> > >>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> > >>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> > >>>>>>>>>>>>>>> Hi Jens,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> > >>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> > >>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> > >>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> > >>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> > >>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> > >>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> > >>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> > >>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >>>>>>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> > >>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> > >>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> > >>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> > >>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> > >>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> > >>>>>>>>>>>>>>>>>> --- a/block/bio.c
> > >>>>>>>>>>>>>>>>>> +++ b/block/bio.c
> > >>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> > >>>>>>>>>>>>>>>>>>  }
> > >>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> > >>>>>>>>>>>>>>>>>> +{
> > >>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> > >>>>>>>>>>>>>>>>>> + unsigned int len;
> > >>>>>>>>>>>>>>>>>> + size_t size;
> > >>>>>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> > >>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> > >>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> > >>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> > >>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks, I folded this in.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>>> Jens Axboe
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> > >>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> > >>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> > >>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Reproducer:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> #define _GNU_SOURCE
> > >>>>>>>>>>>>>>> #include <fcntl.h>
> > >>>>>>>>>>>>>>> #include <linux/loop.h>
> > >>>>>>>>>>>>>>> #include <sys/ioctl.h>
> > >>>>>>>>>>>>>>> #include <sys/sendfile.h>
> > >>>>>>>>>>>>>>> #include <sys/syscall.h>
> > >>>>>>>>>>>>>>> #include <unistd.h>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> int main(void)
> > >>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>         int memfd, loopfd;
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Crash:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> > >>>>>>>>>>>>>>> flags: 0x100000000000000()
> > >>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> > >>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> > >>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> > >>>>>>>>>>>>> branch.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Jens,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I saw the following change is added:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> + if (size == len) {
> > >>>>>>>>>>>> + /*
> > >>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> > >>>>>>>>>>>> + * reference and then not have to put them again when IO
> > >>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
> > >>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> > >>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> > >>>>>>>>>>>> + * get rid of the get here and the need to call
> > >>>>>>>>>>>> + * bio_release_pages() at IO completion time.
> > >>>>>>>>>>>> + */
> > >>>>>>>>>>>> + get_page(bv->bv_page);
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> > >>>>>>>>>>>> needed:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> int i;
> > >>>>>>>>>>>> struct bvec_iter_all iter_all;
> > >>>>>>>>>>>> struct bio_vec *tmp;
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> > >>>>>>>>>>>>       get_page(tmp->bv_page);
> > >>>>>>>>>>>
> > >>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
> > >>>>>>>>>>> than one page in there. I'll fix it up.
> > >>>>>>>>>>
> > >>>>>>>>>> It is easy to see multipage bvec from loop, :-)
> > >>>>>>>>>
> > >>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> > >>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> > >>>>>>>>> the (much) fatter setup around iteration, which is related to this very
> > >>>>>>>>> topic too.
> > >>>>>>>>>
> > >>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> > >>>>>>>>> nth_page() is MUCH slower than before.
> > >>>>>>>>
> > >>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> > >>>>>>>> needs that. However, bio_for_each_segment() isn't called from
> > >>>>>>>> blk_queue_split() and blk_rq_map_sg().
> > >>>>>>>>
> > >>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
> > >>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> > >>>>>>>> for your test.
> > >>>>>>>
> > >>>>>>> Probably won't make a difference for my test case...
> > >>>>>>>
> > >>>>>>>>> We need to do something about this, it's like tossing out months of
> > >>>>>>>>> optimizations.
> > >>>>>>>>
> > >>>>>>>> Some following optimization can be done, such as removing
> > >>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> > >>>>>>>
> > >>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> > >>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
> > >>>>>>> spots, and a split fast path will only help a bit for that specific
> > >>>>>>> case.
> > >>>>>>>
> > >>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> > >>>>>>> to really stress how big of a deal that is. It's enough to make me
> > >>>>>>> consider just reverting it again, which sucks, but I don't feel great
> > >>>>>>> shipping something that is known that much slower.
> > >>>>>>>
> > >>>>>>> Suggestions?
> > >>>>>>
> > >>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> > >>>>>> shouldn't call into nth_page(). I will look into it first.
> > >>>>>
> > >>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
> > >>>>> double check in the morning.
> > >>>>>
> > >>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> > >>>>> consumer.
> > >>>>
> > >>>> Hi Jens,
> > >>>>
> > >>>> Could you test the following patch which may improve on the 4k randio
> > >>>> test case?
> > >>>
> > >>> A bit, it's up 1% with this patch. I'm going to try without the
> > >>> get_page/put_page that we had earlier, to see where we are in regards to
> > >>> the old baseline.
> > >>
> > >> OK, today I will test io_uring over null_blk on one real machine and see
> > >> if something can be improved.
> > > 
> > > For reference, I'm running the default t/io_uring from fio, which is
> > > QD=128, fixed files/buffers, and polled. Running it on two devices to
> > > max out the CPU core:
> > > 
> > > sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1
> > 
> > Forgot to mention, this is loading nvme with 12 poll queues, which is of
> > course important to get good performance on this test case.
> 
> Btw, is your nvme device SGL capable?  There is some low hanging fruit
> in that IFF a device has SGL support we can basically dumb down
> blk_mq_map_sg to never split in this case ever because we don't have
> any segment size limits.

Indeed.

In the case of SGL, a big sg list may not be needed, and blk_rq_map_sg()
could be skipped entirely if a proper DMA mapping interface returned the
DMA address for each segment. That could be one big improvement.
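
As a rough illustration only (not a proposal: append_sgl_entry() is a
hypothetical per-driver hook, error unwinding of already-mapped entries
is omitted, and it assumes an SGL-capable device with no segment size
limit), the request's multipage bvecs could be walked and DMA-mapped
directly instead of first being flattened into a scatterlist:

#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/dma-mapping.h>

/* Sketch: map each multipage bvec and hand it to the device's SGL builder. */
static int sketch_map_request(struct device *dev, struct request *rq,
			      enum dma_data_direction dir)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	struct bio *bio;

	__rq_for_each_bio(bio, rq) {
		bio_for_each_bvec(bv, bio, iter) {
			dma_addr_t addr = dma_map_page(dev, bv.bv_page,
						       bv.bv_offset,
						       bv.bv_len, dir);
			if (dma_mapping_error(dev, addr))
				return -ENOMEM;
			append_sgl_entry(rq, addr, bv.bv_len); /* hypothetical */
		}
	}
	return 0;
}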

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
@ 2019-02-28  8:37                                           ` Ming Lei
  0 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-02-28  8:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ming Lei, Eric Biggers, open list:AIO, linux-block,
	linux-api, Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh,
	Al Viro

On Wed, Feb 27, 2019 at 11:42:41AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 26, 2019 at 09:06:23PM -0700, Jens Axboe wrote:
> > On 2/26/19 9:05 PM, Jens Axboe wrote:
> > > On 2/26/19 8:44 PM, Ming Lei wrote:
> > >> On Tue, Feb 26, 2019 at 08:37:05PM -0700, Jens Axboe wrote:
> > >>> On 2/26/19 8:09 PM, Ming Lei wrote:
> > >>>> On Tue, Feb 26, 2019 at 07:43:32PM -0700, Jens Axboe wrote:
> > >>>>> On 2/26/19 7:37 PM, Ming Lei wrote:
> > >>>>>> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> > >>>>>>> On 2/26/19 7:21 PM, Ming Lei wrote:
> > >>>>>>>> On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> > >>>>>>>>> On 2/26/19 6:53 PM, Ming Lei wrote:
> > >>>>>>>>>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> > >>>>>>>>>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> > >>>>>>>>>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> > >>>>>>>>>>>>>>> Hi Jens,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> > >>>>>>>>>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> > >>>>>>>>>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> > >>>>>>>>>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> > >>>>>>>>>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> > >>>>>>>>>>>>>>>>>> check if they need to release pages on completion. This makes them
> > >>>>>>>>>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> > >>>>>>>>>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> > >>>>>>>>>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >>>>>>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> > >>>>>>>>>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> > >>>>>>>>>>>>>>>>>>  fs/iomap.c                |  5 ++--
> > >>>>>>>>>>>>>>>>>>  include/linux/blk_types.h |  1 +
> > >>>>>>>>>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> > >>>>>>>>>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> > >>>>>>>>>>>>>>>>>> --- a/block/bio.c
> > >>>>>>>>>>>>>>>>>> +++ b/block/bio.c
> > >>>>>>>>>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> > >>>>>>>>>>>>>>>>>>  }
> > >>>>>>>>>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> > >>>>>>>>>>>>>>>>>> +{
> > >>>>>>>>>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> > >>>>>>>>>>>>>>>>>> + unsigned int len;
> > >>>>>>>>>>>>>>>>>> + size_t size;
> > >>>>>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> > >>>>>>>>>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> > >>>>>>>>>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> > >>>>>>>>>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> > >>>>>>>>>>>>>>>>> can be observed when running xfstests over loop/dio.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks, I folded this in.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> --
> > >>>>>>>>>>>>>>>> Jens Axboe
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> > >>>>>>>>>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> > >>>>>>>>>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> > >>>>>>>>>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Reproducer:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> #define _GNU_SOURCE
> > >>>>>>>>>>>>>>> #include <fcntl.h>
> > >>>>>>>>>>>>>>> #include <linux/loop.h>
> > >>>>>>>>>>>>>>> #include <sys/ioctl.h>
> > >>>>>>>>>>>>>>> #include <sys/sendfile.h>
> > >>>>>>>>>>>>>>> #include <sys/syscall.h>
> > >>>>>>>>>>>>>>> #include <unistd.h>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> int main(void)
> > >>>>>>>>>>>>>>> {
> > >>>>>>>>>>>>>>>         int memfd, loopfd;
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Crash:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> > >>>>>>>>>>>>>>> flags: 0x100000000000000()
> > >>>>>>>>>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> > >>>>>>>>>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> > >>>>>>>>>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> > >>>>>>>>>>>>> branch.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hi Jens,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I saw the following change is added:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> + if (size == len) {
> > >>>>>>>>>>>> + /*
> > >>>>>>>>>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> > >>>>>>>>>>>> + * reference and then not have to put them again when IO
> > >>>>>>>>>>>> + * completes. But this breaks some in-kernel users, like
> > >>>>>>>>>>>> + * splicing to/from a loop device, where we release the pipe
> > >>>>>>>>>>>> + * pages unconditionally. If we can fix that case, we can
> > >>>>>>>>>>>> + * get rid of the get here and the need to call
> > >>>>>>>>>>>> + * bio_release_pages() at IO completion time.
> > >>>>>>>>>>>> + */
> > >>>>>>>>>>>> + get_page(bv->bv_page);
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Now the 'bv' may point to more than one page, so the following one may be
> > >>>>>>>>>>>> needed:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> int i;
> > >>>>>>>>>>>> struct bvec_iter_all iter_all;
> > >>>>>>>>>>>> struct bio_vec *tmp;
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> > >>>>>>>>>>>>       get_page(tmp->bv_page);
> > >>>>>>>>>>>
> > >>>>>>>>>>> I guess that would be the safest, even if we don't currently have more
> > >>>>>>>>>>> than one page in there. I'll fix it up.
> > >>>>>>>>>>
> > >>>>>>>>>> It is easy to see multipage bvec from loop, :-)
> > >>>>>>>>>
> > >>>>>>>>> Speaking of this, I took a quick look at why we've now regressed a lot
> > >>>>>>>>> on IOPS perf with the multipage work. It looks like it's all related to
> > >>>>>>>>> the (much) fatter setup around iteration, which is related to this very
> > >>>>>>>>> topic too.
> > >>>>>>>>>
> > >>>>>>>>> Basically setup of things like bio_for_each_bvec() and indexing through
> > >>>>>>>>> nth_page() is MUCH slower than before.
> > >>>>>>>>
> > >>>>>>>> But bio_for_each_bvec() needn't nth_page(), and only bio_for_each_segment()
> > >>>>>>>> needs that. However, bio_for_each_segment() isn't called from
> > >>>>>>>> blk_queue_split() and blk_rq_map_sg().
> > >>>>>>>>
> > >>>>>>>> One issue is that bio_for_each_bvec() still advances by page size
> > >>>>>>>> instead of bvec->len, I guess that is the problem, will cook a patch
> > >>>>>>>> for your test.
> > >>>>>>>
> > >>>>>>> Probably won't make a difference for my test case...
> > >>>>>>>
> > >>>>>>>>> We need to do something about this, it's like tossing out months of
> > >>>>>>>>> optimizations.
> > >>>>>>>>
> > >>>>>>>> Some following optimization can be done, such as removing
> > >>>>>>>> biovec_phys_mergeable() from blk_bio_segment_split().
> > >>>>>>>
> > >>>>>>> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> > >>>>>>> that it is possible. But iteration startup cost is a problem in a lot of
> > >>>>>>> spots, and a split fast path will only help a bit for that specific
> > >>>>>>> case.
> > >>>>>>>
> > >>>>>>> 5% regressions is HUGE. I know I've mentioned this before, I just want
> > >>>>>>> to really stress how big of a deal that is. It's enough to make me
> > >>>>>>> consider just reverting it again, which sucks, but I don't feel great
> > >>>>>>> shipping something that is known that much slower.
> > >>>>>>>
> > >>>>>>> Suggestions?
> > >>>>>>
> > >>>>>> You mentioned nth_page() costs much in bio_for_each_bvec(), but which
> > >>>>>> shouldn't call into nth_page(). I will look into it first.
> > >>>>>
> > >>>>> I'll check on the test box tomorrow, I lost connectivity before. I'll
> > >>>>> double check in the morning.
> > >>>>>
> > >>>>> I'd focus on the blk_rq_map_sg() path, since that's the biggest cycle
> > >>>>> consumer.
> > >>>>
> > >>>> Hi Jens,
> > >>>>
> > >>>> Could you test the following patch which may improve on the 4k randio
> > >>>> test case?
> > >>>
> > >>> A bit, it's up 1% with this patch. I'm going to try without the
> > >>> get_page/put_page that we had earlier, to see where we are in regards to
> > >>> the old baseline.
> > >>
> > >> OK, today I will test io_uring over null_blk on one real machine and see
> > >> if something can be improved.
> > > 
> > > For reference, I'm running the default t/io_uring from fio, which is
> > > QD=128, fixed files/buffers, and polled. Running it on two devices to
> > > max out the CPU core:
> > > 
> > > sudo taskset -c 0 t/io_uring /dev/nvme1n1 /dev/nvme5n1
> > 
> > Forgot to mention, this is loading nvme with 12 poll queues, which is of
> > course important to get good performance on this test case.
> 
> Btw, is your nvme device SGL capable?  There is some low hanging fruit
> in that IFF a device has SGL support we can basically dumb down
> blk_mq_map_sg to never split in this case ever because we don't have
> any segment size limits.

Indeed.

In the SGL case, a big sg list may not be needed at all, and blk_rq_map_sg()
could be skipped if a proper DMA mapping interface returned the DMA address
for each segment. That could be one big improvement.
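
To make that concrete, here is a rough sketch of what such a path could look
like (the function name is made up, error unwinding and the driver-specific
SGL descriptor setup are omitted; this is not code from any posted series):
walk the request's multi-page bvecs and hand each one straight to the DMA
mapping API, so no scatterlist is built at all for an SGL-capable controller:

/*
 * Rough sketch only: map each multi-page bvec of the request straight
 * to a DMA address and hand (addr, len) to the controller's SGL
 * descriptors, so blk_rq_map_sg() is never called.
 */
static int sketch_map_bvecs_to_sgl(struct device *dev, struct request *rq)
{
        enum dma_data_direction dir = rq_data_dir(rq) == WRITE ?
                                      DMA_TO_DEVICE : DMA_FROM_DEVICE;
        struct bvec_iter iter;
        struct bio_vec bv;
        struct bio *bio;
        int nents = 0;

        for (bio = rq->bio; bio; bio = bio->bi_next) {
                bio_for_each_bvec(bv, bio, iter) {
                        dma_addr_t addr = dma_map_page(dev, bv.bv_page,
                                                       bv.bv_offset,
                                                       bv.bv_len, dir);

                        if (dma_mapping_error(dev, addr))
                                return -ENOMEM;
                        /* fill the next SGL descriptor with (addr, bv.bv_len) */
                        nents++;
                }
        }
        return nents;
}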

Thanks,
Ming


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  2:28                         ` Jens Axboe
@ 2019-03-08  7:55                           ` Christoph Hellwig
  -1 siblings, 0 replies; 128+ messages in thread
From: Christoph Hellwig @ 2019-03-08  7:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Ming Lei, Eric Biggers, open list:AIO, linux-block,
	linux-api, Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh,
	Al Viro

On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> On 2/26/19 7:21 PM, Ming Lei wrote:
> > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> >> On 2/26/19 6:53 PM, Ming Lei wrote:
> >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>>
> >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> >>>>>>>> Hi Jens,
> >>>>>>>>
> >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> >>>>>>>>>>>
> >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> >>>>>>>>>>> check if they need to release pages on completion. This makes them
> >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> >>>>>>>>>>>
> >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> >>>>>>>>>>> --- a/block/bio.c
> >>>>>>>>>>> +++ b/block/bio.c
> >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> >>>>>>>>>>>  }
> >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> >>>>>>>>>>>
> >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> >>>>>>>>>>> + unsigned int len;
> >>>>>>>>>>> + size_t size;
> >>>>>>>>>>> +
> >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> >>>>>>>>>>
> >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> >>>>>>>>>> can be observed when running xfstests over loop/dio.
> >>>>>>>>>
> >>>>>>>>> Thanks, I folded this in.
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Jens Axboe
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> >>>>>>>>
> >>>>>>>> Reproducer:
> >>>>>>>>
> >>>>>>>> #define _GNU_SOURCE
> >>>>>>>> #include <fcntl.h>
> >>>>>>>> #include <linux/loop.h>
> >>>>>>>> #include <sys/ioctl.h>
> >>>>>>>> #include <sys/sendfile.h>
> >>>>>>>> #include <sys/syscall.h>
> >>>>>>>> #include <unistd.h>
> >>>>>>>>
> >>>>>>>> int main(void)
> >>>>>>>> {
> >>>>>>>>         int memfd, loopfd;
> >>>>>>>>
> >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> >>>>>>>>
> >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> >>>>>>>>
> >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> >>>>>>>>
> >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> >>>>>>>>
> >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Crash:
> >>>>>>>>
> >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> >>>>>>>> flags: 0x100000000000000()
> >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> >>>>>>>
> >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> >>>>>>
> >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> >>>>>> branch.
> >>>>>
> >>>>> Hi Jens,
> >>>>>
> >>>>> I saw the following change is added:
> >>>>>
> >>>>> + if (size == len) {
> >>>>> + /*
> >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> >>>>> + * reference and then not have to put them again when IO
> >>>>> + * completes. But this breaks some in-kernel users, like
> >>>>> + * splicing to/from a loop device, where we release the pipe
> >>>>> + * pages unconditionally. If we can fix that case, we can
> >>>>> + * get rid of the get here and the need to call
> >>>>> + * bio_release_pages() at IO completion time.
> >>>>> + */
> >>>>> + get_page(bv->bv_page);
> >>>>>
> >>>>> Now the 'bv' may point to more than one page, so the following one may be
> >>>>> needed:
> >>>>>
> >>>>> int i;
> >>>>> struct bvec_iter_all iter_all;
> >>>>> struct bio_vec *tmp;
> >>>>>
> >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> >>>>>       get_page(tmp->bv_page);
> >>>>
> > Some following optimization can be done, such as removing
> > biovec_phys_mergeable() from blk_bio_segment_split().
> 
> I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> that it is possible. But iteration startup cost is a problem in a lot of
> spots, and a split fast path will only help a bit for that specific
> case.

FYI, I've got a nice fast path for the driver side in nvme here, but
I'll need to do some more testing before submitting it:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-optimize-single-segment-io

But in the block layer I think one major issue is all the phys_segments
crap.  What we really should do is remove bi_phys_segments and all
the front/back segment crap and only do the calculation of the actual
per-bio segments once, just before adding the bio to the request.

And don't bother with it at all unless the driver has weird segment
size or boundary limitations.
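
Roughly what I have in mind, as a sketch only (the function name is made up
and boundary-mask handling is ignored; this is not from any posted patch):
compute the segment count for a bio in one pass over its multi-page bvecs,
and only when the queue actually has a segment size limit:

/*
 * Sketch only: count hardware segments for one bio in a single pass,
 * splitting a bvec only where the queue's max segment size forces it.
 * A queue without such limits could skip this entirely and count one
 * segment per bvec.
 */
static unsigned int sketch_bio_hw_segments(struct request_queue *q,
                                           struct bio *bio)
{
        unsigned int max_seg = queue_max_segment_size(q);
        struct bvec_iter iter;
        struct bio_vec bv;
        unsigned int segs = 0;

        bio_for_each_bvec(bv, bio, iter)
                segs += DIV_ROUND_UP(bv.bv_len, max_seg);

        return segs;
}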

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-02-27  1:57                     ` Jens Axboe
@ 2019-03-08  8:18                       ` Christoph Hellwig
  -1 siblings, 0 replies; 128+ messages in thread
From: Christoph Hellwig @ 2019-03-08  8:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ming Lei, Ming Lei, Eric Biggers, open list:AIO, linux-block,
	linux-api, Christoph Hellwig, Jeff Moyer, Avi Kivity, jannh,
	Al Viro

On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> Speaking of this, I took a quick look at why we've now regressed a lot
> on IOPS perf with the multipage work. It looks like it's all related to
> the (much) fatter setup around iteration, which is related to this very
> topic too.

> Basically setup of things like bio_for_each_bvec() and indexing through
> nth_page() is MUCH slower than before.

I haven't quite figured out what the point of nth_page is.  If we
physically merge, the page structures should also be consecutive
in memory in general.  The only case where this could theoretically
not hold is with CONFIG_DISCONTIGMEM, but in that case we should
check this once in biovec_phys_mergeable, and only for that case.

Does this patch make a difference for you on x86?

--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -53,7 +53,7 @@ struct bvec_iter_all {
 
 static inline struct page *bvec_nth_page(struct page *page, int idx)
 {
-	return idx == 0 ? page : nth_page(page, idx);
+	return page + idx;
 }
 
 /*
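
For reference, this is roughly how nth_page() is defined in
include/linux/mm.h: with sparsemem and no vmemmap the memmap is not
virtually contiguous across sections, so it has to go back through the pfn;
otherwise it is plain pointer arithmetic, which is what the change above
open-codes:

#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define nth_page(page, n)       pfn_to_page(page_to_pfn((page)) + (n))
#else
#define nth_page(page, n)       ((page) + (n))
#endif

So the shortcut should only change behaviour on sparsemem configs without
vmemmap.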

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio
  2019-03-08  7:55                           ` Christoph Hellwig
@ 2019-03-08  9:12                             ` Ming Lei
  -1 siblings, 0 replies; 128+ messages in thread
From: Ming Lei @ 2019-03-08  9:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ming Lei, Eric Biggers, open list:AIO, linux-block,
	linux-api, Jeff Moyer, Avi Kivity, jannh, Al Viro

On Fri, Mar 08, 2019 at 08:55:22AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 26, 2019 at 07:28:54PM -0700, Jens Axboe wrote:
> > On 2/26/19 7:21 PM, Ming Lei wrote:
> > > On Tue, Feb 26, 2019 at 06:57:16PM -0700, Jens Axboe wrote:
> > >> On 2/26/19 6:53 PM, Ming Lei wrote:
> > >>> On Tue, Feb 26, 2019 at 06:47:54PM -0700, Jens Axboe wrote:
> > >>>> On 2/26/19 6:21 PM, Ming Lei wrote:
> > >>>>> On Tue, Feb 26, 2019 at 11:56 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >>>>>>
> > >>>>>> On 2/25/19 9:34 PM, Jens Axboe wrote:
> > >>>>>>> On 2/25/19 8:46 PM, Eric Biggers wrote:
> > >>>>>>>> Hi Jens,
> > >>>>>>>>
> > >>>>>>>> On Thu, Feb 21, 2019 at 10:45:27AM -0700, Jens Axboe wrote:
> > >>>>>>>>> On 2/20/19 3:58 PM, Ming Lei wrote:
> > >>>>>>>>>> On Mon, Feb 11, 2019 at 12:00:41PM -0700, Jens Axboe wrote:
> > >>>>>>>>>>> For an ITER_BVEC, we can just iterate the iov and add the pages
> > >>>>>>>>>>> to the bio directly. This requires that the caller doesn't releases
> > >>>>>>>>>>> the pages on IO completion, we add a BIO_NO_PAGE_REF flag for that.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The current two callers of bio_iov_iter_get_pages() are updated to
> > >>>>>>>>>>> check if they need to release pages on completion. This makes them
> > >>>>>>>>>>> work with bvecs that contain kernel mapped pages already.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Reviewed-by: Hannes Reinecke <hare@suse.com>
> > >>>>>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> > >>>>>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > >>>>>>>>>>> ---
> > >>>>>>>>>>>  block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
> > >>>>>>>>>>>  fs/block_dev.c            |  5 ++--
> > >>>>>>>>>>>  fs/iomap.c                |  5 ++--
> > >>>>>>>>>>>  include/linux/blk_types.h |  1 +
> > >>>>>>>>>>>  4 files changed, 56 insertions(+), 14 deletions(-)
> > >>>>>>>>>>>
> > >>>>>>>>>>> diff --git a/block/bio.c b/block/bio.c
> > >>>>>>>>>>> index 4db1008309ed..330df572cfb8 100644
> > >>>>>>>>>>> --- a/block/bio.c
> > >>>>>>>>>>> +++ b/block/bio.c
> > >>>>>>>>>>> @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
> > >>>>>>>>>>>  }
> > >>>>>>>>>>>  EXPORT_SYMBOL(bio_add_page);
> > >>>>>>>>>>>
> > >>>>>>>>>>> +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> > >>>>>>>>>>> +{
> > >>>>>>>>>>> + const struct bio_vec *bv = iter->bvec;
> > >>>>>>>>>>> + unsigned int len;
> > >>>>>>>>>>> + size_t size;
> > >>>>>>>>>>> +
> > >>>>>>>>>>> + len = min_t(size_t, bv->bv_len, iter->count);
> > >>>>>>>>>>> + size = bio_add_page(bio, bv->bv_page, len,
> > >>>>>>>>>>> +                         bv->bv_offset + iter->iov_offset);
> > >>>>>>>>>>
> > >>>>>>>>>> iter->iov_offset needs to be subtracted from 'len', looks
> > >>>>>>>>>> the following delta change[1] is required, otherwise memory corruption
> > >>>>>>>>>> can be observed when running xfstests over loop/dio.
> > >>>>>>>>>
> > >>>>>>>>> Thanks, I folded this in.
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Jens Axboe
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> syzkaller started hitting a crash on linux-next starting with this commit, and
> > >>>>>>>> it still occurs even with your latest version that has Ming's fix folded in.
> > >>>>>>>> Specifically, commit a566653ab5ab80a from your io_uring branch with commit date
> > >>>>>>>> Sun Feb 24 08:20:53 2019 -0700.
> > >>>>>>>>
> > >>>>>>>> Reproducer:
> > >>>>>>>>
> > >>>>>>>> #define _GNU_SOURCE
> > >>>>>>>> #include <fcntl.h>
> > >>>>>>>> #include <linux/loop.h>
> > >>>>>>>> #include <sys/ioctl.h>
> > >>>>>>>> #include <sys/sendfile.h>
> > >>>>>>>> #include <sys/syscall.h>
> > >>>>>>>> #include <unistd.h>
> > >>>>>>>>
> > >>>>>>>> int main(void)
> > >>>>>>>> {
> > >>>>>>>>         int memfd, loopfd;
> > >>>>>>>>
> > >>>>>>>>         memfd = syscall(__NR_memfd_create, "foo", 0);
> > >>>>>>>>
> > >>>>>>>>         pwrite(memfd, "\xa8", 1, 4096);
> > >>>>>>>>
> > >>>>>>>>         loopfd = open("/dev/loop0", O_RDWR|O_DIRECT);
> > >>>>>>>>
> > >>>>>>>>         ioctl(loopfd, LOOP_SET_FD, memfd);
> > >>>>>>>>
> > >>>>>>>>         sendfile(loopfd, loopfd, NULL, 1000000);
> > >>>>>>>> }
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> Crash:
> > >>>>>>>>
> > >>>>>>>> page:ffffea0001a6aab8 count:0 mapcount:0 mapping:0000000000000000 index:0x0
> > >>>>>>>> flags: 0x100000000000000()
> > >>>>>>>> raw: 0100000000000000 ffffea0001ad2c50 ffff88807fca49d0 0000000000000000
> > >>>>>>>> raw: 0000000000000000 0000000000000000 00000000ffffffff
> > >>>>>>>> page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
> > >>>>>>>
> > >>>>>>> I see what this is, I'll cut a fix for this tomorrow.
> > >>>>>>
> > >>>>>> Folded in a fix for this, it's in my current io_uring branch and my for-next
> > >>>>>> branch.
> > >>>>>
> > >>>>> Hi Jens,
> > >>>>>
> > >>>>> I saw the following change is added:
> > >>>>>
> > >>>>> + if (size == len) {
> > >>>>> + /*
> > >>>>> + * For the normal O_DIRECT case, we could skip grabbing this
> > >>>>> + * reference and then not have to put them again when IO
> > >>>>> + * completes. But this breaks some in-kernel users, like
> > >>>>> + * splicing to/from a loop device, where we release the pipe
> > >>>>> + * pages unconditionally. If we can fix that case, we can
> > >>>>> + * get rid of the get here and the need to call
> > >>>>> + * bio_release_pages() at IO completion time.
> > >>>>> + */
> > >>>>> + get_page(bv->bv_page);
> > >>>>>
> > >>>>> Now the 'bv' may point to more than one page, so the following one may be
> > >>>>> needed:
> > >>>>>
> > >>>>> int i;
> > >>>>> struct bvec_iter_all iter_all;
> > >>>>> struct bio_vec *tmp;
> > >>>>>
> > >>>>> mp_bvec_for_each_segment(tmp, bv, i, iter_all)
> > >>>>>       get_page(tmp->bv_page);
> > >>>>
> > > Some following optimization can be done, such as removing
> > > biovec_phys_mergeable() from blk_bio_segment_split().
> > 
> > I think we really need a fast path for <= PAGE_SIZE IOs, to the extent
> > that it is possible. But iteration startup cost is a problem in a lot of
> > spots, and a split fast path will only help a bit for that specific
> > case.
> 
> FYI, I've got a nice fast path for the driver side in nvme here, but
> I'll need to do some more testing before submitting it:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-optimize-single-segment-io
> 
> But in the block layer I think one major issue is all the phys_segments
> crap.  What we really should do is to remove bi_phys_segments and all
> the front/back segment crap and only do the calculation of the actual
> per-bio segments once, just before adding the bio to the segment.

I have enabled multi-page bvec for passthrough IO in the following:

https://github.com/ming1/linux/commits/v5.0-blk-for-blk_post_mp

in which .bi_phys_segments becomes the same as .bi_vcnt for passthrough bios.

Also, intra-bvec merging within one bio has been killed, so only merging
between bios is required. It seems we still need the front/back segment
sizes, though, especially since some use cases (such as mkfs) may generate
lots of small mergeable bios.
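
To make the front/back point concrete, here is a stripped-down sketch of the
kind of merge check that still needs those sizes (modelled on the existing
blk_phys_contig_segment(); the physical contiguity and gap checks are left
out, and the helper name is made up):

/*
 * Simplified sketch: when bio 'nxt' is queued right after 'bio', their
 * adjoining segments may fuse into one hardware segment, so the
 * combined size must still respect the queue's max segment size.
 */
static bool sketch_back_front_fits(struct request_queue *q,
                                   struct bio *bio, struct bio *nxt)
{
        return bio->bi_seg_back_size + nxt->bi_seg_front_size <=
                queue_max_segment_size(q);
}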

> 
> And don't bother with it at all unless the driver has weird segment
> size or boundary limitations.

It should be easy to observe cases where .bv_len is bigger than the max
segment size.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-10  2:57             ` Jens Axboe
  (?)
@ 2019-02-10 19:55             ` Matt Mullins
  -1 siblings, 0 replies; 128+ messages in thread
From: Matt Mullins @ 2019-02-10 19:55 UTC (permalink / raw)
  To: linux-block, linux-aio, linux-api, axboe; +Cc: hch, jannh, viro, avi, jmoyer

On Sat, 2019-02-09 at 19:57 -0700, Jens Axboe wrote:
> On 2/9/19 7:34 PM, Jens Axboe wrote:
> > On 2/9/19 6:11 PM, Matt Mullins wrote:
> > > On Sat, 2019-02-09 at 17:47 -0700, Jens Axboe wrote:
> > > > On 2/9/19 4:52 PM, Matt Mullins wrote:
> > > > > > @@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> > > > > >  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
> > > > > >  }
> > > > > >  
> > > > > > +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
> > > > > > +{
> > > > > > +#if defined(CONFIG_UNIX)
> > > > > > +	if (ctx->ring_sock) {
> > > > > > +		struct sock *sock = ctx->ring_sock->sk;
> > > > > > +		struct sk_buff *skb;
> > > > > > +
> > > > > > +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
> > > > > 
> > > > > Something's still a bit messy with destruction.  I get a traceback here
> > > > > while running
> > > > > 
> > > > >   int main() {
> > > > >     struct io_uring_params uring_params = {
> > > > >         .flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL,
> > > > >     };
> > > > >     int uring_fd = 
> > > > >         syscall(425 /* io_uring_setup */, 16, &uring_params);
> > > > >     
> > > > >     const __s32 fds[] = {1};
> > > > >     
> > > > >     syscall(427 /* io_uring_register */, uring_fd,
> > > > >             IORING_REGISTER_FILES, fds, sizeof(fds) / sizeof(*fds));
> > > > >   }
> > > > > 
> > > > > I end up with the following spew:
> > > > > 
> > > > > [  195.983322] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:500 unix_sock_destructor+0x97/0xc0
> > > > > [  195.989556] Modules linked in:
> > > > > [  195.992738] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
> > > > > [  196.000926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > > > > [  196.008316] RIP: 0010:unix_sock_destructor+0x97/0xc0
> > > > > [  196.010912] Code: 3f 37 f3 ff 5b 5d be 00 02 00 00 48 c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b <0f> 0b 48 83 bb 68 02 00 00 00 74 89 0f 0b eb 85 48 89 de 48 c7 c7
> > > > > [  196.018887] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
> > > > > [  196.020754] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
> > > > > [  196.022811] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
> > > > > [  196.024901] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
> > > > > [  196.026977] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
> > > > > [  196.029119] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
> > > > > [  196.031071] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
> > > > > [  196.033069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [  196.034438] CR2: 00007f1a8aba5920 CR3: 000000000260e004 CR4: 00000000003606a0
> > > > > [  196.036310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > [  196.038399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > [  196.039794] Call Trace:
> > > > > [  196.040259]  __sk_destruct+0x1c/0x150
> > > > > [  196.040964]  ? io_sqe_files_unregister+0x32/0x70
> > > > > [  196.041841]  unix_destruct_scm+0x76/0xa0
> > > > > [  196.042587]  skb_release_head_state+0x38/0x60
> > > > > [  196.043401]  skb_release_all+0x9/0x20
> > > > > [  196.044034]  kfree_skb+0x2d/0xb0
> > > > > [  196.044603]  io_sqe_files_unregister+0x32/0x70
> > > > > [  196.045385]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
> > > > > [  196.046220]  io_uring_release+0x17/0x20
> > > > > [  196.046881]  __fput+0x9d/0x1d0
> > > > > [  196.047421]  task_work_run+0x7a/0x90
> > > > > [  196.048045]  do_exit+0x301/0xc20
> > > > > [  196.048626]  ? handle_mm_fault+0xf3/0x230
> > > > > [  196.049321]  do_group_exit+0x35/0xa0
> > > > > [  196.049944]  __x64_sys_exit_group+0xf/0x10
> > > > > [  196.050658]  do_syscall_64+0x3d/0xf0
> > > > > [  196.051317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > [  196.052217] RIP: 0033:0x7f1a8aba5956
> > > > > [  196.052859] Code: Bad RIP value.
> > > > > [  196.053488] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > > > > [  196.054902] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
> > > > > [  196.056124] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> > > > > [  196.057348] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
> > > > > [  196.058573] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
> > > > > [  196.059459] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
> > > > > [  196.060731] ---[ end trace 8a7e42f9199e5f92 ]---
> > > > > [  196.062671] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:501 unix_sock_destructor+0xa3/0xc0
> > > > > [  196.064372] Modules linked in:
> > > > > [  196.064966] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
> > > > > [  196.066546] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > > > > [  196.068234] RIP: 0010:unix_sock_destructor+0xa3/0xc0
> > > > > [  196.068999] Code: c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b 0f 0b 48 83 bb 68 02 00 00 00 74 89 <0f> 0b eb 85 48 89 de 48 c7 c7 a0 c8 42 82 5b 5d e9 31 8c 75 ff 0f
> > > > > [  196.072577] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
> > > > > [  196.073595] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
> > > > > [  196.074973] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
> > > > > [  196.076348] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
> > > > > [  196.077709] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
> > > > > [  196.079072] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
> > > > > [  196.080441] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
> > > > > [  196.082026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > [  196.083131] CR2: 00007fbc19f96550 CR3: 0000000138d1e003 CR4: 00000000003606a0
> > > > > [  196.084505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > [  196.085823] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > [  196.087185] Call Trace:
> > > > > [  196.087662]  __sk_destruct+0x1c/0x150
> > > > > [  196.088376]  ? io_sqe_files_unregister+0x32/0x70
> > > > > [  196.089299]  unix_destruct_scm+0x76/0xa0
> > > > > [  196.090059]  skb_release_head_state+0x38/0x60
> > > > > [  196.090929]  skb_release_all+0x9/0x20
> > > > > [  196.091550]  kfree_skb+0x2d/0xb0
> > > > > [  196.092745]  io_sqe_files_unregister+0x32/0x70
> > > > > [  196.093535]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
> > > > > [  196.094358]  io_uring_release+0x17/0x20
> > > > > [  196.095029]  __fput+0x9d/0x1d0
> > > > > [  196.095660]  task_work_run+0x7a/0x90
> > > > > [  196.096307]  do_exit+0x301/0xc20
> > > > > [  196.096808]  ? handle_mm_fault+0xf3/0x230
> > > > > [  196.097504]  do_group_exit+0x35/0xa0
> > > > > [  196.098126]  __x64_sys_exit_group+0xf/0x10
> > > > > [  196.098836]  do_syscall_64+0x3d/0xf0
> > > > > [  196.099460]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > [  196.100334] RIP: 0033:0x7f1a8aba5956
> > > > > [  196.100958] Code: Bad RIP value.
> > > > > [  196.101293] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > > > > [  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
> > > > > [  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> > > > > [  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
> > > > > [  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
> > > > > [  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
> > > > > [  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
> > > > > [  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34
> > > > > 
> > > > > which corresponds to the WARN_ONs:
> > > > > 
> > > > > 	WARN_ON(!sk_unhashed(sk));
> > > > > 	WARN_ON(sk->sk_socket);
> > > > > 
> > > > > This doesn't seem to happen if I omit the call to io_uring_register.
> > > > 
> > > > Huh, I can't reproduce that here, teardown seems to work just fine. It
> > > > looks like the socket is getting torn down prematurely, when we free the
> > > > skb. I wonder if you have some networking options I don't? What's your
> > > > .config?
> > > > 
> > > 
> > > Interesting.  Attached is the config I'm using to build
> > > af22d31f8b09fa36f57569c95f4943febaacb2b1.  I'll keep playing with it on
> > > my end, too, maybe I've got something bad in my ccache.
> > 
> > Bingo, reproduces with your .config. Looks like the io_uring is released
> > basically as soon as we queue the skb in the socket. I'll take a look at
> > this tomorrow.
> 
> OK, I think I see it. Apparently with my options, the size of the skb is
> 0 when I pass in 0. With your options, it's non-zero, which wreaks havoc
> on the ref counting.
> 
> The below should fix it. I'll fold this in now.
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index d7a10484d748..c8794a11de3e 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2006,6 +2006,7 @@ static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
>  
>  	fpl->max = fpl->count = nr;
>  	UNIXCB(skb).fp = fpl;
> +	refcount_add(skb->truesize, &ctx->ring_sock->sk->sk_wmem_alloc);
>  	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
>  
>  	for (i = 0; i < nr; i++)
> 

Ah-ha!  I guess I've opted into an over-zealous memory allocator :)

Tested-by: Matt Mullins <mmullins@fb.com>

^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-10  2:34           ` Jens Axboe
@ 2019-02-10  2:57             ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-10  2:57 UTC (permalink / raw)
  To: Matt Mullins, linux-block, linux-aio, linux-api
  Cc: hch, jannh, viro, avi, jmoyer

On 2/9/19 7:34 PM, Jens Axboe wrote:
> On 2/9/19 6:11 PM, Matt Mullins wrote:
>> On Sat, 2019-02-09 at 17:47 -0700, Jens Axboe wrote:
>>> On 2/9/19 4:52 PM, Matt Mullins wrote:
>>>>> @@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>>>>  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>>>>>  }
>>>>>  
>>>>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>>>>> +{
>>>>> +#if defined(CONFIG_UNIX)
>>>>> +	if (ctx->ring_sock) {
>>>>> +		struct sock *sock = ctx->ring_sock->sk;
>>>>> +		struct sk_buff *skb;
>>>>> +
>>>>> +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
>>>>
>>>> Something's still a bit messy with destruction.  I get a traceback here
>>>> while running
>>>>
>>>>   int main() {
>>>>     struct io_uring_params uring_params = {
>>>>         .flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL,
>>>>     };
>>>>     int uring_fd = 
>>>>         syscall(425 /* io_uring_setup */, 16, &uring_params);
>>>>     
>>>>     const __s32 fds[] = {1};
>>>>     
>>>>     syscall(427 /* io_uring_register */, uring_fd,
>>>>             IORING_REGISTER_FILES, fds, sizeof(fds) / sizeof(*fds));
>>>>   }
>>>>
>>>> I end up with the following spew:
>>>>
>>>> [  195.983322] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:500 unix_sock_destructor+0x97/0xc0
>>>> [  195.989556] Modules linked in:
>>>> [  195.992738] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
>>>> [  196.000926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>>> [  196.008316] RIP: 0010:unix_sock_destructor+0x97/0xc0
>>>> [  196.010912] Code: 3f 37 f3 ff 5b 5d be 00 02 00 00 48 c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b <0f> 0b 48 83 bb 68 02 00 00 00 74 89 0f 0b eb 85 48 89 de 48 c7 c7
>>>> [  196.018887] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
>>>> [  196.020754] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
>>>> [  196.022811] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
>>>> [  196.024901] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
>>>> [  196.026977] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
>>>> [  196.029119] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
>>>> [  196.031071] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
>>>> [  196.033069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  196.034438] CR2: 00007f1a8aba5920 CR3: 000000000260e004 CR4: 00000000003606a0
>>>> [  196.036310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [  196.038399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [  196.039794] Call Trace:
>>>> [  196.040259]  __sk_destruct+0x1c/0x150
>>>> [  196.040964]  ? io_sqe_files_unregister+0x32/0x70
>>>> [  196.041841]  unix_destruct_scm+0x76/0xa0
>>>> [  196.042587]  skb_release_head_state+0x38/0x60
>>>> [  196.043401]  skb_release_all+0x9/0x20
>>>> [  196.044034]  kfree_skb+0x2d/0xb0
>>>> [  196.044603]  io_sqe_files_unregister+0x32/0x70
>>>> [  196.045385]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
>>>> [  196.046220]  io_uring_release+0x17/0x20
>>>> [  196.046881]  __fput+0x9d/0x1d0
>>>> [  196.047421]  task_work_run+0x7a/0x90
>>>> [  196.048045]  do_exit+0x301/0xc20
>>>> [  196.048626]  ? handle_mm_fault+0xf3/0x230
>>>> [  196.049321]  do_group_exit+0x35/0xa0
>>>> [  196.049944]  __x64_sys_exit_group+0xf/0x10
>>>> [  196.050658]  do_syscall_64+0x3d/0xf0
>>>> [  196.051317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> [  196.052217] RIP: 0033:0x7f1a8aba5956
>>>> [  196.052859] Code: Bad RIP value.
>>>> [  196.053488] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>>>> [  196.054902] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
>>>> [  196.056124] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
>>>> [  196.057348] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
>>>> [  196.058573] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
>>>> [  196.059459] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
>>>> [  196.060731] ---[ end trace 8a7e42f9199e5f92 ]---
>>>> [  196.062671] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:501 unix_sock_destructor+0xa3/0xc0
>>>> [  196.064372] Modules linked in:
>>>> [  196.064966] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
>>>> [  196.066546] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>>> [  196.068234] RIP: 0010:unix_sock_destructor+0xa3/0xc0
>>>> [  196.068999] Code: c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b 0f 0b 48 83 bb 68 02 00 00 00 74 89 <0f> 0b eb 85 48 89 de 48 c7 c7 a0 c8 42 82 5b 5d e9 31 8c 75 ff 0f
>>>> [  196.072577] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
>>>> [  196.073595] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
>>>> [  196.074973] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
>>>> [  196.076348] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
>>>> [  196.077709] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
>>>> [  196.079072] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
>>>> [  196.080441] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
>>>> [  196.082026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  196.083131] CR2: 00007fbc19f96550 CR3: 0000000138d1e003 CR4: 00000000003606a0
>>>> [  196.084505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [  196.085823] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [  196.087185] Call Trace:
>>>> [  196.087662]  __sk_destruct+0x1c/0x150
>>>> [  196.088376]  ? io_sqe_files_unregister+0x32/0x70
>>>> [  196.089299]  unix_destruct_scm+0x76/0xa0
>>>> [  196.090059]  skb_release_head_state+0x38/0x60
>>>> [  196.090929]  skb_release_all+0x9/0x20
>>>> [  196.091550]  kfree_skb+0x2d/0xb0
>>>> [  196.092745]  io_sqe_files_unregister+0x32/0x70
>>>> [  196.093535]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
>>>> [  196.094358]  io_uring_release+0x17/0x20
>>>> [  196.095029]  __fput+0x9d/0x1d0
>>>> [  196.095660]  task_work_run+0x7a/0x90
>>>> [  196.096307]  do_exit+0x301/0xc20
>>>> [  196.096808]  ? handle_mm_fault+0xf3/0x230
>>>> [  196.097504]  do_group_exit+0x35/0xa0
>>>> [  196.098126]  __x64_sys_exit_group+0xf/0x10
>>>> [  196.098836]  do_syscall_64+0x3d/0xf0
>>>> [  196.099460]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> [  196.100334] RIP: 0033:0x7f1a8aba5956
>>>> [  196.100958] Code: Bad RIP value.
>>>> [  196.101293] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>>>> [  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
>>>> [  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
>>>> [  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
>>>> [  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
>>>> [  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
>>>> [  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
>>>> [  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34
>>>>
>>>> which corresponds to the WARN_ONs:
>>>>
>>>> 	WARN_ON(!sk_unhashed(sk));
>>>> 	WARN_ON(sk->sk_socket);
>>>>
>>>> This doesn't seem to happen if I omit the call to io_uring_register.
>>>
>>> Huh, I can't reproduce that here, teardown seems to work just fine. It
>>> looks like the socket is getting torn down prematurely, when we free the
>>> skb. I wonder if you have some networking options I don't? What's your
>>> .config?
>>>
>>
>> Interesting.  Attached is the config I'm using to build
>> af22d31f8b09fa36f57569c95f4943febaacb2b1.  I'll keep playing with it on
>> my end, too, maybe I've got something bad in my ccache.
> 
> Bingo, reproduces with your .config. Looks like the io_uring is released
> basically as soon as we queue the skb in the socket. I'll take a look at
> this tomorrow.

OK, I think I see it. Apparently with my options, the size of the skb is
0 when I pass in 0. With your options, it's non-zero, which wreaks havoc
on the ref counting.

The below should fix it. I'll fold this in now.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d7a10484d748..c8794a11de3e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2006,6 +2006,7 @@ static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
 
 	fpl->max = fpl->count = nr;
 	UNIXCB(skb).fp = fpl;
+	refcount_add(skb->truesize, &ctx->ring_sock->sk->sk_wmem_alloc);
 	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
 
 	for (i = 0; i < nr; i++)
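
For reference, the reason this pairs up (my reading of the trace above): when
the ring is torn down and the skb is freed, its destructor path via
unix_destruct_scm() ends up in sock_wfree(), which roughly does:

        /* roughly the release side in sock_wfree() */
        if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc))
                __sk_free(sk);

Without the refcount_add() above, that subtraction underflows sk_wmem_alloc
for a non-zero truesize skb, and the unix socket gets destroyed while it is
still attached, which is exactly the WARN_ON splat in the report.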

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 128+ messages in thread

>>>> [  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
>>>> [  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
>>>> [  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
>>>> [  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
>>>> [  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
>>>> [  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
>>>> [  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34
>>>>
>>>> which corresponds to the WARN_ONs:
>>>>
>>>> 	WARN_ON(!sk_unhashed(sk));
>>>> 	WARN_ON(sk->sk_socket);
>>>>
>>>> This doesn't seem to happen if I omit the call to io_uring_register.
>>>
>>> Huh, I can't reproduce that here, teardown seems to work just fine. It
>>> looks like the socket is getting torn down prematurely, when we free the
>>> skb. I wonder if you have some networking options I don't? What's your
>>> .config?
>>>
>>
>> Interesting.  Attached is the config I'm using to build
>> af22d31f8b09fa36f57569c95f4943febaacb2b1.  I'll keep playing with it on
>> my end, too, maybe I've got something bad in my ccache.
> 
> Bingo, reproduces with your .config. Looks like the io_uring is released
> basically as soon as we queue the skb in the socket. I'll take a look at
> this tomorrow.

OK, I think I see it. Apparently with my config options, the skb's
->truesize ends up 0 when I allocate it with a size of 0. With your
options it's non-zero, and that unaccounted truesize wreaks havoc on the
socket's ref counting.

The below should fix it. I'll fold this in now.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index d7a10484d748..c8794a11de3e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2006,6 +2006,7 @@ static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
 
 	fpl->max = fpl->count = nr;
 	UNIXCB(skb).fp = fpl;
+	refcount_add(skb->truesize, &ctx->ring_sock->sk->sk_wmem_alloc);
 	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
 
 	for (i = 0; i < nr; i++)
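
To make the accounting explicit: sk_wmem_alloc acts as a reference on
the socket, and the skb destructor path seen in the trace above
(unix_destruct_scm(), which ultimately hands the charge back via
sock_wfree()) drops skb->truesize from it when the skb is freed. If no
matching charge is taken at queue time while truesize is non-zero,
teardown pushes sk_wmem_alloc below its base value and the socket is
destroyed while ctx->ring_sock still points at it, which is exactly the
unix_sock_destructor() WARN_ON splat. A minimal sketch of the pairing,
with made-up helper names (ring_charge() and ring_uncharge() are
illustrative only, not functions in the tree):

#include <linux/skbuff.h>
#include <net/sock.h>

/* Take the truesize charge before queueing, as in the hunk above. */
static void ring_charge(struct sock *sk, struct sk_buff *skb)
{
	refcount_add(skb->truesize, &sk->sk_wmem_alloc);
	skb_queue_head(&sk->sk_receive_queue, skb);
}

/*
 * Roughly what the destructor path does on kfree_skb(): drop the same
 * charge again.  Returns true when sk_wmem_alloc would hit zero, i.e.
 * when the socket itself would be released.
 */
static bool ring_uncharge(struct sock *sk, struct sk_buff *skb)
{
	return refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc);
}

With the one-liner above, every queued skb holds its truesize worth of
references until its destructor gives them back, so freeing the queued
skbs during teardown can no longer drop sk_wmem_alloc below its base
value and prematurely destroy the ring socket.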

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-10  1:11       ` Matt Mullins
@ 2019-02-10  2:34           ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-10  2:34 UTC (permalink / raw)
  To: Matt Mullins, linux-block, linux-aio, linux-api
  Cc: hch, jannh, viro, avi, jmoyer

On 2/9/19 6:11 PM, Matt Mullins wrote:
> On Sat, 2019-02-09 at 17:47 -0700, Jens Axboe wrote:
>> On 2/9/19 4:52 PM, Matt Mullins wrote:
>>>> @@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>>>  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>>>>  }
>>>>  
>>>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>>>> +{
>>>> +#if defined(CONFIG_UNIX)
>>>> +	if (ctx->ring_sock) {
>>>> +		struct sock *sock = ctx->ring_sock->sk;
>>>> +		struct sk_buff *skb;
>>>> +
>>>> +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
>>>
>>> Something's still a bit messy with destruction.  I get a traceback here
>>> while running
>>>
>>>   int main() {
>>>     struct io_uring_params uring_params = {
>>>         .flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL,
>>>     };
>>>     int uring_fd = 
>>>         syscall(425 /* io_uring_setup */, 16, &uring_params);
>>>     
>>>     const __s32 fds[] = {1};
>>>     
>>>     syscall(427 /* io_uring_register */, uring_fd,
>>>             IORING_REGISTER_FILES, fds, sizeof(fds) / sizeof(*fds));
>>>   }
>>>
>>> I end up with the following spew:
>>>
>>> [  195.983322] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:500 unix_sock_destructor+0x97/0xc0
>>> [  195.989556] Modules linked in:
>>> [  195.992738] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
>>> [  196.000926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>> [  196.008316] RIP: 0010:unix_sock_destructor+0x97/0xc0
>>> [  196.010912] Code: 3f 37 f3 ff 5b 5d be 00 02 00 00 48 c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b <0f> 0b 48 83 bb 68 02 00 00 00 74 89 0f 0b eb 85 48 89 de 48 c7 c7
>>> [  196.018887] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
>>> [  196.020754] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
>>> [  196.022811] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
>>> [  196.024901] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
>>> [  196.026977] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
>>> [  196.029119] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
>>> [  196.031071] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
>>> [  196.033069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  196.034438] CR2: 00007f1a8aba5920 CR3: 000000000260e004 CR4: 00000000003606a0
>>> [  196.036310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  196.038399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [  196.039794] Call Trace:
>>> [  196.040259]  __sk_destruct+0x1c/0x150
>>> [  196.040964]  ? io_sqe_files_unregister+0x32/0x70
>>> [  196.041841]  unix_destruct_scm+0x76/0xa0
>>> [  196.042587]  skb_release_head_state+0x38/0x60
>>> [  196.043401]  skb_release_all+0x9/0x20
>>> [  196.044034]  kfree_skb+0x2d/0xb0
>>> [  196.044603]  io_sqe_files_unregister+0x32/0x70
>>> [  196.045385]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
>>> [  196.046220]  io_uring_release+0x17/0x20
>>> [  196.046881]  __fput+0x9d/0x1d0
>>> [  196.047421]  task_work_run+0x7a/0x90
>>> [  196.048045]  do_exit+0x301/0xc20
>>> [  196.048626]  ? handle_mm_fault+0xf3/0x230
>>> [  196.049321]  do_group_exit+0x35/0xa0
>>> [  196.049944]  __x64_sys_exit_group+0xf/0x10
>>> [  196.050658]  do_syscall_64+0x3d/0xf0
>>> [  196.051317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  196.052217] RIP: 0033:0x7f1a8aba5956
>>> [  196.052859] Code: Bad RIP value.
>>> [  196.053488] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>>> [  196.054902] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
>>> [  196.056124] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
>>> [  196.057348] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
>>> [  196.058573] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
>>> [  196.059459] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
>>> [  196.060731] ---[ end trace 8a7e42f9199e5f92 ]---
>>> [  196.062671] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:501 unix_sock_destructor+0xa3/0xc0
>>> [  196.064372] Modules linked in:
>>> [  196.064966] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
>>> [  196.066546] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>> [  196.068234] RIP: 0010:unix_sock_destructor+0xa3/0xc0
>>> [  196.068999] Code: c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b 0f 0b 48 83 bb 68 02 00 00 00 74 89 <0f> 0b eb 85 48 89 de 48 c7 c7 a0 c8 42 82 5b 5d e9 31 8c 75 ff 0f
>>> [  196.072577] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
>>> [  196.073595] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
>>> [  196.074973] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
>>> [  196.076348] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
>>> [  196.077709] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
>>> [  196.079072] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
>>> [  196.080441] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
>>> [  196.082026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  196.083131] CR2: 00007fbc19f96550 CR3: 0000000138d1e003 CR4: 00000000003606a0
>>> [  196.084505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  196.085823] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [  196.087185] Call Trace:
>>> [  196.087662]  __sk_destruct+0x1c/0x150
>>> [  196.088376]  ? io_sqe_files_unregister+0x32/0x70
>>> [  196.089299]  unix_destruct_scm+0x76/0xa0
>>> [  196.090059]  skb_release_head_state+0x38/0x60
>>> [  196.090929]  skb_release_all+0x9/0x20
>>> [  196.091550]  kfree_skb+0x2d/0xb0
>>> [  196.092745]  io_sqe_files_unregister+0x32/0x70
>>> [  196.093535]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
>>> [  196.094358]  io_uring_release+0x17/0x20
>>> [  196.095029]  __fput+0x9d/0x1d0
>>> [  196.095660]  task_work_run+0x7a/0x90
>>> [  196.096307]  do_exit+0x301/0xc20
>>> [  196.096808]  ? handle_mm_fault+0xf3/0x230
>>> [  196.097504]  do_group_exit+0x35/0xa0
>>> [  196.098126]  __x64_sys_exit_group+0xf/0x10
>>> [  196.098836]  do_syscall_64+0x3d/0xf0
>>> [  196.099460]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  196.100334] RIP: 0033:0x7f1a8aba5956
>>> [  196.100958] Code: Bad RIP value.
>>> [  196.101293] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
>>> [  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
>>> [  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
>>> [  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
>>> [  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
>>> [  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
>>> [  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
>>> [  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34
>>>
>>> which corresponds to the WARN_ONs:
>>>
>>> 	WARN_ON(!sk_unhashed(sk));
>>> 	WARN_ON(sk->sk_socket);
>>>
>>> This doesn't seem to happen if I omit the call to io_uring_register.
>>
>> Huh, I can't reproduce that here, teardown seems to work just fine. It
>> looks like the socket is getting torn down prematurely, when we free the
>> skb. I wonder if you have some networking options I don't? What's your
>> .config?
>>
> 
> Interesting.  Attached is the config I'm using to build
> af22d31f8b09fa36f57569c95f4943febaacb2b1.  I'll keep playing with it on
> my end, too, maybe I've got something bad in my ccache.

Bingo, reproduces with your .config. Looks like the io_uring is released
basically as soon as we queue the skb in the socket. I'll take a look at
this tomorrow.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-10  0:47       ` Jens Axboe
  (?)
@ 2019-02-10  1:11       ` Matt Mullins
  2019-02-10  2:34           ` Jens Axboe
  -1 siblings, 1 reply; 128+ messages in thread
From: Matt Mullins @ 2019-02-10  1:11 UTC (permalink / raw)
  To: linux-block, linux-aio, linux-api, axboe; +Cc: hch, jannh, viro, avi, jmoyer

[-- Attachment #1: Type: text/plain, Size: 8056 bytes --]

On Sat, 2019-02-09 at 17:47 -0700, Jens Axboe wrote:
> On 2/9/19 4:52 PM, Matt Mullins wrote:
> > > @@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> > >  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
> > >  }
> > >  
> > > +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
> > > +{
> > > +#if defined(CONFIG_UNIX)
> > > +	if (ctx->ring_sock) {
> > > +		struct sock *sock = ctx->ring_sock->sk;
> > > +		struct sk_buff *skb;
> > > +
> > > +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
> > 
> > Something's still a bit messy with destruction.  I get a traceback here
> > while running
> > 
> >   int main() {
> >     struct io_uring_params uring_params = {
> >         .flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL,
> >     };
> >     int uring_fd = 
> >         syscall(425 /* io_uring_setup */, 16, &uring_params);
> >     
> >     const __s32 fds[] = {1};
> >     
> >     syscall(427 /* io_uring_register */, uring_fd,
> >             IORING_REGISTER_FILES, fds, sizeof(fds) / sizeof(*fds));
> >   }
> > 
> > I end up with the following spew:
> > 
> > [  195.983322] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:500 unix_sock_destructor+0x97/0xc0
> > [  195.989556] Modules linked in:
> > [  195.992738] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
> > [  196.000926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  196.008316] RIP: 0010:unix_sock_destructor+0x97/0xc0
> > [  196.010912] Code: 3f 37 f3 ff 5b 5d be 00 02 00 00 48 c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b <0f> 0b 48 83 bb 68 02 00 00 00 74 89 0f 0b eb 85 48 89 de 48 c7 c7
> > [  196.018887] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
> > [  196.020754] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
> > [  196.022811] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
> > [  196.024901] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
> > [  196.026977] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
> > [  196.029119] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
> > [  196.031071] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
> > [  196.033069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  196.034438] CR2: 00007f1a8aba5920 CR3: 000000000260e004 CR4: 00000000003606a0
> > [  196.036310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  196.038399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  196.039794] Call Trace:
> > [  196.040259]  __sk_destruct+0x1c/0x150
> > [  196.040964]  ? io_sqe_files_unregister+0x32/0x70
> > [  196.041841]  unix_destruct_scm+0x76/0xa0
> > [  196.042587]  skb_release_head_state+0x38/0x60
> > [  196.043401]  skb_release_all+0x9/0x20
> > [  196.044034]  kfree_skb+0x2d/0xb0
> > [  196.044603]  io_sqe_files_unregister+0x32/0x70
> > [  196.045385]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
> > [  196.046220]  io_uring_release+0x17/0x20
> > [  196.046881]  __fput+0x9d/0x1d0
> > [  196.047421]  task_work_run+0x7a/0x90
> > [  196.048045]  do_exit+0x301/0xc20
> > [  196.048626]  ? handle_mm_fault+0xf3/0x230
> > [  196.049321]  do_group_exit+0x35/0xa0
> > [  196.049944]  __x64_sys_exit_group+0xf/0x10
> > [  196.050658]  do_syscall_64+0x3d/0xf0
> > [  196.051317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  196.052217] RIP: 0033:0x7f1a8aba5956
> > [  196.052859] Code: Bad RIP value.
> > [  196.053488] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > [  196.054902] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
> > [  196.056124] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> > [  196.057348] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
> > [  196.058573] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
> > [  196.059459] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
> > [  196.060731] ---[ end trace 8a7e42f9199e5f92 ]---
> > [  196.062671] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:501 unix_sock_destructor+0xa3/0xc0
> > [  196.064372] Modules linked in:
> > [  196.064966] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
> > [  196.066546] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  196.068234] RIP: 0010:unix_sock_destructor+0xa3/0xc0
> > [  196.068999] Code: c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b 0f 0b 48 83 bb 68 02 00 00 00 74 89 <0f> 0b eb 85 48 89 de 48 c7 c7 a0 c8 42 82 5b 5d e9 31 8c 75 ff 0f
> > [  196.072577] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
> > [  196.073595] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
> > [  196.074973] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
> > [  196.076348] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
> > [  196.077709] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
> > [  196.079072] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
> > [  196.080441] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
> > [  196.082026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  196.083131] CR2: 00007fbc19f96550 CR3: 0000000138d1e003 CR4: 00000000003606a0
> > [  196.084505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  196.085823] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  196.087185] Call Trace:
> > [  196.087662]  __sk_destruct+0x1c/0x150
> > [  196.088376]  ? io_sqe_files_unregister+0x32/0x70
> > [  196.089299]  unix_destruct_scm+0x76/0xa0
> > [  196.090059]  skb_release_head_state+0x38/0x60
> > [  196.090929]  skb_release_all+0x9/0x20
> > [  196.091550]  kfree_skb+0x2d/0xb0
> > [  196.092745]  io_sqe_files_unregister+0x32/0x70
> > [  196.093535]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
> > [  196.094358]  io_uring_release+0x17/0x20
> > [  196.095029]  __fput+0x9d/0x1d0
> > [  196.095660]  task_work_run+0x7a/0x90
> > [  196.096307]  do_exit+0x301/0xc20
> > [  196.096808]  ? handle_mm_fault+0xf3/0x230
> > [  196.097504]  do_group_exit+0x35/0xa0
> > [  196.098126]  __x64_sys_exit_group+0xf/0x10
> > [  196.098836]  do_syscall_64+0x3d/0xf0
> > [  196.099460]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  196.100334] RIP: 0033:0x7f1a8aba5956
> > [  196.100958] Code: Bad RIP value.
> > [  196.101293] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > [  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
> > [  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> > [  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
> > [  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
> > [  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
> > [  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
> > [  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34
> > 
> > which corresponds to the WARN_ONs:
> > 
> > 	WARN_ON(!sk_unhashed(sk));
> > 	WARN_ON(sk->sk_socket);
> > 
> > This doesn't seem to happen if I omit the call to io_uring_register.
> 
> Huh, I can't reproduce that here, teardown seems to work just fine. It
> looks like the socket is getting torn down prematurely, when we free the
> skb. I wonder if you have some networking options I don't? What's your
> .config?
> 

Interesting.  Attached is the config I'm using to build
af22d31f8b09fa36f57569c95f4943febaacb2b1.  I'll keep playing with it on
my end, too; maybe I've got something bad in my ccache.

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 81145 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.0.0-rc5 Kernel Configuration
#

#
# Compiler: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=70300
CONFIG_CLANG_VERSION=0
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_DEBUGFS=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=18
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_KMEM=y
# CONFIG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_CGROUP_PIDS=y
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_CGROUP_DEVICE is not set
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# CONFIG_RD_LZ4 is not set
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_BPF_SYSCALL=y
# CONFIG_BPF_JIT_ALWAYS_ON is not set
CONFIG_USERFAULTFD=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_RSEQ=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLAB_MERGE_DEFAULT is not set
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_FILTER_PGPROT=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
# CONFIG_X86_X2APIC is not set
# CONFIG_X86_MPPARSE is not set
# CONFIG_GOLDFISH is not set
# CONFIG_RETPOLINE is not set
# CONFIG_X86_CPU_RESCTRL is not set
# CONFIG_X86_EXTENDED_PLATFORM is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
# CONFIG_IOSF_MBI is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_PARAVIRT_SPINLOCKS=y
# CONFIG_QUEUED_LOCK_STAT is not set
# CONFIG_XEN is not set
CONFIG_KVM_GUEST=y
# CONFIG_PVH is not set
# CONFIG_KVM_DEBUG_FS is not set
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_JAILHOUSE_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_GART_IOMMU is not set
# CONFIG_CALGARY_IOMMU is not set
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=8
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
# CONFIG_X86_MCE is not set

#
# Performance monitoring
#
# CONFIG_PERF_EVENTS_INTEL_UNCORE is not set
# CONFIG_PERF_EVENTS_INTEL_RAPL is not set
# CONFIG_PERF_EVENTS_INTEL_CSTATE is not set
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_X86_5LEVEL is not set
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT is not set
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_X86_INTEL_UMIP=y
# CONFIG_X86_INTEL_MPX is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_EFI=y
# CONFIG_EFI_STUB is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
# CONFIG_KEXEC_VERIFY_SIG is not set
CONFIG_CRASH_DUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
# CONFIG_HOTPLUG_CPU is not set
# CONFIG_COMPAT_VDSO is not set
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_MODIFY_LDT_SYSCALL=y
CONFIG_HAVE_LIVEPATCH=y
CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y

#
# Power management and ACPI options
#
# CONFIG_SUSPEND is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
# CONFIG_ACPI_SPCR_TABLE is not set
CONFIG_ACPI_LPIT=y
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
# CONFIG_ACPI_AC is not set
# CONFIG_ACPI_BATTERY is not set
# CONFIG_ACPI_BUTTON is not set
# CONFIG_ACPI_FAN is not set
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
# CONFIG_ACPI_THERMAL is not set
CONFIG_ACPI_NUMA=y
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
# CONFIG_ACPI_HED is not set
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_NFIT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
# CONFIG_ACPI_APEI is not set
# CONFIG_DPTF_POWER is not set
# CONFIG_PMIC_OPREGION is not set
# CONFIG_ACPI_CONFIGFS is not set
CONFIG_X86_PM_TIMER=y
# CONFIG_SFI is not set

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_GOV_POWERSAVE is not set
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
# CONFIG_X86_PCC_CPUFREQ is not set
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_INTEL_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# CONFIG_X86_SYSFB is not set

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_HAVE_GENERIC_GUP=y

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT_FIND is not set
CONFIG_FW_CFG_SYSFS=y
# CONFIG_FW_CFG_SYSFS_CMDLINE is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_VARS=y
CONFIG_EFI_ESRT=y
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set

#
# Tegra firmware driver
#
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_MMU_AUDIT is not set
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# General architecture-dependent options
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_HOTPLUG_SMT=y
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_HAVE_RCU_TABLE_FREE=y
CONFIG_HAVE_RCU_TABLE_INVALIDATE=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_COPY_THREAD_TLS=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_ARCH_HAS_REFCOUNT=y
CONFIG_REFCOUNT_FULL=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
CONFIG_PLUGIN_HOSTCC=""
CONFIG_HAVE_GCC_PLUGINS=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
# CONFIG_MODULE_COMPRESS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLK_SCSI_REQUEST=y
CONFIG_BLK_DEV_BSG=y
# CONFIG_BLK_DEV_BSGLIB is not set
CONFIG_BLK_DEV_INTEGRITY=y
# CONFIG_BLK_DEV_ZONED is not set
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_BLK_WBT is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y
CONFIG_BLK_PM=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y
CONFIG_COREDUMP=y

#
# Memory Management options
#
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_MEMORY_BALLOON=y
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
# CONFIG_CLEANCACHE is not set
CONFIG_CMA=y
# CONFIG_CMA_DEBUG is not set
CONFIG_CMA_DEBUGFS=y
CONFIG_CMA_AREAS=7
CONFIG_MEM_SOFT_DIRTY=y
# CONFIG_ZPOOL is not set
# CONFIG_ZBUD is not set
# CONFIG_ZSMALLOC is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_DEFERRED_STRUCT_PAGE_INIT=y
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_HAS_ZONE_DEVICE=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_BENCHMARK is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_NET=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
# CONFIG_UNIX_DIAG is not set
CONFIG_TLS=y
# CONFIG_TLS_DEVICE is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_INTERFACE=y
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
CONFIG_XDP_SOCKETS=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_TCP_CONG_DCTCP is not set
# CONFIG_TCP_CONG_CDG is not set
# CONFIG_TCP_CONG_BBR is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
# CONFIG_NETFILTER is not set
CONFIG_BPFILTER=y
CONFIG_BPFILTER_UMH=y
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
CONFIG_HAVE_NET_DSA=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_NET_NCSI is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_JIT=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_AF_KCM=y
CONFIG_STREAM_PARSER=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_GRO_CELLS=y
CONFIG_NET_SOCK_MSG=y
# CONFIG_NET_DEVLINK is not set
CONFIG_MAY_USE_DEVLINK=y
CONFIG_FAILOVER=y
CONFIG_HAVE_EBPF_JIT=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
# CONFIG_PCIEAER is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_STUB is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
# CONFIG_PCI_IOV is not set
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# PCI controller drivers
#

#
# Cadence PCIe controllers support
#
# CONFIG_VMD is not set

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT_HOST is not set
# CONFIG_PCI_MESON is not set

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_FW_LOADER_USER_HELPER is not set
CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# CONFIG_DMA_CMA is not set

#
# Bus devices
#
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_GNSS is not set
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_NULL_BLK=y
# CONFIG_BLK_DEV_FD is not set
CONFIG_CDROM=y
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_DEV_UMEM is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
CONFIG_BLK_DEV_NBD=y
# CONFIG_BLK_DEV_SKD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_VIRTIO_BLK is not set
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set

#
# NVME Support
#
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TARGET is not set

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_SRAM is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
CONFIG_PVPANIC=y
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set

#
# Intel MIC & related support
#

#
# Intel MIC Bus Driver
#
# CONFIG_INTEL_MIC_BUS is not set

#
# SCIF Bus Driver
#
# CONFIG_SCIF_BUS is not set

#
# VOP Bus Driver
#
# CONFIG_VOP_BUS is not set

#
# Intel MIC Host Driver
#

#
# Intel MIC Card Driver
#

#
# SCIF Driver
#

#
# Intel MIC Coprocessor State Management (COSM) Drivers
#

#
# VOP Driver
#
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT3SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
CONFIG_SCSI_VIRTIO=y
# CONFIG_SCSI_DH is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_MD_MULTIPATH=y
# CONFIG_MD_FAULTY is not set
CONFIG_BCACHE=y
# CONFIG_BCACHE_DEBUG is not set
# CONFIG_BCACHE_CLOSURES_DEBUG is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=y
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
CONFIG_DM_UNSTRIPED=y
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_THIN_PROVISIONING=y
CONFIG_DM_CACHE=y
CONFIG_DM_CACHE_SMQ=y
CONFIG_DM_WRITECACHE=y
CONFIG_DM_ERA=y
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
CONFIG_DM_RAID=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_MULTIPATH_QL=y
CONFIG_DM_MULTIPATH_ST=y
CONFIG_DM_DELAY=y
CONFIG_DM_UEVENT=y
CONFIG_DM_FLAKEY=y
CONFIG_DM_VERITY=y
# CONFIG_DM_VERITY_FEC is not set
CONFIG_DM_SWITCH=y
CONFIG_DM_LOG_WRITES=y
CONFIG_DM_INTEGRITY=y
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
# CONFIG_NET_FC is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_MACSEC is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_TUN is not set
# CONFIG_TUN_VNET_CROSS_LE is not set
CONFIG_VETH=y
CONFIG_VIRTIO_NET=y
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
# CONFIG_ETHERNET is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_MDIO_DEVICE is not set
# CONFIG_PHYLIB is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_USB_NET_DRIVERS is not set
# CONFIG_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
# CONFIG_NETDEVSIM is not set
CONFIG_NET_FAILOVER=y
# CONFIG_ISDN is not set
# CONFIG_NVM is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y
CONFIG_INPUT_SPARSEKMAP=y
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
# CONFIG_INPUT_MOUSEDEV is not set
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_SENTELIC is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
# CONFIG_MOUSE_PS2_VMMOUSE is not set
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
# CONFIG_INPUT_JOYSTICK is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_PEGASUS is not set
# CONFIG_TABLET_SERIAL_WACOM4 is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_SERIO_OLPC_APSP is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_NOZOMI is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
CONFIG_DEVMEM=y
# CONFIG_DEVKMEM is not set

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
# CONFIG_SERIAL_8250_MOXA is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_DEV_BUS is not set
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_HW_RANDOM_VIRTIO=y
# CONFIG_NVRAM is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
# CONFIG_XILLYBUS is not set
# CONFIG_RANDOM_TRUST_CPU is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
CONFIG_I2C_I801=y
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I3C is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
# CONFIG_PPS is not set

#
# PTP clock support
#
# CONFIG_PTP_1588_CLOCK is not set

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_PINCTRL is not set
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_AVS is not set
# CONFIG_POWER_RESET is not set
# CONFIG_POWER_SUPPLY is not set
# CONFIG_HWMON is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
# CONFIG_INTEL_POWERCLAMP is not set
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# CONFIG_INTEL_PCH_THERMAL is not set
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CROS_EC is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_INTEL_SOC_PMIC_CHTWC is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SMSC is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TPS68470 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS80031 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_AGP is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_DRM=y
# CONFIG_DRM_DP_AUX_CHARDEV is not set
# CONFIG_DRM_DEBUG_MM is not set
# CONFIG_DRM_DEBUG_SELFTEST is not set
CONFIG_DRM_KMS_HELPER=y
CONFIG_DRM_KMS_FB_HELPER=y
CONFIG_DRM_FBDEV_EMULATION=y
CONFIG_DRM_FBDEV_OVERALLOC=100
# CONFIG_DRM_LOAD_EDID_FIRMWARE is not set
# CONFIG_DRM_DP_CEC is not set
CONFIG_DRM_TTM=y

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I2C_NXP_TDA998X is not set
# CONFIG_DRM_I2C_NXP_TDA9950 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_AMDGPU is not set

#
# ACP (Audio CoProcessor) Configuration
#

#
# AMD Library routines
#
# CONFIG_DRM_NOUVEAU is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_VGEM is not set
# CONFIG_DRM_VKMS is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
CONFIG_DRM_CIRRUS_QEMU=y
CONFIG_DRM_QXL=y
# CONFIG_DRM_BOCHS is not set
# CONFIG_DRM_VIRTIO_GPU is not set
CONFIG_DRM_PANEL=y

#
# Display Panels
#
CONFIG_DRM_BRIDGE=y
CONFIG_DRM_PANEL_BRIDGE=y

#
# Display Interface Bridges
#
# CONFIG_DRM_ANALOGIX_ANX78XX is not set
# CONFIG_DRM_HISI_HIBMC is not set
# CONFIG_DRM_TINYDRM is not set
# CONFIG_DRM_LEGACY is not set
CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
# CONFIG_BACKLIGHT_LCD_SUPPORT is not set
CONFIG_HDMI=y

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_VGACON_SOFT_SCROLLBACK_PERSISTENT_ENABLE_BY_DEFAULT is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
# CONFIG_HID_BETOP_FF is not set
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_CMEDIA is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
CONFIG_HID_GYRATION=y
# CONFIG_HID_ICADE is not set
CONFIG_HID_ITE=y
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_HID_LOGITECH_HIDPP is not set
CONFIG_LOGITECH_FF=y
# CONFIG_LOGIRUMBLEPAD2_FF is not set
# CONFIG_LOGIG940_FF is not set
CONFIG_LOGIWHEELS_FF=y
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MAYFLASH is not set
# CONFIG_HID_REDRAGON is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
CONFIG_HID_PANTHERLORD=y
CONFIG_PANTHERLORD_FF=y
# CONFIG_HID_PENMOUNT is not set
CONFIG_HID_PETALYNX=y
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
CONFIG_HID_SAMSUNG=y
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
CONFIG_HID_SUNPLUS=y
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
CONFIG_HID_TOPSEED=y
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# I2C HID support
#
# CONFIG_I2C_HID is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_WHITELIST is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_XHCI_HCD is not set
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_REALTEK is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
# CONFIG_USB_STORAGE_ENE_UB6250 is not set
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
# CONFIG_RTC_HCTOSYS is not set
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV8803 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_HID_SENSOR_TIME is not set
# CONFIG_DMADEVICES is not set

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
# CONFIG_SW_SYNC is not set
# CONFIG_UDMABUF is not set
# CONFIG_AUXDISPLAY is not set
# CONFIG_UIO is not set
CONFIG_IRQ_BYPASS_MANAGER=y
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO=y
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_LEGACY=y
CONFIG_VIRTIO_BALLOON=y
CONFIG_VIRTIO_INPUT=y
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set
# CONFIG_STAGING is not set
# CONFIG_X86_PLATFORM_DEVICES is not set
CONFIG_PMC_ATOM=y
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_CLKDEV_LOOKUP=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y

#
# Common Clock Framework
#
# CONFIG_COMMON_CLK_MAX9485 is not set
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_SI544 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
# CONFIG_IOMMU_SUPPORT is not set

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set
# CONFIG_RPMSG_VIRTIO is not set
# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#

#
# Broadcom SoC drivers
#

#
# NXP/Freescale QorIQ SoC drivers
#

#
# i.MX SoC drivers
#

#
# Qualcomm SoC drivers
#
# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# CONFIG_XILINX_VCU is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
CONFIG_ARM_GIC_MAX_NR=1
# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set
# CONFIG_FMC is not set

#
# PHY Subsystem
#
# CONFIG_GENERIC_PHY is not set
# CONFIG_BCM_KONA_USB2_PHY is not set
# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# CONFIG_RAS is not set
# CONFIG_THUNDERBOLT is not set

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_LIBNVDIMM is not set
CONFIG_DAX=y
# CONFIG_DEV_DAX is not set
CONFIG_NVMEM=y

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# CONFIG_FPGA is not set
# CONFIG_UNISYS_VISORBUS is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
CONFIG_FS_IOMAP=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_XFS_FS=y
# CONFIG_XFS_QUOTA is not set
CONFIG_XFS_POSIX_ACL=y
# CONFIG_XFS_RT is not set
CONFIG_XFS_ONLINE_SCRUB=y
# CONFIG_XFS_ONLINE_REPAIR is not set
# CONFIG_XFS_WARN is not set
# CONFIG_XFS_DEBUG is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_BTRFS_FS=y
CONFIG_BTRFS_FS_POSIX_ACL=y
# CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
# CONFIG_BTRFS_FS_REF_VERIFY is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_FS_DAX is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_MANDATORY_FILE_LOCKING=y
# CONFIG_FS_ENCRYPTION is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
CONFIG_AUTOFS_FS=y
CONFIG_FUSE_FS=y
CONFIG_CUSE=y
CONFIG_OVERLAY_FS=y
CONFIG_OVERLAY_FS_REDIRECT_DIR=y
CONFIG_OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW=y
CONFIG_OVERLAY_FS_INDEX=y
CONFIG_OVERLAY_FS_NFS_EXPORT=y
# CONFIG_OVERLAY_FS_XINO_AUTO is not set
# CONFIG_OVERLAY_FS_METACOPY is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
# CONFIG_PROC_VMCORE_DEVICE_DUMP is not set
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_MEMFD_CREATE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=y
# CONFIG_MISC_FILESYSTEMS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
# CONFIG_NFS_V4_1 is not set
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
# CONFIG_NFSD is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
# CONFIG_SUNRPC_DEBUG is not set
# CONFIG_CEPH_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS2 is not set
CONFIG_CIFS_ALLOW_INSECURE_LEGACY=y
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_UPCALL is not set
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_ACL=y
CONFIG_CIFS_DEBUG=y
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_DEBUG_DUMP_KEYS is not set
# CONFIG_CIFS_DFS_UPCALL is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=y
# CONFIG_9P_FS_POSIX_ACL is not set
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_COMPAT=y
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_BIG_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
# CONFIG_SECURITY is not set
CONFIG_SECURITYFS=y
CONFIG_PAGE_TABLE_ISOLATION=y
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_DEFAULT_SECURITY=""
CONFIG_XOR_BLOCKS=y
CONFIG_ASYNC_CORE=y
CONFIG_ASYNC_MEMCPY=y
CONFIG_ASYNC_XOR=y
CONFIG_ASYNC_PQ=y
CONFIG_ASYNC_RAID6_RECOV=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
# CONFIG_CRYPTO_RSA is not set
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_ENGINE=y

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_AEGIS128L is not set
# CONFIG_CRYPTO_AEGIS256 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
# CONFIG_CRYPTO_AEGIS128L_AESNI_SSE2 is not set
# CONFIG_CRYPTO_AEGIS256_AESNI_SSE2 is not set
# CONFIG_CRYPTO_MORUS640 is not set
# CONFIG_CRYPTO_MORUS640_SSE2 is not set
# CONFIG_CRYPTO_MORUS1280 is not set
# CONFIG_CRYPTO_MORUS1280_SSE2 is not set
# CONFIG_CRYPTO_MORUS1280_AVX2 is not set
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=y

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_ADIANTUM is not set

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
CONFIG_CRYPTO_CRCT10DIF=y
# CONFIG_CRYPTO_CRCT10DIF_PCLMUL is not set
CONFIG_CRYPTO_GHASH=y
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_SM3 is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_CCP is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
CONFIG_CRYPTO_DEV_VIRTIO=y
# CONFIG_ASYMMETRIC_KEY_TYPE is not set

#
# Certificates for signature checking
#
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=y
CONFIG_RAID6_PQ_BENCHMARK=y
CONFIG_BITREVERSE=y
CONFIG_RATIONAL=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
CONFIG_CRC64=y
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_ZSTD_COMPRESS=y
CONFIG_ZSTD_DECOMPRESS=y
# CONFIG_XZ_DEC is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_SWIOTLB=y
CONFIG_SGL_ALLOC=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set
# CONFIG_IRQ_POLL is not set
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_SBITMAP=y
# CONFIG_STRING_SELFTEST is not set

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_DYNAMIC_DEBUG is not set

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_DWARF4=y
CONFIG_GDB_SCRIPTS=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_PAGE_OWNER is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_STACK_VALIDATION=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_DEBUG_KERNEL=y

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
CONFIG_DEBUG_STACK_USAGE=y
# CONFIG_DEBUG_VM is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_CC_HAS_KASAN_GENERIC=y
# CONFIG_KASAN is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Lockups and Hangs
#
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_WQ_WATCHDOG=y
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
# CONFIG_SCHED_DEBUG is not set
CONFIG_SCHED_INFO=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
# CONFIG_DEBUG_TIMEKEEPING is not set
CONFIG_DEBUG_PREEMPT=y

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PI_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_TRACE=y
# CONFIG_RCU_EQS_DEBUG is not set
# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_FUNCTION_TRACER is not set
# CONFIG_PREEMPTIRQ_EVENTS is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_TRACER_SNAPSHOT is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
# CONFIG_STACK_TRACER is not set
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_BPF_KPROBE_OVERRIDE=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_RUNTIME_TESTING_MENU is not set
# CONFIG_MEMTEST is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
# CONFIG_KGDB_TESTS is not set
# CONFIG_KGDB_LOW_LEVEL_TRAP is not set
# CONFIG_KGDB_KDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
# CONFIG_STRICT_DEVMEM is not set
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_EARLY_PRINTK_EFI is not set
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
# CONFIG_X86_PTDUMP is not set
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_WX is not set
CONFIG_DOUBLEFAULT=y
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-09 23:52   ` Matt Mullins
@ 2019-02-10  0:47       ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-10  0:47 UTC (permalink / raw)
  To: Matt Mullins, linux-block, linux-aio, linux-api
  Cc: hch, jannh, viro, avi, jmoyer

On 2/9/19 4:52 PM, Matt Mullins wrote:
>> @@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>>  }
>>  
>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +#if defined(CONFIG_UNIX)
>> +	if (ctx->ring_sock) {
>> +		struct sock *sock = ctx->ring_sock->sk;
>> +		struct sk_buff *skb;
>> +
>> +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
> 
> Something's still a bit messy with destruction.  I get a traceback here
> while running
> 
>   int main() {
>     struct io_uring_params uring_params = {
>         .flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL,
>     };
>     int uring_fd = 
>         syscall(425 /* io_uring_setup */, 16, &uring_params);
>     
>     const __s32 fds[] = {1};
>     
>     syscall(427 /* io_uring_register */, uring_fd,
>             IORING_REGISTER_FILES, fds, sizeof(fds) / sizeof(*fds));
>   }
> 
> I end up with the following spew:
> 
> [  195.983322] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:500 unix_sock_destructor+0x97/0xc0
> [  195.989556] Modules linked in:
> [  195.992738] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
> [  196.000926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [  196.008316] RIP: 0010:unix_sock_destructor+0x97/0xc0
> [  196.010912] Code: 3f 37 f3 ff 5b 5d be 00 02 00 00 48 c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b <0f> 0b 48 83 bb 68 02 00 00 00 74 89 0f 0b eb 85 48 89 de 48 c7 c7
> [  196.018887] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
> [  196.020754] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
> [  196.022811] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
> [  196.024901] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
> [  196.026977] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
> [  196.029119] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
> [  196.031071] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
> [  196.033069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  196.034438] CR2: 00007f1a8aba5920 CR3: 000000000260e004 CR4: 00000000003606a0
> [  196.036310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  196.038399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  196.039794] Call Trace:
> [  196.040259]  __sk_destruct+0x1c/0x150
> [  196.040964]  ? io_sqe_files_unregister+0x32/0x70
> [  196.041841]  unix_destruct_scm+0x76/0xa0
> [  196.042587]  skb_release_head_state+0x38/0x60
> [  196.043401]  skb_release_all+0x9/0x20
> [  196.044034]  kfree_skb+0x2d/0xb0
> [  196.044603]  io_sqe_files_unregister+0x32/0x70
> [  196.045385]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
> [  196.046220]  io_uring_release+0x17/0x20
> [  196.046881]  __fput+0x9d/0x1d0
> [  196.047421]  task_work_run+0x7a/0x90
> [  196.048045]  do_exit+0x301/0xc20
> [  196.048626]  ? handle_mm_fault+0xf3/0x230
> [  196.049321]  do_group_exit+0x35/0xa0
> [  196.049944]  __x64_sys_exit_group+0xf/0x10
> [  196.050658]  do_syscall_64+0x3d/0xf0
> [  196.051317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  196.052217] RIP: 0033:0x7f1a8aba5956
> [  196.052859] Code: Bad RIP value.
> [  196.053488] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [  196.054902] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
> [  196.056124] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [  196.057348] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
> [  196.058573] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
> [  196.059459] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
> [  196.060731] ---[ end trace 8a7e42f9199e5f92 ]---
> [  196.062671] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:501 unix_sock_destructor+0xa3/0xc0
> [  196.064372] Modules linked in:
> [  196.064966] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
> [  196.066546] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [  196.068234] RIP: 0010:unix_sock_destructor+0xa3/0xc0
> [  196.068999] Code: c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b 0f 0b 48 83 bb 68 02 00 00 00 74 89 <0f> 0b eb 85 48 89 de 48 c7 c7 a0 c8 42 82 5b 5d e9 31 8c 75 ff 0f
> [  196.072577] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
> [  196.073595] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
> [  196.074973] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
> [  196.076348] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
> [  196.077709] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
> [  196.079072] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
> [  196.080441] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
> [  196.082026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  196.083131] CR2: 00007fbc19f96550 CR3: 0000000138d1e003 CR4: 00000000003606a0
> [  196.084505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  196.085823] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  196.087185] Call Trace:
> [  196.087662]  __sk_destruct+0x1c/0x150
> [  196.088376]  ? io_sqe_files_unregister+0x32/0x70
> [  196.089299]  unix_destruct_scm+0x76/0xa0
> [  196.090059]  skb_release_head_state+0x38/0x60
> [  196.090929]  skb_release_all+0x9/0x20
> [  196.091550]  kfree_skb+0x2d/0xb0
> [  196.092745]  io_sqe_files_unregister+0x32/0x70
> [  196.093535]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
> [  196.094358]  io_uring_release+0x17/0x20
> [  196.095029]  __fput+0x9d/0x1d0
> [  196.095660]  task_work_run+0x7a/0x90
> [  196.096307]  do_exit+0x301/0xc20
> [  196.096808]  ? handle_mm_fault+0xf3/0x230
> [  196.097504]  do_group_exit+0x35/0xa0
> [  196.098126]  __x64_sys_exit_group+0xf/0x10
> [  196.098836]  do_syscall_64+0x3d/0xf0
> [  196.099460]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  196.100334] RIP: 0033:0x7f1a8aba5956
> [  196.100958] Code: Bad RIP value.
> [  196.101293] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
> [  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
> [  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
> [  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
> [  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
> [  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
> [  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34
> 
> which corresponds to the WARN_ONs:
> 
> 	WARN_ON(!sk_unhashed(sk));
> 	WARN_ON(sk->sk_socket);
> 
> This doesn't seem to happen if I omit the call to io_uring_register.

Huh, I can't reproduce that here; teardown seems to work just fine. It
looks like the socket is getting torn down prematurely, when we free the
skb. I wonder if you have some networking options enabled that I don't?
What's your .config?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-09 21:13   ` Jens Axboe
  (?)
@ 2019-02-09 23:52   ` Matt Mullins
  2019-02-10  0:47       ` Jens Axboe
  -1 siblings, 1 reply; 128+ messages in thread
From: Matt Mullins @ 2019-02-09 23:52 UTC (permalink / raw)
  To: linux-block, linux-aio, linux-api, axboe; +Cc: hch, jannh, viro, avi, jmoyer

On Sat, 2019-02-09 at 14:13 -0700, Jens Axboe wrote:
> We normally have to fget/fput for each IO we do on a file. Even with
> the batching we do, the cost of the atomic inc/dec of the file usage
> count adds up.
> 
> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> for the io_uring_register(2) system call. The arguments passed in must
> be an array of __s32 holding file descriptors, and nr_args should hold
> the number of file descriptors the application wishes to pin for the
> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
> called).
> 
> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> to the index in the array passed in to IORING_REGISTER_FILES.
> 
> Files are automatically unregistered when the io_uring instance is torn
> down. An application need only unregister if it wishes to register a new
> set of fds.
> 
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/io_uring.c                 | 269 ++++++++++++++++++++++++++++++----
>  include/uapi/linux/io_uring.h |   9 +-
>  2 files changed, 245 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 09a3122b3b6c..c40a7ed2edd5 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -29,6 +29,7 @@
>  #include <linux/net.h>
>  #include <net/sock.h>
>  #include <net/af_unix.h>
> +#include <net/scm.h>
>  #include <linux/anon_inodes.h>
>  #include <linux/sched/mm.h>
>  #include <linux/uaccess.h>
> @@ -41,6 +42,7 @@
>  #include "internal.h"
>  
>  #define IORING_MAX_ENTRIES	4096
> +#define IORING_MAX_FIXED_FILES	1024
>  
>  struct io_uring {
>  	u32 head ____cacheline_aligned_in_smp;
> @@ -103,6 +105,14 @@ struct io_ring_ctx {
>  		struct fasync_struct	*cq_fasync;
>  	} ____cacheline_aligned_in_smp;
>  
> +	/*
> +	 * If used, fixed file set. Writers must ensure that ->refs is dead,
> +	 * readers must ensure that ->refs is alive as long as the file* is
> +	 * used. Only updated through io_uring_register(2).
> +	 */
> +	struct file		**user_files;
> +	unsigned		nr_user_files;
> +
>  	/* if used, fixed mapped user buffers */
>  	unsigned		nr_user_bufs;
>  	struct io_mapped_ubuf	*user_bufs;
> @@ -150,6 +160,7 @@ struct io_kiocb {
>  	unsigned int		flags;
>  #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
>  #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
> +#define REQ_F_FIXED_FILE	4	/* ctx owns file */
>  	u64			user_data;
>  	u64			error;
>  
> @@ -380,15 +391,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
>  		 * Batched puts of the same file, to avoid dirtying the
>  		 * file usage count multiple times, if avoidable.
>  		 */
> -		if (!file) {
> -			file = req->rw.ki_filp;
> -			file_count = 1;
> -		} else if (file == req->rw.ki_filp) {
> -			file_count++;
> -		} else {
> -			fput_many(file, file_count);
> -			file = req->rw.ki_filp;
> -			file_count = 1;
> +		if (!(req->flags & REQ_F_FIXED_FILE)) {
> +			if (!file) {
> +				file = req->rw.ki_filp;
> +				file_count = 1;
> +			} else if (file == req->rw.ki_filp) {
> +				file_count++;
> +			} else {
> +				fput_many(file, file_count);
> +				file = req->rw.ki_filp;
> +				file_count = 1;
> +			}
>  		}
>  
>  		if (to_free == ARRAY_SIZE(reqs))
> @@ -520,13 +533,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
>  	}
>  }
>  
> +static void io_fput(struct io_kiocb *req)
> +{
> +	if (!(req->flags & REQ_F_FIXED_FILE))
> +		fput(req->rw.ki_filp);
> +}
> +
>  static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
>  {
>  	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
>  
>  	kiocb_end_write(kiocb);
>  
> -	fput(kiocb->ki_filp);
> +	io_fput(req);
>  	io_cqring_add_event(req->ctx, req->user_data, res, 0);
>  	io_free_req(req);
>  }
> @@ -642,19 +661,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  {
>  	struct io_ring_ctx *ctx = req->ctx;
>  	struct kiocb *kiocb = &req->rw;
> -	unsigned ioprio;
> +	unsigned ioprio, flags;
>  	int fd, ret;
>  
>  	/* For -EAGAIN retry, everything is already prepped */
>  	if (kiocb->ki_filp)
>  		return 0;
>  
> +	flags = READ_ONCE(sqe->flags);
>  	fd = READ_ONCE(sqe->fd);
> -	kiocb->ki_filp = io_file_get(state, fd);
> -	if (unlikely(!kiocb->ki_filp))
> -		return -EBADF;
> -	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
> -		force_nonblock = false;
> +
> +	if (flags & IOSQE_FIXED_FILE) {
> +		if (unlikely(!ctx->user_files ||
> +		    (unsigned) fd >= ctx->nr_user_files))
> +			return -EBADF;
> +		kiocb->ki_filp = ctx->user_files[fd];
> +		req->flags |= REQ_F_FIXED_FILE;
> +	} else {
> +		kiocb->ki_filp = io_file_get(state, fd);
> +		if (unlikely(!kiocb->ki_filp))
> +			return -EBADF;
> +		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
> +			force_nonblock = false;
> +	}
>  	kiocb->ki_pos = READ_ONCE(sqe->off);
>  	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
>  	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> @@ -694,10 +723,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  	}
>  	return 0;
>  out_fput:
> -	/* in case of error, we didn't use this file reference. drop it. */
> -	if (state)
> -		state->used_refs--;
> -	io_file_put(state, kiocb->ki_filp);
> +	if (!(flags & IOSQE_FIXED_FILE)) {
> +		/*
> +		 * in case of error, we didn't use this file reference. drop it.
> +		 */
> +		if (state)
> +			state->used_refs--;
> +		io_file_put(state, kiocb->ki_filp);
> +	}
>  	return ret;
>  }
>  
> @@ -837,7 +870,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
>  out_fput:
>  	/* Hold on to the file for -EAGAIN */
>  	if (unlikely(ret && ret != -EAGAIN))
> -		fput(file);
> +		io_fput(req);
>  	return ret;
>  }
>  
> @@ -891,7 +924,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
>  	kfree(iovec);
>  out_fput:
>  	if (unlikely(ret))
> -		fput(file);
> +		io_fput(req);
>  	return ret;
>  }
>  
> @@ -914,7 +947,8 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
>  	 */
>  	if (req->rw.ki_filp) {
>  		err = -EBADF;
> -		fput(req->rw.ki_filp);
> +		if (!(req->flags & REQ_F_FIXED_FILE))
> +			fput(req->rw.ki_filp);
>  	}
>  	io_cqring_add_event(ctx, user_data, err, 0);
>  	io_free_req(req);
> @@ -923,21 +957,32 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
>  
>  static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  {
> +	struct io_ring_ctx *ctx = req->ctx;
> +	unsigned flags;
>  	int fd;
>  
>  	/* Prep already done */
>  	if (req->rw.ki_filp)
>  		return 0;
>  
> -	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
> +	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
>  		return -EINVAL;
>  	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
>  		return -EINVAL;
>  
>  	fd = READ_ONCE(sqe->fd);
> -	req->rw.ki_filp = fget(fd);
> -	if (unlikely(!req->rw.ki_filp))
> -		return -EBADF;
> +	flags = READ_ONCE(sqe->flags);
> +
> +	if (flags & IOSQE_FIXED_FILE) {
> +		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
> +			return -EBADF;
> +		req->rw.ki_filp = ctx->user_files[fd];
> +		req->flags |= REQ_F_FIXED_FILE;
> +	} else {
> +		req->rw.ki_filp = fget(fd);
> +		if (unlikely(!req->rw.ki_filp))
> +			return -EBADF;
> +	}
>  
>  	return 0;
>  }
> @@ -967,7 +1012,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  				end > 0 ? end : LLONG_MAX,
>  				fsync_flags & IORING_FSYNC_DATASYNC);
>  
> -	fput(req->rw.ki_filp);
> +	if (!(req->flags & REQ_F_FIXED_FILE))
> +		fput(req->rw.ki_filp);
>  	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
>  	io_free_req(req);
>  	return 0;
> @@ -1104,7 +1150,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
>  	ssize_t ret;
>  
>  	/* enforce forwards compatibility on users */
> -	if (unlikely(s->sqe->flags))
> +	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
>  		return -EINVAL;
>  
>  	req = io_get_req(ctx, state);
> @@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>  }
>  
> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +#if defined(CONFIG_UNIX)
> +	if (ctx->ring_sock) {
> +		struct sock *sock = ctx->ring_sock->sk;
> +		struct sk_buff *skb;
> +
> +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)

Something's still a bit messy with destruction.  I get a traceback here
while running

  /* uses the uapi header added by this series */
  #include <linux/io_uring.h>
  #include <unistd.h>

  int main() {
    struct io_uring_params uring_params = {
        .flags = IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL,
    };
    int uring_fd =
        syscall(425 /* io_uring_setup */, 16, &uring_params);

    const __s32 fds[] = {1};

    syscall(427 /* io_uring_register */, uring_fd,
            IORING_REGISTER_FILES, fds, sizeof(fds) / sizeof(*fds));
    return 0;
  }

I end up with the following spew:

[  195.983322] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:500 unix_sock_destructor+0x97/0xc0
[  195.989556] Modules linked in:
[  195.992738] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
[  196.000926] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  196.008316] RIP: 0010:unix_sock_destructor+0x97/0xc0
[  196.010912] Code: 3f 37 f3 ff 5b 5d be 00 02 00 00 48 c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b <0f> 0b 48 83 bb 68 02 00 00 00 74 89 0f 0b eb 85 48 89 de 48 c7 c7
[  196.018887] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
[  196.020754] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
[  196.022811] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
[  196.024901] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
[  196.026977] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
[  196.029119] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
[  196.031071] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
[  196.033069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.034438] CR2: 00007f1a8aba5920 CR3: 000000000260e004 CR4: 00000000003606a0
[  196.036310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  196.038399] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  196.039794] Call Trace:
[  196.040259]  __sk_destruct+0x1c/0x150
[  196.040964]  ? io_sqe_files_unregister+0x32/0x70
[  196.041841]  unix_destruct_scm+0x76/0xa0
[  196.042587]  skb_release_head_state+0x38/0x60
[  196.043401]  skb_release_all+0x9/0x20
[  196.044034]  kfree_skb+0x2d/0xb0
[  196.044603]  io_sqe_files_unregister+0x32/0x70
[  196.045385]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
[  196.046220]  io_uring_release+0x17/0x20
[  196.046881]  __fput+0x9d/0x1d0
[  196.047421]  task_work_run+0x7a/0x90
[  196.048045]  do_exit+0x301/0xc20
[  196.048626]  ? handle_mm_fault+0xf3/0x230
[  196.049321]  do_group_exit+0x35/0xa0
[  196.049944]  __x64_sys_exit_group+0xf/0x10
[  196.050658]  do_syscall_64+0x3d/0xf0
[  196.051317]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  196.052217] RIP: 0033:0x7f1a8aba5956
[  196.052859] Code: Bad RIP value.
[  196.053488] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  196.054902] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
[  196.056124] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  196.057348] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
[  196.058573] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
[  196.059459] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
[  196.060731] ---[ end trace 8a7e42f9199e5f92 ]---
[  196.062671] WARNING: CPU: 1 PID: 1938 at ../net/unix/af_unix.c:501 unix_sock_destructor+0xa3/0xc0
[  196.064372] Modules linked in:
[  196.064966] CPU: 1 PID: 1938 Comm: aio_buffered Tainted: G        W         5.0.0-rc5+ #379
[  196.066546] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  196.068234] RIP: 0010:unix_sock_destructor+0xa3/0xc0
[  196.068999] Code: c7 c7 6c 5b 9a 81 e9 8c 2a 71 ff 48 89 ef e8 c4 dc 87 ff eb be 0f 0b 48 83 7b 70 00 74 8b 0f 0b 48 83 bb 68 02 00 00 00 74 89 <0f> 0b eb 85 48 89 de 48 c7 c7 a0 c8 42 82 5b 5d e9 31 8c 75 ff 0f
[  196.072577] RSP: 0018:ffffc900008a7d40 EFLAGS: 00010282
[  196.073595] RAX: 0000000000000000 RBX: ffff8881351dd000 RCX: 0000000000000000
[  196.074973] RDX: 0000000000000001 RSI: 0000000000000282 RDI: 00000000ffffffff
[  196.076348] RBP: ffff8881351dd000 R08: 0000000000024120 R09: ffffffff819a97fe
[  196.077709] R10: ffffea0004cf6800 R11: 00000000005b8d80 R12: ffffffff81294ec2
[  196.079072] R13: ffff888134e27b40 R14: ffff88813bb307a0 R15: ffff888133d59910
[  196.080441] FS:  00007f1a8a8c3740(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
[  196.082026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  196.083131] CR2: 00007fbc19f96550 CR3: 0000000138d1e003 CR4: 00000000003606a0
[  196.084505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  196.085823] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  196.087185] Call Trace:
[  196.087662]  __sk_destruct+0x1c/0x150
[  196.088376]  ? io_sqe_files_unregister+0x32/0x70
[  196.089299]  unix_destruct_scm+0x76/0xa0
[  196.090059]  skb_release_head_state+0x38/0x60
[  196.090929]  skb_release_all+0x9/0x20
[  196.091550]  kfree_skb+0x2d/0xb0
[  196.092745]  io_sqe_files_unregister+0x32/0x70
[  196.093535]  io_ring_ctx_wait_and_kill+0xf6/0x1a0
[  196.094358]  io_uring_release+0x17/0x20
[  196.095029]  __fput+0x9d/0x1d0
[  196.095660]  task_work_run+0x7a/0x90
[  196.096307]  do_exit+0x301/0xc20
[  196.096808]  ? handle_mm_fault+0xf3/0x230
[  196.097504]  do_group_exit+0x35/0xa0
[  196.098126]  __x64_sys_exit_group+0xf/0x10
[  196.098836]  do_syscall_64+0x3d/0xf0
[  196.099460]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  196.100334] RIP: 0033:0x7f1a8aba5956
[  196.100958] Code: Bad RIP value.
[  196.101293] RSP: 002b:00007fffbdbcad38 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  196.101933] RAX: ffffffffffffffda RBX: 00007f1a8ac975c0 RCX: 00007f1a8aba5956
[  196.102535] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  196.103137] RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff78
[  196.103739] R10: 00007fffbdbcabf8 R11: 0000000000000246 R12: 00007f1a8ac975c0
[  196.104526] R13: 0000000000000001 R14: 00007f1a8aca0288 R15: 0000000000000000
[  196.105777] ---[ end trace 8a7e42f9199e5f93 ]---
[  196.107535] unix: Attempt to release alive unix socket: 000000003b3c1a34

which corresponds to the WARN_ONs:

	WARN_ON(!sk_unhashed(sk));
	WARN_ON(sk->sk_socket);

This doesn't seem to happen if I omit the call to io_uring_register.

> +			kfree_skb(skb);
> +	}
> +#else
> +	int i;
> +
> +	for (i = 0; i < ctx->nr_user_files; i++)
> +		fput(ctx->user_files[i]);
> +#endif
> +}
> +
> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +	if (!ctx->user_files)
> +		return -ENXIO;
> +
> +	__io_sqe_files_unregister(ctx);
> +	kfree(ctx->user_files);
> +	ctx->user_files = NULL;
> +	return 0;
> +}
> +
> +#if defined(CONFIG_UNIX)
> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
> +{
> +	struct scm_fp_list *fpl;
> +	struct sk_buff *skb;
> +	int i;
> +
> +	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
> +	if (!fpl)
> +		return -ENOMEM;
> +
> +	skb = alloc_skb(0, GFP_KERNEL);
> +	if (!skb) {
> +		kfree(fpl);
> +		return -ENOMEM;
> +	}
> +
> +	skb->sk = ctx->ring_sock->sk;
> +	skb->destructor = unix_destruct_scm;
> +
> +	fpl->user = get_uid(ctx->user);
> +	for (i = 0; i < nr; i++) {
> +		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
> +		unix_inflight(fpl->user, fpl->fp[i]);
> +	}
> +
> +	fpl->max = fpl->count = nr;
> +	UNIXCB(skb).fp = fpl;
> +	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
> +
> +	for (i = 0; i < nr; i++)
> +		fput(fpl->fp[i]);
> +
> +	return 0;
> +}
> +
> +/*
> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
> + * causes regular reference counting to break down. We rely on the UNIX
> + * garbage collection to take care of this problem for us.
> + */
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +	unsigned left, total;
> +	int ret = 0;
> +
> +	total = 0;
> +	left = ctx->nr_user_files;
> +	while (left) {
> +		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
> +		int ret;
> +
> +		ret = __io_sqe_files_scm(ctx, this_files, total);
> +		if (ret)
> +			break;
> +		left -= this_files;
> +		total += this_files;
> +	}
> +
> +	return ret;
> +}
> +#else
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +	return 0;
> +}
> +#endif
> +
> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
> +				 unsigned nr_args)
> +{
> +	__s32 __user *fds = (__s32 __user *) arg;
> +	int fd, ret = 0;
> +	unsigned i;
> +
> +	if (ctx->user_files)
> +		return -EBUSY;
> +	if (!nr_args)
> +		return -EINVAL;
> +	if (nr_args > IORING_MAX_FIXED_FILES)
> +		return -EMFILE;
> +
> +	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
> +	if (!ctx->user_files)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_args; i++) {
> +		ret = -EFAULT;
> +		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
> +			break;
> +
> +		ctx->user_files[i] = fget(fd);
> +
> +		ret = -EBADF;
> +		if (!ctx->user_files[i])
> +			break;
> +		/*
> +		 * Don't allow io_uring instances to be registered. If UNIX
> +		 * isn't enabled, then this causes a reference cycle and this
> +		 * instance can never get freed. If UNIX is enabled we'll
> +		 * handle it just fine, but there's still no point in allowing
> +		 * a ring fd as it doesn't support regular read/write anyway.
> +		 */
> +		if (ctx->user_files[i]->f_op == &io_uring_fops) {
> +			fput(ctx->user_files[i]);
> +			break;
> +		}
> +		ctx->nr_user_files++;
> +		ret = 0;
> +	}
> +
> +	if (!ret)
> +		ret = io_sqe_files_scm(ctx);
> +	if (ret)
> +		io_sqe_files_unregister(ctx);
> +
> +	return ret;
> +}
> +
>  static int io_sq_offload_start(struct io_ring_ctx *ctx)
>  {
>  	int ret;
> @@ -1560,14 +1754,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>  		destroy_workqueue(ctx->sqo_wq);
>  	if (ctx->sqo_mm)
>  		mmdrop(ctx->sqo_mm);
> +
> +	io_iopoll_reap_events(ctx);
> +	io_sqe_buffer_unregister(ctx);
> +	io_sqe_files_unregister(ctx);
> +
>  #if defined(CONFIG_UNIX)
>  	if (ctx->ring_sock)
>  		sock_release(ctx->ring_sock);
>  #endif
>  
> -	io_iopoll_reap_events(ctx);
> -	io_sqe_buffer_unregister(ctx);
> -
>  	io_mem_free(ctx->sq_ring);
>  	io_mem_free(ctx->sq_sqes);
>  	io_mem_free(ctx->cq_ring);
> @@ -1934,6 +2130,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>  			break;
>  		ret = io_sqe_buffer_unregister(ctx);
>  		break;
> +	case IORING_REGISTER_FILES:
> +		ret = io_sqe_files_register(ctx, arg, nr_args);
> +		break;
> +	case IORING_UNREGISTER_FILES:
> +		ret = -EINVAL;
> +		if (arg || nr_args)
> +			break;
> +		ret = io_sqe_files_unregister(ctx);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index cf28f7a11f12..6257478d55e9 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -16,7 +16,7 @@
>   */
>  struct io_uring_sqe {
>  	__u8	opcode;		/* type of operation for this sqe */
> -	__u8	flags;		/* as of now unused */
> +	__u8	flags;		/* IOSQE_ flags */
>  	__u16	ioprio;		/* ioprio for the request */
>  	__s32	fd;		/* file descriptor to do IO on */
>  	__u64	off;		/* offset into file */
> @@ -33,6 +33,11 @@ struct io_uring_sqe {
>  	};
>  };
>  
> +/*
> + * sqe->flags
> + */
> +#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
> +
>  /*
>   * io_uring_setup() flags
>   */
> @@ -113,5 +118,7 @@ struct io_uring_params {
>   */
>  #define IORING_REGISTER_BUFFERS		0
>  #define IORING_UNREGISTER_BUFFERS	1
> +#define IORING_REGISTER_FILES		2
> +#define IORING_UNREGISTER_FILES		3
>  
>  #endif

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [PATCH 14/19] io_uring: add file set registration
  2019-02-09 21:13 [PATCHSET v14] " Jens Axboe
@ 2019-02-09 21:13   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-09 21:13 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The arguments passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.
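
For illustration only (not part of the patch), here is a minimal userspace
sketch of the calling convention. It assumes io_uring_register(2) is
syscall 427 as wired up earlier in this series and that the uapi header
below is installed; register_ring_files() and prep_fixed_readv() are
made-up helper names, not anything this patch adds.

#include <string.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Pin an array of file descriptors for the lifetime of the ring. */
static int register_ring_files(int ring_fd, const __s32 *fds, unsigned nr)
{
	return syscall(427 /* io_uring_register */, ring_fd,
		       IORING_REGISTER_FILES, fds, nr);
}

/*
 * Prep a readv against a registered file: set IOSQE_FIXED_FILE and put
 * the index into the registered array in sqe->fd instead of a real fd.
 */
static void prep_fixed_readv(struct io_uring_sqe *sqe, int file_index,
			     const struct iovec *iovs, unsigned nr_iovs,
			     __u64 offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;
	sqe->fd = file_index;
	sqe->addr = (unsigned long) iovs;
	sqe->len = nr_iovs;
	sqe->off = offset;
}

Skipping the per-IO fget()/fput() via this index lookup is what the
fixed file set buys on the fast path.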

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 269 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |   9 +-
 2 files changed, 245 insertions(+), 33 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 09a3122b3b6c..c40a7ed2edd5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -29,6 +29,7 @@
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
+#include <net/scm.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -41,6 +42,7 @@
 #include "internal.h"
 
 #define IORING_MAX_ENTRIES	4096
+#define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
@@ -103,6 +105,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -150,6 +160,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -380,15 +391,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -520,13 +533,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -642,19 +661,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
 	/* For -EAGAIN retry, everything is already prepped */
 	if (kiocb->ki_filp)
 		return 0;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
-	if (unlikely(!kiocb->ki_filp))
-		return -EBADF;
-	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
-		force_nonblock = false;
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+		if (unlikely(!kiocb->ki_filp))
+			return -EBADF;
+		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+			force_nonblock = false;
+	}
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -694,10 +723,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	/* in case of error, we didn't use this file reference. drop it. */
-	if (state)
-		state->used_refs--;
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE)) {
+		/*
+		 * in case of error, we didn't use this file reference. drop it.
+		 */
+		if (state)
+			state->used_refs--;
+		io_file_put(state, kiocb->ki_filp);
+	}
 	return ret;
 }
 
@@ -837,7 +870,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 out_fput:
 	/* Hold on to the file for -EAGAIN */
 	if (unlikely(ret && ret != -EAGAIN))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -891,7 +924,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -914,7 +947,8 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	 */
 	if (req->rw.ki_filp) {
 		err = -EBADF;
-		fput(req->rw.ki_filp);
+		if (!(req->flags & REQ_F_FIXED_FILE))
+			fput(req->rw.ki_filp);
 	}
 	io_cqring_add_event(ctx, user_data, err, 0);
 	io_free_req(req);
@@ -923,21 +957,32 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 
 static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned flags;
 	int fd;
 
 	/* Prep already done */
 	if (req->rw.ki_filp)
 		return 0;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	req->rw.ki_filp = fget(fd);
-	if (unlikely(!req->rw.ki_filp))
-		return -EBADF;
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		req->rw.ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		req->rw.ki_filp = fget(fd);
+		if (unlikely(!req->rw.ki_filp))
+			return -EBADF;
+	}
 
 	return 0;
 }
@@ -967,7 +1012,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 				end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(req->rw.ki_filp);
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
 	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1104,7 +1150,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
 }
 
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock) {
+		struct sock *sock = ctx->ring_sock->sk;
+		struct sk_buff *skb;
+
+		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
+			kfree_skb(skb);
+	}
+#else
+	int i;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+#endif
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	__io_sqe_files_unregister(ctx);
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	return 0;
+}
+
+#if defined(CONFIG_UNIX)
+static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
+{
+	struct scm_fp_list *fpl;
+	struct sk_buff *skb;
+	int i;
+
+	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
+	if (!fpl)
+		return -ENOMEM;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb) {
+		kfree(fpl);
+		return -ENOMEM;
+	}
+
+	skb->sk = ctx->ring_sock->sk;
+	skb->destructor = unix_destruct_scm;
+
+	fpl->user = get_uid(ctx->user);
+	for (i = 0; i < nr; i++) {
+		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
+		unix_inflight(fpl->user, fpl->fp[i]);
+	}
+
+	fpl->max = fpl->count = nr;
+	UNIXCB(skb).fp = fpl;
+	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
+
+	for (i = 0; i < nr; i++)
+		fput(fpl->fp[i]);
+
+	return 0;
+}
+
+/*
+ * If UNIX sockets are enabled, fd passing can cause a reference cycle which
+ * causes regular reference counting to break down. We rely on the UNIX
+ * garbage collection to take care of this problem for us.
+ */
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	unsigned left, total;
+	int ret = 0;
+
+	total = 0;
+	left = ctx->nr_user_files;
+	while (left) {
+		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
+		int ret;
+
+		ret = __io_sqe_files_scm(ctx, this_files, total);
+		if (ret)
+			break;
+		left -= this_files;
+		total += this_files;
+	}
+
+	return ret;
+}
+#else
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	return 0;
+}
+#endif
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_FIXED_FILES)
+		return -EMFILE;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		/*
+		 * Don't allow io_uring instances to be registered. If UNIX
+		 * isn't enabled, then this causes a reference cycle and this
+		 * instance can never get freed. If UNIX is enabled we'll
+		 * handle it just fine, but there's still no point in allowing
+		 * a ring fd as it doesn't support regular read/write anyway.
+		 */
+		if (ctx->user_files[i]->f_op == &io_uring_fops) {
+			fput(ctx->user_files[i]);
+			break;
+		}
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (!ret)
+		ret = io_sqe_files_scm(ctx);
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1560,14 +1754,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
 		mmdrop(ctx->sqo_mm);
+
+	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
+
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
 		sock_release(ctx->ring_sock);
 #endif
 
-	io_iopoll_reap_events(ctx);
-	io_sqe_buffer_unregister(ctx);
-
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
 	io_mem_free(ctx->cq_ring);
@@ -1934,6 +2130,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cf28f7a11f12..6257478d55e9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -113,5 +118,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 14/19] io_uring: add file set registration
@ 2019-02-09 21:13   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-09 21:13 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The arguments passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 269 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |   9 +-
 2 files changed, 245 insertions(+), 33 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 09a3122b3b6c..c40a7ed2edd5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -29,6 +29,7 @@
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
+#include <net/scm.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -41,6 +42,7 @@
 #include "internal.h"
 
 #define IORING_MAX_ENTRIES	4096
+#define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
@@ -103,6 +105,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -150,6 +160,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -380,15 +391,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -520,13 +533,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -642,19 +661,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
 	/* For -EAGAIN retry, everything is already prepped */
 	if (kiocb->ki_filp)
 		return 0;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
-	if (unlikely(!kiocb->ki_filp))
-		return -EBADF;
-	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
-		force_nonblock = false;
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+		if (unlikely(!kiocb->ki_filp))
+			return -EBADF;
+		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+			force_nonblock = false;
+	}
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -694,10 +723,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	/* in case of error, we didn't use this file reference. drop it. */
-	if (state)
-		state->used_refs--;
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE)) {
+		/*
+		 * in case of error, we didn't use this file reference. drop it.
+		 */
+		if (state)
+			state->used_refs--;
+		io_file_put(state, kiocb->ki_filp);
+	}
 	return ret;
 }
 
@@ -837,7 +870,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 out_fput:
 	/* Hold on to the file for -EAGAIN */
 	if (unlikely(ret && ret != -EAGAIN))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -891,7 +924,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -914,7 +947,8 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 	 */
 	if (req->rw.ki_filp) {
 		err = -EBADF;
-		fput(req->rw.ki_filp);
+		if (!(req->flags & REQ_F_FIXED_FILE))
+			fput(req->rw.ki_filp);
 	}
 	io_cqring_add_event(ctx, user_data, err, 0);
 	io_free_req(req);
@@ -923,21 +957,32 @@ static int io_nop(struct io_kiocb *req, u64 user_data)
 
 static int io_prep_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
+	struct io_ring_ctx *ctx = req->ctx;
+	unsigned flags;
 	int fd;
 
 	/* Prep already done */
 	if (req->rw.ki_filp)
 		return 0;
 
-	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+	if (unlikely(ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 	if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index))
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	req->rw.ki_filp = fget(fd);
-	if (unlikely(!req->rw.ki_filp))
-		return -EBADF;
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		req->rw.ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		req->rw.ki_filp = fget(fd);
+		if (unlikely(!req->rw.ki_filp))
+			return -EBADF;
+	}
 
 	return 0;
 }
@@ -967,7 +1012,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 				end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(req->rw.ki_filp);
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
 	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1104,7 +1150,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1292,6 +1338,154 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
 }
 
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock) {
+		struct sock *sock = ctx->ring_sock->sk;
+		struct sk_buff *skb;
+
+		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
+			kfree_skb(skb);
+	}
+#else
+	int i;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+#endif
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	__io_sqe_files_unregister(ctx);
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	return 0;
+}
+
+#if defined(CONFIG_UNIX)
+static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
+{
+	struct scm_fp_list *fpl;
+	struct sk_buff *skb;
+	int i;
+
+	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
+	if (!fpl)
+		return -ENOMEM;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb) {
+		kfree(fpl);
+		return -ENOMEM;
+	}
+
+	skb->sk = ctx->ring_sock->sk;
+	skb->destructor = unix_destruct_scm;
+
+	fpl->user = get_uid(ctx->user);
+	for (i = 0; i < nr; i++) {
+		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
+		unix_inflight(fpl->user, fpl->fp[i]);
+	}
+
+	fpl->max = fpl->count = nr;
+	UNIXCB(skb).fp = fpl;
+	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
+
+	for (i = 0; i < nr; i++)
+		fput(fpl->fp[i]);
+
+	return 0;
+}
+
+/*
+ * If UNIX sockets are enabled, fd passing can cause a reference cycle which
+ * causes regular reference counting to break down. We rely on the UNIX
+ * garbage collection to take care of this problem for us.
+ */
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	unsigned left, total;
+	int ret = 0;
+
+	total = 0;
+	left = ctx->nr_user_files;
+	while (left) {
+		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
+		int ret;
+
+		ret = __io_sqe_files_scm(ctx, this_files, total);
+		if (ret)
+			break;
+		left -= this_files;
+		total += this_files;
+	}
+
+	return ret;
+}
+#else
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	return 0;
+}
+#endif
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_FIXED_FILES)
+		return -EMFILE;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		/*
+		 * Don't allow io_uring instances to be registered. If UNIX
+		 * isn't enabled, then this causes a reference cycle and this
+		 * instance can never get freed. If UNIX is enabled we'll
+		 * handle it just fine, but there's still no point in allowing
+		 * a ring fd as it doesn't support regular read/write anyway.
+		 */
+		if (ctx->user_files[i]->f_op == &io_uring_fops) {
+			fput(ctx->user_files[i]);
+			break;
+		}
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (!ret)
+		ret = io_sqe_files_scm(ctx);
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1560,14 +1754,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
 		mmdrop(ctx->sqo_mm);
+
+	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
+
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
 		sock_release(ctx->ring_sock);
 #endif
 
-	io_iopoll_reap_events(ctx);
-	io_sqe_buffer_unregister(ctx);
-
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
 	io_mem_free(ctx->cq_ring);
@@ -1934,6 +2130,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cf28f7a11f12..6257478d55e9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -113,5 +118,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-08 17:34   ` Jens Axboe
@ 2019-02-09  9:50     ` Hannes Reinecke
  -1 siblings, 0 replies; 128+ messages in thread
From: Hannes Reinecke @ 2019-02-09  9:50 UTC (permalink / raw)
  To: Jens Axboe, linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro

On 2/8/19 6:34 PM, Jens Axboe wrote:
> We normally have to fget/fput for each IO we do on a file. Even with
> the batching we do, the cost of the atomic inc/dec of the file usage
> count adds up.
> 
> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> for the io_uring_register(2) system call. The arguments passed in must
> be an array of __s32 holding file descriptors, and nr_args should hold
> the number of file descriptors the application wishes to pin for the
> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
> called).
> 
> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> to the index in the array passed in to IORING_REGISTER_FILES.
> 
> Files are automatically unregistered when the io_uring instance is torn
> down. An application need only unregister if it wishes to register a new
> set of fds.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   fs/io_uring.c                 | 256 ++++++++++++++++++++++++++++++----
>   include/uapi/linux/io_uring.h |   9 +-
>   2 files changed, 235 insertions(+), 30 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes



^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-08 20:26     ` Jann Horn
@ 2019-02-09  0:16       ` Jens Axboe
  -1 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-09  0:16 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On 2/8/19 1:26 PM, Jann Horn wrote:
> On Fri, Feb 8, 2019 at 6:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>> We normally have to fget/fput for each IO we do on a file. Even with
>> the batching we do, the cost of the atomic inc/dec of the file usage
>> count adds up.
>>
>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>> for the io_uring_register(2) system call. The arguments passed in must
>> be an array of __s32 holding file descriptors, and nr_args should hold
>> the number of file descriptors the application wishes to pin for the
>> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
>> called).
>>
>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>> to the index in the array passed in to IORING_REGISTER_FILES.
>>
>> Files are automatically unregistered when the io_uring instance is torn
>> down. An application need only unregister if it wishes to register a new
>> set of fds.
> 
> I think the overall concept here is still broken: You're giving the
> user_files to the GC, and I think the GC can drop their refcounts, but
> I don't see you actually getting feedback from the GC anywhere that
> would let the GC break your references? E.g. in io_prep_rw() you grab
> file pointers from ctx->user_files after simply checking
> ctx->nr_user_files, and there is no path from the GC that touches
> those fields. As far as I can tell, the GC is just going to go through
> unix_destruct_scm() and drop references on your files, causing
> use-after-free.
> 
> But the unix GC is complicated, and maybe I'm just missing something...

Only when the skb is released, which happens either when the io_uring
is torn down (and then it's definitely safe), or when the socket is
released, which is also a safe time.

>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +#if defined(CONFIG_UNIX)
>> +       if (ctx->ring_sock) {
>> +               struct sock *sock = ctx->ring_sock->sk;
>> +               struct sk_buff *skb;
>> +
>> +               while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
>> +                       kfree_skb(skb);
>> +       }
>> +#else
>> +       int i;
>> +
>> +       for (i = 0; i < ctx->nr_user_files; i++)
>> +               fput(ctx->user_files[i]);
>> +#endif
>> +}
>> +
>> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +       if (!ctx->user_files)
>> +               return -ENXIO;
>> +
>> +       __io_sqe_files_unregister(ctx);
>> +       kfree(ctx->user_files);
>> +       ctx->user_files = NULL;
>> +       return 0;
>> +}
>> +
>> +#if defined(CONFIG_UNIX)
>> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
>> +{
>> +       struct scm_fp_list *fpl;
>> +       struct sk_buff *skb;
>> +       int i;
>> +
>> +       fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
>> +       if (!fpl)
>> +               return -ENOMEM;
>> +
>> +       skb = alloc_skb(0, GFP_KERNEL);
>> +       if (!skb) {
>> +               kfree(fpl);
>> +               return -ENOMEM;
>> +       }
>> +
>> +       skb->sk = ctx->ring_sock->sk;
>> +       skb->destructor = unix_destruct_scm;
>> +
>> +       fpl->user = get_uid(ctx->user);
>> +       for (i = 0; i < nr; i++) {
>> +               fpl->fp[i] = get_file(ctx->user_files[i + offset]);
>> +               unix_inflight(fpl->user, fpl->fp[i]);
>> +               fput(fpl->fp[i]);
> 
> This pattern is almost always superfluous. You increment the file's
> refcount, maybe insert the file into a list (essentially), and drop
> the file's refcount back down. You're already holding a stable
> reference, and you're not temporarily lending that to anyone else.

Actually, this is me messing up. The fput() should be done AFTER
adding to the socket. I'll fix that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 128+ messages in thread

* Re: [PATCH 14/19] io_uring: add file set registration
  2019-02-08 17:34   ` Jens Axboe
@ 2019-02-08 20:26     ` Jann Horn
  -1 siblings, 0 replies; 128+ messages in thread
From: Jann Horn @ 2019-02-08 20:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-aio, linux-block, Linux API, hch, jmoyer, Avi Kivity, Al Viro

On Fri, Feb 8, 2019 at 6:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> We normally have to fget/fput for each IO we do on a file. Even with
> the batching we do, the cost of the atomic inc/dec of the file usage
> count adds up.
>
> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> for the io_uring_register(2) system call. The arguments passed in must
> be an array of __s32 holding file descriptors, and nr_args should hold
> the number of file descriptors the application wishes to pin for the
> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
> called).
>
> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> to the index in the array passed in to IORING_REGISTER_FILES.
>
> Files are automatically unregistered when the io_uring instance is torn
> down. An application need only unregister if it wishes to register a new
> set of fds.

I think the overall concept here is still broken: You're giving the
user_files to the GC, and I think the GC can drop their refcounts, but
I don't see you actually getting feedback from the GC anywhere that
would let the GC break your references? E.g. in io_prep_rw() you grab
file pointers from ctx->user_files after simply checking
ctx->nr_user_files, and there is no path from the GC that touches
those fields. As far as I can tell, the GC is just going to go through
unix_destruct_scm() and drop references on your files, causing
use-after-free.

But the unix GC is complicated, and maybe I'm just missing something...

> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +#if defined(CONFIG_UNIX)
> +       if (ctx->ring_sock) {
> +               struct sock *sock = ctx->ring_sock->sk;
> +               struct sk_buff *skb;
> +
> +               while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
> +                       kfree_skb(skb);
> +       }
> +#else
> +       int i;
> +
> +       for (i = 0; i < ctx->nr_user_files; i++)
> +               fput(ctx->user_files[i]);
> +#endif
> +}
> +
> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +       if (!ctx->user_files)
> +               return -ENXIO;
> +
> +       __io_sqe_files_unregister(ctx);
> +       kfree(ctx->user_files);
> +       ctx->user_files = NULL;
> +       return 0;
> +}
> +
> +#if defined(CONFIG_UNIX)
> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
> +{
> +       struct scm_fp_list *fpl;
> +       struct sk_buff *skb;
> +       int i;
> +
> +       fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
> +       if (!fpl)
> +               return -ENOMEM;
> +
> +       skb = alloc_skb(0, GFP_KERNEL);
> +       if (!skb) {
> +               kfree(fpl);
> +               return -ENOMEM;
> +       }
> +
> +       skb->sk = ctx->ring_sock->sk;
> +       skb->destructor = unix_destruct_scm;
> +
> +       fpl->user = get_uid(ctx->user);
> +       for (i = 0; i < nr; i++) {
> +               fpl->fp[i] = get_file(ctx->user_files[i + offset]);
> +               unix_inflight(fpl->user, fpl->fp[i]);
> +               fput(fpl->fp[i]);

This pattern is almost always superfluous. You increment the file's
refcount, maybe insert the file into a list (essentially), and drop
the file's refcount back down. You're already holding a stable
reference, and you're not temporarily lending that to anyone else.

> +       }
> +
> +       fpl->max = fpl->count = nr;
> +       UNIXCB(skb).fp = fpl;
> +       skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
> +       return 0;
> +}
> +
> +/*
> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
> + * causes regular reference counting to break down. We rely on the UNIX
> + * garbage collection to take care of this problem for us.
> + */
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +       unsigned left, total;
> +       int ret = 0;
> +
> +       total = 0;
> +       left = ctx->nr_user_files;
> +       while (left) {
> +               unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
> +               int ret;
> +
> +               ret = __io_sqe_files_scm(ctx, this_files, total);
> +               if (ret)
> +                       break;
> +               left -= this_files;
> +               total += this_files;
> +       }
> +
> +       return ret;
> +}
> +#else
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +       return 0;
> +}
> +#endif
> +
> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
> +                                unsigned nr_args)
> +{
> +       __s32 __user *fds = (__s32 __user *) arg;
> +       int fd, ret = 0;
> +       unsigned i;
> +
> +       if (ctx->user_files)
> +               return -EBUSY;
> +       if (!nr_args)
> +               return -EINVAL;
> +       if (nr_args > IORING_MAX_FIXED_FILES)
> +               return -EMFILE;
> +
> +       ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
> +       if (!ctx->user_files)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nr_args; i++) {
> +               ret = -EFAULT;
> +               if (copy_from_user(&fd, &fds[i], sizeof(fd)))
> +                       break;
> +
> +               ctx->user_files[i] = fget(fd);
> +
> +               ret = -EBADF;
> +               if (!ctx->user_files[i])
> +                       break;
> +               /*
> +                * Don't allow io_uring instances to be registered. If UNIX
> +                * isn't enabled, then this causes a reference cycle and this
> +                * instance can never get freed. If UNIX is enabled we'll
> +                * handle it just fine, but there's still no point in allowing
> +                * a ring fd as it doesn't suppor regular read/write anyway.

nit: support

> +                */
> +               if (ctx->user_files[i]->f_op == &io_uring_fops) {
> +                       fput(ctx->user_files[i]);
> +                       break;
> +               }
> +               ctx->nr_user_files++;
> +               ret = 0;
> +       }
> +
> +       if (!ret)
> +               ret = io_sqe_files_scm(ctx);
> +       if (ret)
> +               io_sqe_files_unregister(ctx);
> +
> +       return ret;
> +}
> +
>  static int io_sq_offload_start(struct io_ring_ctx *ctx)
>  {
>         int ret;
> @@ -1521,14 +1708,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
>                 destroy_workqueue(ctx->sqo_wq);
>         if (ctx->sqo_mm)
>                 mmdrop(ctx->sqo_mm);
> +
> +       io_iopoll_reap_events(ctx);
> +       io_sqe_buffer_unregister(ctx);
> +       io_sqe_files_unregister(ctx);
> +
>  #if defined(CONFIG_UNIX)
>         if (ctx->ring_sock)
>                 sock_release(ctx->ring_sock);
>  #endif
>
> -       io_iopoll_reap_events(ctx);
> -       io_sqe_buffer_unregister(ctx);

^ permalink raw reply	[flat|nested] 128+ messages in thread

* [PATCH 14/19] io_uring: add file set registration
  2019-02-08 17:34 [PATCHSET v13] io_uring IO interface Jens Axboe
@ 2019-02-08 17:34   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-08 17:34 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The arguments passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.
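
As a rough userspace illustration (a sketch only, not part of this patch):
the snippet below assumes the uapi header from this series is installed,
that __NR_io_uring_register is available (it may need to be defined by hand
with older toolchains), and that the application already has a ring fd plus
its own way of obtaining and submitting sqes.

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <linux/io_uring.h>

/*
 * Pin a set of descriptors for the lifetime of the ring (or until
 * IORING_UNREGISTER_FILES). 'fds' holds the real fds; once registered,
 * sqe->fd is interpreted as an index into this array whenever
 * IOSQE_FIXED_FILE is set.
 */
static int register_files(int ring_fd, const __s32 *fds, unsigned int nr)
{
	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_FILES, fds, nr);
}

/* Prepare a readv against registered file number 'file_index'. */
static void prep_fixed_readv(struct io_uring_sqe *sqe, unsigned int file_index,
			     const struct iovec *iov, unsigned int nr_vecs,
			     off_t offset, __u64 user_data)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;	/* fd below is an index, not a real fd */
	sqe->fd = file_index;
	sqe->off = offset;
	sqe->addr = (unsigned long) iov;
	sqe->len = nr_vecs;
	sqe->user_data = user_data;
}

Unregistering and re-registering is only needed if the application wants to
switch to a different set of files, per the last paragraph above.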

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 256 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |   9 +-
 2 files changed, 235 insertions(+), 30 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 50c48e43d56e..244fb71e3424 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -29,6 +29,7 @@
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
+#include <net/scm.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -41,6 +42,7 @@
 #include "internal.h"
 
 #define IORING_MAX_ENTRIES	4096
+#define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
@@ -102,6 +104,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -149,6 +159,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -376,15 +387,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -516,13 +529,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -638,19 +657,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
 	/* For -EAGAIN retry, everything is already prepped */
 	if (kiocb->ki_filp)
 		return 0;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
-	if (unlikely(!kiocb->ki_filp))
-		return -EBADF;
-	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
-		force_nonblock = false;
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+		if (unlikely(!kiocb->ki_filp))
+			return -EBADF;
+		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+			force_nonblock = false;
+	}
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -690,10 +719,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	/* in case of error, we didn't use this file reference. drop it. */
-	if (state)
-		state->used_refs--;
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE)) {
+		/*
+		 * in case of error, we didn't use this file reference. drop it.
+		 */
+		if (state)
+			state->used_refs--;
+		io_file_put(state, kiocb->ki_filp);
+	}
 	return ret;
 }
 
@@ -825,7 +858,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 out_fput:
 	/* Hold on to the file for -EAGAIN */
 	if (unlikely(ret && ret != -EAGAIN))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -879,7 +912,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -905,7 +938,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	loff_t sqe_off = READ_ONCE(sqe->off);
 	loff_t sqe_len = READ_ONCE(sqe->len);
 	loff_t end = sqe_off + sqe_len;
-	unsigned fsync_flags;
+	unsigned fsync_flags, flags;
 	struct file *file;
 	int ret, fd;
 
@@ -923,14 +956,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	file = fget(fd);
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		file = ctx->user_files[fd];
+	} else {
+		file = fget(fd);
+	}
 	if (unlikely(!file))
 		return -EBADF;
 
 	ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(file);
+	if (!(flags & IOSQE_FIXED_FILE))
+		fput(file);
 	io_cqring_add_event(ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1067,7 +1109,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1255,6 +1297,151 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	return ring->r.head == ring->r.tail ? ret : 0;
 }
 
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock) {
+		struct sock *sock = ctx->ring_sock->sk;
+		struct sk_buff *skb;
+
+		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
+			kfree_skb(skb);
+	}
+#else
+	int i;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+#endif
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	__io_sqe_files_unregister(ctx);
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	return 0;
+}
+
+#if defined(CONFIG_UNIX)
+static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
+{
+	struct scm_fp_list *fpl;
+	struct sk_buff *skb;
+	int i;
+
+	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
+	if (!fpl)
+		return -ENOMEM;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb) {
+		kfree(fpl);
+		return -ENOMEM;
+	}
+
+	skb->sk = ctx->ring_sock->sk;
+	skb->destructor = unix_destruct_scm;
+
+	fpl->user = get_uid(ctx->user);
+	for (i = 0; i < nr; i++) {
+		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
+		unix_inflight(fpl->user, fpl->fp[i]);
+		fput(fpl->fp[i]);
+	}
+
+	fpl->max = fpl->count = nr;
+	UNIXCB(skb).fp = fpl;
+	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
+	return 0;
+}
+
+/*
+ * If UNIX sockets are enabled, fd passing can cause a reference cycle which
+ * causes regular reference counting to break down. We rely on the UNIX
+ * garbage collection to take care of this problem for us.
+ */
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	unsigned left, total;
+	int ret = 0;
+
+	total = 0;
+	left = ctx->nr_user_files;
+	while (left) {
+		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
+		int ret;
+
+		ret = __io_sqe_files_scm(ctx, this_files, total);
+		if (ret)
+			break;
+		left -= this_files;
+		total += this_files;
+	}
+
+	return ret;
+}
+#else
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	return 0;
+}
+#endif
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_FIXED_FILES)
+		return -EMFILE;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		/*
+		 * Don't allow io_uring instances to be registered. If UNIX
+		 * isn't enabled, then this causes a reference cycle and this
+		 * instance can never get freed. If UNIX is enabled we'll
+		 * handle it just fine, but there's still no point in allowing
+		 * a ring fd as it doesn't suppor regular read/write anyway.
+		 */
+		if (ctx->user_files[i]->f_op == &io_uring_fops) {
+			fput(ctx->user_files[i]);
+			break;
+		}
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (!ret)
+		ret = io_sqe_files_scm(ctx);
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1521,14 +1708,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
 		mmdrop(ctx->sqo_mm);
+
+	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
+
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
 		sock_release(ctx->ring_sock);
 #endif
 
-	io_iopoll_reap_events(ctx);
-	io_sqe_buffer_unregister(ctx);
-
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
 	io_mem_free(ctx->cq_ring);
@@ -1886,6 +2075,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cf28f7a11f12..6257478d55e9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -113,5 +118,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 128+ messages in thread

* [PATCH 14/19] io_uring: add file set registration
@ 2019-02-08 17:34   ` Jens Axboe
  0 siblings, 0 replies; 128+ messages in thread
From: Jens Axboe @ 2019-02-08 17:34 UTC (permalink / raw)
  To: linux-aio, linux-block, linux-api
  Cc: hch, jmoyer, avi, jannh, viro, Jens Axboe

We normally have to fget/fput for each IO we do on a file. Even with
the batching we do, the cost of the atomic inc/dec of the file usage
count adds up.

This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
for the io_uring_register(2) system call. The arguments passed in must
be an array of __s32 holding file descriptors, and nr_args should hold
the number of file descriptors the application wishes to pin for the
duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
to the index in the array passed in to IORING_REGISTER_FILES.

Files are automatically unregistered when the io_uring instance is torn
down. An application need only unregister if it wishes to register a new
set of fds.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 256 ++++++++++++++++++++++++++++++----
 include/uapi/linux/io_uring.h |   9 +-
 2 files changed, 235 insertions(+), 30 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 50c48e43d56e..244fb71e3424 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -29,6 +29,7 @@
 #include <linux/net.h>
 #include <net/sock.h>
 #include <net/af_unix.h>
+#include <net/scm.h>
 #include <linux/anon_inodes.h>
 #include <linux/sched/mm.h>
 #include <linux/uaccess.h>
@@ -41,6 +42,7 @@
 #include "internal.h"
 
 #define IORING_MAX_ENTRIES	4096
+#define IORING_MAX_FIXED_FILES	1024
 
 struct io_uring {
 	u32 head ____cacheline_aligned_in_smp;
@@ -102,6 +104,14 @@ struct io_ring_ctx {
 		struct fasync_struct	*cq_fasync;
 	} ____cacheline_aligned_in_smp;
 
+	/*
+	 * If used, fixed file set. Writers must ensure that ->refs is dead,
+	 * readers must ensure that ->refs is alive as long as the file* is
+	 * used. Only updated through io_uring_register(2).
+	 */
+	struct file		**user_files;
+	unsigned		nr_user_files;
+
 	/* if used, fixed mapped user buffers */
 	unsigned		nr_user_bufs;
 	struct io_mapped_ubuf	*user_bufs;
@@ -149,6 +159,7 @@ struct io_kiocb {
 	unsigned int		flags;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
+#define REQ_F_FIXED_FILE	4	/* ctx owns file */
 	u64			user_data;
 	u64			error;
 
@@ -376,15 +387,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
 		 * Batched puts of the same file, to avoid dirtying the
 		 * file usage count multiple times, if avoidable.
 		 */
-		if (!file) {
-			file = req->rw.ki_filp;
-			file_count = 1;
-		} else if (file == req->rw.ki_filp) {
-			file_count++;
-		} else {
-			fput_many(file, file_count);
-			file = req->rw.ki_filp;
-			file_count = 1;
+		if (!(req->flags & REQ_F_FIXED_FILE)) {
+			if (!file) {
+				file = req->rw.ki_filp;
+				file_count = 1;
+			} else if (file == req->rw.ki_filp) {
+				file_count++;
+			} else {
+				fput_many(file, file_count);
+				file = req->rw.ki_filp;
+				file_count = 1;
+			}
 		}
 
 		if (to_free == ARRAY_SIZE(reqs))
@@ -516,13 +529,19 @@ static void kiocb_end_write(struct kiocb *kiocb)
 	}
 }
 
+static void io_fput(struct io_kiocb *req)
+{
+	if (!(req->flags & REQ_F_FIXED_FILE))
+		fput(req->rw.ki_filp);
+}
+
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
 
 	kiocb_end_write(kiocb);
 
-	fput(kiocb->ki_filp);
+	io_fput(req);
 	io_cqring_add_event(req->ctx, req->user_data, res, 0);
 	io_free_req(req);
 }
@@ -638,19 +657,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	struct kiocb *kiocb = &req->rw;
-	unsigned ioprio;
+	unsigned ioprio, flags;
 	int fd, ret;
 
 	/* For -EAGAIN retry, everything is already prepped */
 	if (kiocb->ki_filp)
 		return 0;
 
+	flags = READ_ONCE(sqe->flags);
 	fd = READ_ONCE(sqe->fd);
-	kiocb->ki_filp = io_file_get(state, fd);
-	if (unlikely(!kiocb->ki_filp))
-		return -EBADF;
-	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
-		force_nonblock = false;
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files ||
+		    (unsigned) fd >= ctx->nr_user_files))
+			return -EBADF;
+		kiocb->ki_filp = ctx->user_files[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		kiocb->ki_filp = io_file_get(state, fd);
+		if (unlikely(!kiocb->ki_filp))
+			return -EBADF;
+		if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+			force_nonblock = false;
+	}
 	kiocb->ki_pos = READ_ONCE(sqe->off);
 	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
 	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
@@ -690,10 +719,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	}
 	return 0;
 out_fput:
-	/* in case of error, we didn't use this file reference. drop it. */
-	if (state)
-		state->used_refs--;
-	io_file_put(state, kiocb->ki_filp);
+	if (!(flags & IOSQE_FIXED_FILE)) {
+		/*
+		 * in case of error, we didn't use this file reference. drop it.
+		 */
+		if (state)
+			state->used_refs--;
+		io_file_put(state, kiocb->ki_filp);
+	}
 	return ret;
 }
 
@@ -825,7 +858,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 out_fput:
 	/* Hold on to the file for -EAGAIN */
 	if (unlikely(ret && ret != -EAGAIN))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -879,7 +912,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	kfree(iovec);
 out_fput:
 	if (unlikely(ret))
-		fput(file);
+		io_fput(req);
 	return ret;
 }
 
@@ -905,7 +938,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	loff_t sqe_off = READ_ONCE(sqe->off);
 	loff_t sqe_len = READ_ONCE(sqe->len);
 	loff_t end = sqe_off + sqe_len;
-	unsigned fsync_flags;
+	unsigned fsync_flags, flags;
 	struct file *file;
 	int ret, fd;
 
@@ -923,14 +956,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		return -EINVAL;
 
 	fd = READ_ONCE(sqe->fd);
-	file = fget(fd);
+	flags = READ_ONCE(sqe->flags);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->nr_user_files))
+			return -EBADF;
+		file = ctx->user_files[fd];
+	} else {
+		file = fget(fd);
+	}
 	if (unlikely(!file))
 		return -EBADF;
 
 	ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX,
 				fsync_flags & IORING_FSYNC_DATASYNC);
 
-	fput(file);
+	if (!(flags & IOSQE_FIXED_FILE))
+		fput(file);
 	io_cqring_add_event(ctx, sqe->user_data, ret, 0);
 	io_free_req(req);
 	return 0;
@@ -1067,7 +1109,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	ssize_t ret;
 
 	/* enforce forwards compatibility on users */
-	if (unlikely(s->sqe->flags))
+	if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE))
 		return -EINVAL;
 
 	req = io_get_req(ctx, state);
@@ -1255,6 +1297,151 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	return ring->r.head == ring->r.tail ? ret : 0;
 }
 
+static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+#if defined(CONFIG_UNIX)
+	if (ctx->ring_sock) {
+		struct sock *sock = ctx->ring_sock->sk;
+		struct sk_buff *skb;
+
+		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
+			kfree_skb(skb);
+	}
+#else
+	int i;
+
+	for (i = 0; i < ctx->nr_user_files; i++)
+		fput(ctx->user_files[i]);
+#endif
+}
+
+static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->user_files)
+		return -ENXIO;
+
+	__io_sqe_files_unregister(ctx);
+	kfree(ctx->user_files);
+	ctx->user_files = NULL;
+	return 0;
+}
+
+#if defined(CONFIG_UNIX)
+static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
+{
+	struct scm_fp_list *fpl;
+	struct sk_buff *skb;
+	int i;
+
+	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
+	if (!fpl)
+		return -ENOMEM;
+
+	skb = alloc_skb(0, GFP_KERNEL);
+	if (!skb) {
+		kfree(fpl);
+		return -ENOMEM;
+	}
+
+	skb->sk = ctx->ring_sock->sk;
+	skb->destructor = unix_destruct_scm;
+
+	fpl->user = get_uid(ctx->user);
+	for (i = 0; i < nr; i++) {
+		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
+		unix_inflight(fpl->user, fpl->fp[i]);
+		fput(fpl->fp[i]);
+	}
+
+	fpl->max = fpl->count = nr;
+	UNIXCB(skb).fp = fpl;
+	skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb);
+	return 0;
+}
+
+/*
+ * If UNIX sockets are enabled, fd passing can cause a reference cycle which
+ * causes regular reference counting to break down. We rely on the UNIX
+ * garbage collection to take care of this problem for us.
+ */
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	unsigned left, total;
+	int ret = 0;
+
+	total = 0;
+	left = ctx->nr_user_files;
+	while (left) {
+		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
+		int ret;
+
+		ret = __io_sqe_files_scm(ctx, this_files, total);
+		if (ret)
+			break;
+		left -= this_files;
+		total += this_files;
+	}
+
+	return ret;
+}
+#else
+static int io_sqe_files_scm(struct io_ring_ctx *ctx)
+{
+	return 0;
+}
+#endif
+
+static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
+				 unsigned nr_args)
+{
+	__s32 __user *fds = (__s32 __user *) arg;
+	int fd, ret = 0;
+	unsigned i;
+
+	if (ctx->user_files)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_FIXED_FILES)
+		return -EMFILE;
+
+	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
+	if (!ctx->user_files)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++) {
+		ret = -EFAULT;
+		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
+			break;
+
+		ctx->user_files[i] = fget(fd);
+
+		ret = -EBADF;
+		if (!ctx->user_files[i])
+			break;
+		/*
+		 * Don't allow io_uring instances to be registered. If UNIX
+		 * isn't enabled, then this causes a reference cycle and this
+		 * instance can never get freed. If UNIX is enabled we'll
+		 * handle it just fine, but there's still no point in allowing
+		 * a ring fd as it doesn't suppor regular read/write anyway.
+		 */
+		if (ctx->user_files[i]->f_op == &io_uring_fops) {
+			fput(ctx->user_files[i]);
+			break;
+		}
+		ctx->nr_user_files++;
+		ret = 0;
+	}
+
+	if (!ret)
+		ret = io_sqe_files_scm(ctx);
+	if (ret)
+		io_sqe_files_unregister(ctx);
+
+	return ret;
+}
+
 static int io_sq_offload_start(struct io_ring_ctx *ctx)
 {
 	int ret;
@@ -1521,14 +1708,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
 		mmdrop(ctx->sqo_mm);
+
+	io_iopoll_reap_events(ctx);
+	io_sqe_buffer_unregister(ctx);
+	io_sqe_files_unregister(ctx);
+
 #if defined(CONFIG_UNIX)
 	if (ctx->ring_sock)
 		sock_release(ctx->ring_sock);
 #endif
 
-	io_iopoll_reap_events(ctx);
-	io_sqe_buffer_unregister(ctx);
-
 	io_mem_free(ctx->sq_ring);
 	io_mem_free(ctx->sq_sqes);
 	io_mem_free(ctx->cq_ring);
@@ -1886,6 +2075,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_sqe_buffer_unregister(ctx);
 		break;
+	case IORING_REGISTER_FILES:
+		ret = io_sqe_files_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_FILES:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_files_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index cf28f7a11f12..6257478d55e9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -16,7 +16,7 @@
  */
 struct io_uring_sqe {
 	__u8	opcode;		/* type of operation for this sqe */
-	__u8	flags;		/* as of now unused */
+	__u8	flags;		/* IOSQE_ flags */
 	__u16	ioprio;		/* ioprio for the request */
 	__s32	fd;		/* file descriptor to do IO on */
 	__u64	off;		/* offset into file */
@@ -33,6 +33,11 @@ struct io_uring_sqe {
 	};
 };
 
+/*
+ * sqe->flags
+ */
+#define IOSQE_FIXED_FILE	(1U << 0)	/* use fixed fileset */
+
 /*
  * io_uring_setup() flags
  */
@@ -113,5 +118,7 @@ struct io_uring_params {
  */
 #define IORING_REGISTER_BUFFERS		0
 #define IORING_UNREGISTER_BUFFERS	1
+#define IORING_REGISTER_FILES		2
+#define IORING_UNREGISTER_FILES		3
 
 #endif
-- 
2.17.1
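
For reference, a minimal userspace sketch (not part of the patch) of how the
new registration opcode and IOSQE_FIXED_FILE are meant to be driven. It
assumes the <linux/io_uring.h> added by this series and a raw syscall
wrapper; the __NR_io_uring_register fallback below is the x86-64 number from
the series' syscall table and may differ elsewhere. Error handling is
trimmed for brevity.

/* Sketch only: register a fixed file set, then prepare an sqe using it. */
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/io_uring.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register	427	/* x86-64 value in this series */
#endif

static int sys_io_uring_register(int ring_fd, unsigned opcode, void *arg,
				 unsigned nr_args)
{
	return syscall(__NR_io_uring_register, ring_fd, opcode, arg, nr_args);
}

static int register_files(int ring_fd, __s32 *fds, unsigned nr)
{
	/*
	 * The kernel takes its own reference on each fd and holds it until
	 * IORING_UNREGISTER_FILES (or ring teardown).
	 */
	if (sys_io_uring_register(ring_fd, IORING_REGISTER_FILES, fds, nr) < 0) {
		perror("IORING_REGISTER_FILES");
		return -1;
	}
	return 0;
}

/*
 * With a fixed set registered, sqe->fd carries an index into that set
 * rather than a regular file descriptor, signalled by IOSQE_FIXED_FILE.
 */
static void prep_fixed_writev(struct io_uring_sqe *sqe,
			      const struct iovec *iov, unsigned nr_iov,
			      off_t offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_WRITEV;		/* vectored write request */
	sqe->flags = IOSQE_FIXED_FILE;		/* fd below is a set index */
	sqe->fd = 1;				/* second registered file */
	sqe->off = offset;
	sqe->addr = (unsigned long) iov;	/* iovec array, as writev(2) */
	sqe->len = nr_iov;			/* number of iovecs */
}

In practice liburing wraps the setup, registration, and sqe preparation
details; the sketch only shows what the raw interface expects. Per the code
above, IORING_UNREGISTER_FILES takes no argument (arg NULL, nr_args 0) and
drops the whole set again.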

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majordomo@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: <aart@kvack.org>

end of thread (newest message: 2019-03-08  9:12 UTC)

Thread overview: 128+ messages
2019-02-11 19:00 [PATCHSET v15] io_uring IO interface Jens Axboe
2019-02-11 19:00 ` Jens Axboe
2019-02-11 19:00 ` [PATCH 01/19] fs: add an iopoll method to struct file_operations Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH] io_uring: add io_uring_event cache hit information Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 02/19] block: wire up block device iopoll method Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 03/19] block: add bio_set_polled() helper Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 04/19] iomap: wire up the iopoll method Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 05/19] Add io_uring IO interface Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 06/19] io_uring: add fsync support Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 07/19] io_uring: support for IO polling Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 08/19] fs: add fget_many() and fput_many() Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 09/19] io_uring: use fget/fput_many() for file references Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 10/19] io_uring: batch io_kiocb allocation Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 11/19] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-20 22:58   ` Ming Lei
2019-02-20 22:58     ` Ming Lei
2019-02-21 17:45     ` Jens Axboe
2019-02-21 17:45       ` Jens Axboe
2019-02-26  3:46       ` Eric Biggers
2019-02-26  3:46         ` Eric Biggers
2019-02-26  4:34         ` Jens Axboe
2019-02-26  4:34           ` Jens Axboe
2019-02-26 15:54           ` Jens Axboe
2019-02-26 15:54             ` Jens Axboe
2019-02-27  1:21             ` Ming Lei
2019-02-27  1:21               ` Ming Lei
2019-02-27  1:47               ` Jens Axboe
2019-02-27  1:47                 ` Jens Axboe
2019-02-27  1:53                 ` Ming Lei
2019-02-27  1:53                   ` Ming Lei
2019-02-27  1:57                   ` Jens Axboe
2019-02-27  1:57                     ` Jens Axboe
2019-02-27  2:03                     ` Jens Axboe
2019-02-27  2:21                     ` Ming Lei
2019-02-27  2:21                       ` Ming Lei
2019-02-27  2:28                       ` Jens Axboe
2019-02-27  2:28                         ` Jens Axboe
2019-02-27  2:37                         ` Ming Lei
2019-02-27  2:37                           ` Ming Lei
2019-02-27  2:43                           ` Jens Axboe
2019-02-27  2:43                             ` Jens Axboe
2019-02-27  3:09                             ` Ming Lei
2019-02-27  3:09                               ` Ming Lei
2019-02-27  3:37                               ` Jens Axboe
2019-02-27  3:37                                 ` Jens Axboe
2019-02-27  3:43                                 ` Jens Axboe
2019-02-27  3:43                                   ` Jens Axboe
2019-02-27  3:44                                 ` Ming Lei
2019-02-27  3:44                                   ` Ming Lei
2019-02-27  4:05                                   ` Jens Axboe
2019-02-27  4:05                                     ` Jens Axboe
2019-02-27  4:06                                     ` Jens Axboe
2019-02-27  4:06                                       ` Jens Axboe
2019-02-27 19:42                                       ` Christoph Hellwig
2019-02-27 19:42                                         ` Christoph Hellwig
2019-02-28  8:37                                         ` Ming Lei
2019-02-28  8:37                                           ` Ming Lei
2019-02-27 23:35                         ` Ming Lei
2019-02-27 23:35                           ` Ming Lei
2019-03-08  7:55                         ` Christoph Hellwig
2019-03-08  7:55                           ` Christoph Hellwig
2019-03-08  9:12                           ` Ming Lei
2019-03-08  9:12                             ` Ming Lei
2019-03-08  8:18                     ` Christoph Hellwig
2019-03-08  8:18                       ` Christoph Hellwig
2019-02-11 19:00 ` [PATCH 12/19] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-19 19:08   ` Jann Horn
2019-02-19 19:08     ` Jann Horn
2019-02-22 22:29     ` Jens Axboe
2019-02-22 22:29       ` Jens Axboe
2019-02-11 19:00 ` [PATCH 13/19] net: split out functions related to registering inflight socket files Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 14/19] io_uring: add file set registration Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-19 16:12   ` Jann Horn
2019-02-19 16:12     ` Jann Horn
2019-02-22 22:29     ` Jens Axboe
2019-02-22 22:29       ` Jens Axboe
2019-02-11 19:00 ` [PATCH 15/19] io_uring: add submission polling Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 16/19] io_uring: add io_kiocb ref count Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 17/19] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 18/19] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-11 19:00 ` [PATCH 19/19] io_uring: add io_uring_event cache hit information Jens Axboe
2019-02-11 19:00   ` Jens Axboe
2019-02-21 12:10 ` [PATCHSET v15] io_uring IO interface Marek Majkowski
2019-02-21 12:10   ` Marek Majkowski
2019-02-21 17:48   ` Jens Axboe
2019-02-21 17:48     ` Jens Axboe
2019-02-22 15:01     ` Marek Majkowski
2019-02-22 15:01       ` Marek Majkowski
2019-02-22 22:32       ` Jens Axboe
2019-02-22 22:32         ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2019-02-09 21:13 [PATCHSET v14] " Jens Axboe
2019-02-09 21:13 ` [PATCH 14/19] io_uring: add file set registration Jens Axboe
2019-02-09 21:13   ` Jens Axboe
2019-02-09 23:52   ` Matt Mullins
2019-02-10  0:47     ` Jens Axboe
2019-02-10  0:47       ` Jens Axboe
2019-02-10  1:11       ` Matt Mullins
2019-02-10  2:34         ` Jens Axboe
2019-02-10  2:34           ` Jens Axboe
2019-02-10  2:57           ` Jens Axboe
2019-02-10  2:57             ` Jens Axboe
2019-02-10 19:55             ` Matt Mullins
2019-02-08 17:34 [PATCHSET v13] io_uring IO interface Jens Axboe
2019-02-08 17:34 ` [PATCH 14/19] io_uring: add file set registration Jens Axboe
2019-02-08 17:34   ` Jens Axboe
2019-02-08 20:26   ` Jann Horn
2019-02-08 20:26     ` Jann Horn
2019-02-09  0:16     ` Jens Axboe
2019-02-09  0:16       ` Jens Axboe
2019-02-09  9:50   ` Hannes Reinecke
2019-02-09  9:50     ` Hannes Reinecke
