From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCHSET v2] io_uring IO interface Date: Wed, 9 Jan 2019 19:43:49 -0700 Message-ID: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Here's v2 of the io_uring interface. See the v1 posting for some more info: https://lore.kernel.org/linux-block/20190108165645.19311-1-axboe@kernel.dk/ The data structures changed, to improve the symmetry of the submission and completion side. The io_uring_iocb is now io_uring_sqe, but it otherwise remains the same as before. Ditto on the completion side, where io_uring_event is now io_uring_cqe. I've updated the fio io_uring test app, and the io_uring engine. The liburing git repo has also been adapted to the various changes since the v1 posting. As a reminder, the liburing git repo contains some helpers for doing IO without having to muck with the ring directly, setting up an io_uring context, etc. Clone that here: git://git.kernel.dk/liburing In terms of usage, there's also a small test app here: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c and the liburing repo has a few test apps in test/ as well. Patches are aginst 5.0-rc1, but can also be found here: git://git.kernel.dk/linux-block io_uring Changes since v1: - Fail IORING_OP_{READ,WRITE}_FIXED if not configured - Fix ctx drop ref issue on failure to close ring_fd when sq thread/wq are in use - Move to separate Kconfig entry (CONFIG_IO_URING) - Add SPDX headers - Drop gcc ism of zero sized arrays - Rename io_uring_iocb -> io_uring_sqe - Rename io_uring_event -> io_uring_cqe - Drop needless io_event_ring and io_iocb_ring structures - Drop ctx->max_reqs, use ->sq_entries - Drop unused ->ring_lock - Drop io_ring_ctx slab cache - Fix state batched kiocb alloc failure to put ctx - Fix missing write ordering barrier when filling in the cqe - Drop io_req_init() - Various renames - Fix a few lines that were too long - Address other minor review comments - Fix IORING_SETUP_SQPOLL being set without IORING_SETUP_SQTHREAD - Drop IORING_SETUP_FIXEDBUFS, iovecs being non-NULL is enough - Fix error handling free of ctx in setup path - Change standard read/write commands to be iov based READV/WRITEV - Pass in struct sqe_submit instead of separate sqe/index everywhere - Fix reap of polled events on fops->release() - Lock uring for sq thread polling - Don't grab ->completion_lock for polled IO cqe filling - Fix ev_flags vs flags typo - Consolidate parts of the io_ring_ctx alignment Documentation/filesystems/vfs.txt | 3 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + block/bio.c | 59 +- fs/Makefile | 1 + fs/block_dev.c | 19 +- fs/file.c | 15 +- fs/file_table.c | 9 +- fs/gfs2/file.c | 2 + fs/io_uring.c | 1890 ++++++++++++++++++++++++ fs/iomap.c | 48 +- fs/xfs/xfs_file.c | 1 + include/linux/bio.h | 14 + include/linux/blk_types.h | 1 + include/linux/file.h | 2 + include/linux/fs.h | 6 +- include/linux/iomap.h | 1 + include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 114 ++ init/Kconfig | 8 + kernel/sys_ni.c | 2 + 20 files changed, 2163 insertions(+), 39 deletions(-) -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 02/15] block: wire up block device iopoll method Date: Wed, 9 Jan 2019 19:43:51 -0700 Message-ID: <20190110024404.25372-3-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: Christoph Hellwig Just call blk_poll on the iocb cookie, we can derive the block device from the inode trivially. Reviewed-by: Johannes Thumshirn Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/block_dev.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/block_dev.c b/fs/block_dev.c index c546cdce77e6..5415579f3e14 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -279,6 +279,14 @@ struct blkdev_dio { static struct bio_set blkdev_dio_pool; +static int blkdev_iopoll(struct kiocb *kiocb, bool wait) +{ + struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host); + struct request_queue *q = bdev_get_queue(bdev); + + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait); +} + static void blkdev_bio_end_io(struct bio *bio) { struct blkdev_dio *dio = bio->bi_private; @@ -396,6 +404,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_opf |= REQ_HIPRI; qc = submit_bio(bio); + WRITE_ONCE(iocb->ki_cookie, qc); break; } @@ -2068,6 +2077,7 @@ const struct file_operations def_blk_fops = { .llseek = block_llseek, .read_iter = blkdev_read_iter, .write_iter = blkdev_write_iter, + .iopoll = blkdev_iopoll, .mmap = generic_file_mmap, .fsync = blkdev_fsync, .unlocked_ioctl = block_ioctl, -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 01/15] fs: add an iopoll method to struct file_operations Date: Wed, 9 Jan 2019 19:43:50 -0700 Message-ID: <20190110024404.25372-2-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: Christoph Hellwig This new method is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is, with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct kiocb to store the polling cookie. TODO: we can probably union ki_cookie with the existing hint and I/O priority fields to avoid struct kiocb growth.
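As an illustration of that TODO, one possible shape (untested and not part of this series) would be an anonymous union overlaying the 32-bit cookie with the two 16-bit fields, keeping struct kiocb at its current size. Only the fields visible in the hunk below are shown:

struct kiocb {
	/* ... leading fields unchanged ... */
	int			ki_flags;
	union {
		struct {
			u16	ki_hint;
			u16	ki_ioprio;	/* See linux/ioprio.h */
		};
		unsigned int	ki_cookie;	/* for ->iopoll */
	};
} __randomize_layout;

Whether the hint and priority fields are really free for reuse by the time the cookie is written at submission is exactly what the TODO leaves open; nothing in this series depends on such a layout.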
Reviewed-by: Johannes Thumshirn Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- Documentation/filesystems/vfs.txt | 3 +++ include/linux/fs.h | 2 ++ 2 files changed, 5 insertions(+) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 8dc8e9c2913f..761c6fd24a53 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -857,6 +857,7 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll)(struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -902,6 +903,8 @@ otherwise noted. write_iter: possibly asynchronous write with iov_iter as source + iopoll: called when aio wants to poll for completions on HIPRI iocbs + iterate: called when the VFS needs to read the directory contents iterate_shared: called when the VFS needs to read the directory contents diff --git a/include/linux/fs.h b/include/linux/fs.h index 811c77743dad..ccb0b7a63aa5 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -310,6 +310,7 @@ struct kiocb { int ki_flags; u16 ki_hint; u16 ki_ioprio; /* See linux/ioprio.h */ + unsigned int ki_cookie; /* for ->iopoll */ } __randomize_layout; static inline bool is_sync_kiocb(struct kiocb *kiocb) @@ -1786,6 +1787,7 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll)(struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 03/15] block: add bio_set_polled() helper Date: Wed, 9 Jan 2019 19:43:52 -0700 Message-ID: <20190110024404.25372-4-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org For the upcoming async polled IO, we can't sleep allocating requests. If we do, then we introduce a deadlock where the submitter already has async polled IO in-flight, but can't wait for them to complete since polled requests must be active found and reaped. Utilize the helper in the blockdev DIRECT_IO code. 
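To make the constraint concrete, a submitter of async polled IO ends up running a loop of roughly the following shape. This is only a sketch with made-up names: submit_polled() and reap_polled_completions() stand in for the io_uring iopoll helpers added later in the series. A failed nowait request allocation surfaces as -EAGAIN/-EWOULDBLOCK, and forward progress comes from the submitter reaping its own completions before retrying, never from sleeping in the block layer.

/*
 * Rough sketch, illustrative names only. With REQ_NOWAIT set by
 * bio_set_polled() for async polled IO, a submission that cannot get
 * a request fails with -EAGAIN instead of blocking. Since polled
 * completions are only ever found by the submitter, it must reap its
 * own in-flight IO to free up requests, then retry.
 */
static int submit_polled_nowait(struct polled_ctx *ctx, struct polled_req *req)
{
	int ret;

	do {
		ret = submit_polled(ctx, req);
		if (ret != -EAGAIN)
			break;
		reap_polled_completions(ctx);
	} while (1);

	return ret;
}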
Signed-off-by: Jens Axboe --- fs/block_dev.c | 4 ++-- include/linux/bio.h | 14 ++++++++++++++ 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 5415579f3e14..2ebd2a0d7789 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, task_io_account_write(ret); } if (iocb->ki_flags & IOCB_HIPRI) - bio.bi_opf |= REQ_HIPRI; + bio_set_polled(&bio, iocb); qc = submit_bio(&bio); for (;;) { @@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); if (!nr_pages) { if (iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; + bio_set_polled(bio, iocb); qc = submit_bio(bio); WRITE_ONCE(iocb->ki_cookie, qc); diff --git a/include/linux/bio.h b/include/linux/bio.h index 7380b094dcca..f6f0a2b3cbc8 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page, #endif /* CONFIG_BLK_DEV_INTEGRITY */ +/* + * Mark a bio as polled. Note that for async polled IO, the caller must + * expect -EWOULDBLOCK if we cannot allocate a request (or other resources). + * We cannot block waiting for requests on polled IO, as those completions + * must be found by the caller. This is different than IRQ driven IO, where + * it's safe to wait for IO to complete. + */ +static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb) +{ + bio->bi_opf |= REQ_HIPRI; + if (!is_sync_kiocb(kiocb)) + bio->bi_opf |= REQ_NOWAIT; +} + #endif /* CONFIG_BLOCK */ #endif /* __LINUX_BIO_H */ -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 04/15] iomap: wire up the iopoll method Date: Wed, 9 Jan 2019 19:43:53 -0700 Message-ID: <20190110024404.25372-5-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org From: Christoph Hellwig Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig Modified to use bio_set_polled(). 
Signed-off-by: Jens Axboe --- fs/gfs2/file.c | 2 ++ fs/iomap.c | 43 ++++++++++++++++++++++++++++--------------- fs/xfs/xfs_file.c | 1 + include/linux/iomap.h | 1 + 4 files changed, 32 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index a2dea5bc0427..58a768e59712 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, @@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, diff --git a/fs/iomap.c b/fs/iomap.c index a3088fae567b..4ee50b76b4a1 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1454,6 +1454,28 @@ struct iomap_dio { }; }; +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin) +{ + struct request_queue *q = READ_ONCE(kiocb->private); + + if (!q) + return 0; + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin); +} +EXPORT_SYMBOL_GPL(iomap_dio_iopoll); + +static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap, + struct bio *bio) +{ + atomic_inc(&dio->ref); + + if (dio->iocb->ki_flags & IOCB_HIPRI) + bio_set_polled(bio, dio->iocb); + + dio->submit.last_queue = bdev_get_queue(iomap->bdev); + dio->submit.cookie = submit_bio(bio); +} + static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; @@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio) } } -static blk_qc_t +static void iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, unsigned len) { @@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - if (dio->iocb->ki_flags & IOCB_HIPRI) - flags |= REQ_HIPRI; - get_page(page); __bio_add_page(bio, page, len, 0); bio_set_op_attrs(bio, REQ_OP_WRITE, flags); - - atomic_inc(&dio->ref); - return submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } static loff_t @@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio_set_pages_dirty(bio); } - if (dio->iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; - iov_iter_advance(dio->submit.iter, n); dio->size += n; @@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, copied += n; nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES); - - atomic_inc(&dio->ref); - - dio->submit.last_queue = bdev_get_queue(iomap->bdev); - dio->submit.cookie = submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } while (nr_pages); /* @@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (dio->flags & IOMAP_DIO_WRITE_FUA) dio->flags &= ~IOMAP_DIO_NEED_SYNC; + WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); + WRITE_ONCE(iocb->private, dio->submit.last_queue); + if (!atomic_dec_and_test(&dio->ref)) { if (!dio->wait_for_completion) return -EIOCBQUEUED; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e47425071e65..60c2da41f0fc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = { .write_iter = xfs_file_write_iter, .splice_read = generic_file_splice_read, .splice_write = iter_file_splice_write, + .iopoll 
= iomap_dio_iopoll, .unlocked_ioctl = xfs_file_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = xfs_file_compat_ioctl, diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 9a4258154b25..0fefb5455bda 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret, unsigned flags); ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops, iomap_dio_end_io_t end_io); +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin); #ifdef CONFIG_SWAP struct file; -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 06/15] io_uring: support for IO polling Date: Wed, 9 Jan 2019 19:43:55 -0700 Message-ID: <20190110024404.25372-7-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Add support for polled read and write commands. These act like their non-polled counterparts, except we expect to poll for completion of them. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Signed-off-by: Jens Axboe --- fs/io_uring.c | 247 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 + 2 files changed, 239 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 0bad563f3486..c872bfb32a03 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -71,13 +71,17 @@ struct io_ring_ctx { struct completion ctx_done; + /* iopoll submission state */ struct { - struct mutex uring_lock; - wait_queue_head_t wait; + spinlock_t poll_lock; + struct list_head poll_submitted; } ____cacheline_aligned_in_smp; struct { + struct list_head poll_completing; spinlock_t completion_lock; + struct mutex uring_lock; + wait_queue_head_t wait; } ____cacheline_aligned_in_smp; }; @@ -97,9 +101,12 @@ struct io_kiocb { unsigned long ki_index; struct list_head ki_list; unsigned long ki_flags; +#define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ +#define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 struct sqe_submit { const struct io_uring_sqe *sqe; @@ -136,6 +143,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); init_waitqueue_head(&ctx->wait); + spin_lock_init(&ctx->poll_lock); + INIT_LIST_HEAD(&ctx->poll_submitted); + INIT_LIST_HEAD(&ctx->poll_completing); mutex_init(&ctx->uring_lock); return ctx; @@ -187,12 +197,151 @@ static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) wake_up(&ctx->wait); } +static void io_free_kiocb_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) +{ + if (*nr) { + kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs); + io_ring_drop_ctx_ref(ctx, *nr); + *nr = 0; + } +} + static void io_free_kiocb(struct io_kiocb *iocb) { kmem_cache_free(kiocb_cachep, iocb); io_ring_drop_ctx_ref(iocb->ki_ctx, 1); } +/* + * Find and free completed poll iocbs + */ +static void 
io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) +{ + void *iocbs[IO_IOPOLL_BATCH]; + struct io_kiocb *iocb, *n; + int to_free = 0; + + list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { + if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + continue; + if (to_free == ARRAY_SIZE(iocbs)) + io_free_kiocb_many(ctx, iocbs, &to_free); + + list_del(&iocb->ki_list); + iocbs[to_free++] = iocb; + + fput(iocb->rw.ki_filp); + (*nr_events)++; + } + + if (to_free) + io_free_kiocb_many(ctx, iocbs, &to_free); +} + +/* + * Poll for a mininum of 'min' events, and a maximum of 'max'. Note that if + * min == 0 we consider that a non-spinning poll check - we'll still enter + * the driver poll loop, but only as a non-spinning completion check. + */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + struct io_kiocb *iocb; + int found, polled, ret; + + /* + * Check if we already have done events that satisfy what we need + */ + if (!list_empty(&ctx->poll_completing)) { + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + } + + /* + * Take in a new working set from the submitted list, if possible. + */ + if (!list_empty_careful(&ctx->poll_submitted)) { + spin_lock(&ctx->poll_lock); + list_splice_init(&ctx->poll_submitted, &ctx->poll_completing); + spin_unlock(&ctx->poll_lock); + } + + if (list_empty(&ctx->poll_completing)) + return 0; + + /* + * Check again now that we have a new batch. + */ + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + + polled = found = 0; + list_for_each_entry(iocb, &ctx->poll_completing, ki_list) { + /* + * Poll for needed events with spin == true, anything after + * that we just check if we have more, up to max. + */ + bool spin = !polled || *nr_events < min; + struct kiocb *kiocb = &iocb->rw; + + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + break; + + found++; + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + return ret; + + polled += ret; + } + + io_iopoll_reap(ctx, nr_events); + if (*nr_events >= min) + return 0; + return found; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them. + */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + mutex_lock(&ctx->uring_lock); + while (!list_empty_careful(&ctx->poll_submitted) || + !list_empty(&ctx->poll_completing)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } + mutex_unlock(&ctx->uring_lock); +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + while (!*nr_events || !need_resched()) { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -208,18 +357,16 @@ static void kiocb_end_write(struct kiocb *kiocb) } } -static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, - long res, unsigned ev_flags) +static void __io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) { struct io_uring_cqe *cqe; - unsigned long flags; /* * If we can't get a cq entry, userspace overflowed the * submission (by quite a lot). Increment the overflow count in * the ring. 
*/ - spin_lock_irqsave(&ctx->completion_lock, flags); cqe = io_peek_cqring(ctx); if (cqe) { cqe->index = ki_index; @@ -229,6 +376,15 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, io_inc_cqring(ctx); } else ctx->cq_ring->overflow++; +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + __io_cqring_fill_event(ctx, ki_index, res, ev_flags); spin_unlock_irqrestore(&ctx->completion_lock, flags); } @@ -243,8 +399,23 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) io_free_kiocb(iocb); } +static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + if (unlikely(res == -EAGAIN)) { + set_bit(KIOCB_F_IOPOLL_EAGAIN, &iocb->ki_flags); + } else { + __io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); + set_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags); + } +} + static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) { + struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; @@ -266,12 +437,22 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) ret = kiocb_set_rw_flags(req, sqe->rw_flags); if (unlikely(ret)) goto out_fput; - if (req->ki_flags & IOCB_HIPRI) { - ret = -EINVAL; - goto out_fput; - } - req->ki_complete = io_complete_scqring_rw; + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(req->ki_flags & IOCB_DIRECT) || + !req->ki_filp->f_op->iopoll) + goto out_fput; + + req->ki_flags |= IOCB_HIPRI; + req->ki_complete = io_complete_scqring_iopoll; + } else { + if (req->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + req->ki_complete = io_complete_scqring_rw; + } return 0; out_fput: fput(req->ki_filp); @@ -298,6 +479,30 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } } +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. + */ +static void io_iopoll_kiocb_issued(struct io_kiocb *kiocb) +{ + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. We can't add to the poll_done + * list as that's unlocked from the completion side. 
+ */ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + + spin_lock(&ctx->poll_lock); + if (front) + list_add(&kiocb->ki_list, &ctx->poll_submitted); + else + list_add_tail(&kiocb->ki_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -400,6 +605,8 @@ static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, { struct fsync_iocb *req = &kiocb->fsync; + if (kiocb->ki_ctx->flags & IORING_SETUP_IOPOLL) + return -EINVAL; if (unlikely(sqe->addr || sqe->off || sqe->len || sqe->__resv)) return -EINVAL; @@ -461,6 +668,13 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) */ if (ret) goto out_put_req; + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (test_bit(KIOCB_F_IOPOLL_EAGAIN, &req->ki_flags)) { + ret = -EAGAIN; + goto out_put_req; + } + io_iopoll_kiocb_issued(req); + } return 0; out_put_req: io_free_kiocb(req); @@ -573,12 +787,17 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; int get_ret; if (!ret && to_submit) min_complete = 0; - get_ret = io_cqring_wait(ctx, min_complete); + if (ctx->flags & IORING_SETUP_IOPOLL) + get_ret = io_iopoll_check(ctx, &nr_events, + min_complete); + else + get_ret = io_cqring_wait(ctx, min_complete); if (get_ret < 0 && !ret) ret = get_ret; } @@ -604,6 +823,7 @@ static void io_free_scq_urings(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { + io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); @@ -612,6 +832,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { percpu_ref_kill(&ctx->refs); + io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); } @@ -815,7 +1036,7 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; if (iovecs) return -EINVAL; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ae30ed41965f..ba9e5b851f73 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -31,6 +31,11 @@ struct io_uring_sqe { }; }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ + #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 05/15] Add io_uring IO interface Date: Wed, 9 Jan 2019 19:43:54 -0700 Message-ID: <20190110024404.25372-6-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. 
This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_sqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered and can point to any io_uring_iocb. Two new system calls are added for this: io_uring_setup(entries, iovecs, params) Sets up a context for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_iocbs. io_uring_enter(fd, to_submit, min_complete, flags) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both The behavior is controlled by the parameters passed in. If 'min_complete' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 838 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 96 +++ init/Kconfig | 8 + kernel/sys_ni.c | 2 + 7 files changed, 952 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..0bad563f3486 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,838 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. 
+ * + * Copyright (C) 2019 Jens Axboe + */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + struct io_uring_sqe *sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct fsync_iocb { + struct work_struct work; + struct file *file; + bool datasync; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct fsync_iocb fsync; + }; + + struct io_ring_ctx *ki_ctx; + unsigned long ki_index; + struct list_head ki_list; + unsigned long ki_flags; +}; + +#define IO_PLUG_THRESHOLD 2 + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +static struct kmem_cache *kiocb_cachep; + +static const struct file_operations io_scqring_fops; + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + + init_completion(&ctx->ctx_done); + + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->cqes[tail & ctx->cq_mask]; +} + +static struct io_kiocb *io_get_kiocb(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + if (!req) + return NULL; + + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_kiocb(struct io_kiocb *iocb) +{ + kmem_cache_free(kiocb_cachep, iocb); + io_ring_drop_ctx_ref(iocb->ki_ctx, 1); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection 
from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + spin_lock_irqsave(&ctx->completion_lock, flags); + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->index = ki_index; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); + io_free_kiocb(iocb); +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +{ + struct kiocb *req = &kiocb->rw; + int ret; + + req->ki_filp = fget(sqe->fd); + if (unlikely(!req->ki_filp)) + return -EBADF; + req->ki_pos = sqe->off; + req->ki_flags = iocb_flags(req->ki_filp); + req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + req->ki_ioprio = sqe->ioprio; + } else + req->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(req, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (req->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + req->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(req->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. 
+ */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, sqe); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) + io_rw_done(req, call_read_iter(file, req, &iter)); + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, sqe); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. 
+ */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + req->ki_flags |= IOCB_WRITE; + io_rw_done(req, call_write_iter(file, req, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static void io_fsync_work(struct work_struct *work) +{ + struct fsync_iocb *req = container_of(work, struct fsync_iocb, work); + struct io_kiocb *iocb = container_of(req, struct io_kiocb, fsync); + int ret; + + ret = vfs_fsync(req->file, req->datasync); + fput(req->file); + + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, ret, 0); + io_free_kiocb(iocb); +} + +static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + bool datasync) +{ + struct fsync_iocb *req = &kiocb->fsync; + + if (unlikely(sqe->addr || sqe->off || sqe->len || sqe->__resv)) + return -EINVAL; + + req->file = fget(sqe->fd); + if (unlikely(!req->file)) + return -EBADF; + if (unlikely(!req->file->f_op->fsync)) { + fput(req->file); + return -EINVAL; + } + + req->datasync = datasync; + INIT_WORK(&req->work, io_fsync_work); + schedule_work(&req->work); + return 0; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + const struct io_uring_sqe *sqe = s->sqe; + struct io_kiocb *req; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags)) + return -EINVAL; + + req = io_get_kiocb(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = -EINVAL; + if (s->index >= ctx->sq_entries) + goto out_put_req; + req->ki_index = s->index; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_READV: + ret = io_read(req, sqe); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe); + break; + case IORING_OP_FSYNC: + ret = io_fsync(req, sqe, false); + break; + case IORING_OP_FDSYNC: + ret = io_fsync(req, sqe, true); + break; + default: + ret = -EINVAL; + break; + } + + /* + * If ret is 0, ->ki_complete() has either been called, or will get + * called later on. Anything else, we need to free the req. + */ + if (ret) + goto out_put_req; + return 0; +out_put_req: + io_free_kiocb(req); + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. 
+ */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_ref(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int 
io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. + */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications asks for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. 
+ */ +SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + if (iovecs) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + kiocb_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..6d40939f65cd 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, struct iovec __user *iov, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..ae30ed41965f --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,96 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include +#include + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; + __u8 flags; + __u16 ioprio; + __s32 fd; + __u64 off; + union { + void *addr; + __u64 __pad; + }; + __u32 len; + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; +}; + +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 +#define IORING_OP_FSYNC 3 +#define IORING_OP_FDSYNC 4 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 index; /* what sqe this event came from */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). 
Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..6fbb2f40e912 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,14 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and completion IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 07/15] io_uring: add submission side request cache Date: Wed, 9 Jan 2019 19:43:56 -0700 Message-ID: <20190110024404.25372-8-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org We have to add each submitted polled request to the io_ring_ctx poll_submitted list, which means we have to grab the poll_lock. We already use the block plug to batch submissions if we're doing a batch of IO submissions, extend that to cover the poll requests internally as well. Signed-off-by: Jens Axboe --- fs/io_uring.c | 122 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 106 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index c872bfb32a03..f7938156552f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -113,6 +113,21 @@ struct sqe_submit { unsigned index; }; +struct io_submit_state { + struct io_ring_ctx *ctx; + + struct blk_plug plug; +#ifdef CONFIG_BLOCK + struct blk_plug_cb plug_cb; +#endif + + /* + * Polled iocbs that have been submitted, but not added to the ctx yet + */ + struct list_head req_list; + unsigned int req_count; +}; + static struct kmem_cache *kiocb_cachep; static const struct file_operations io_scqring_fops; @@ -480,21 +495,29 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } /* - * After the iocb has been issued, it's safe to be found on the poll list. - * Adding the kiocb to the list AFTER submission ensures that we don't - * find it from a io_getevents() thread before the issuer is done accessing - * the kiocb cookie. + * Called either at the end of IO submission, or through a plug callback + * because we're going to schedule. Moves out local batch of requests to + * the ctx poll list, so they can be found for polling + reaping. 
*/ -static void io_iopoll_kiocb_issued(struct io_kiocb *kiocb) +static void io_flush_state_reqs(struct io_ring_ctx *ctx, + struct io_submit_state *state) { + spin_lock(&ctx->poll_lock); + list_splice_tail_init(&state->req_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + state->req_count = 0; +} + +static void io_iopoll_iocb_add_list(struct io_kiocb *kiocb) +{ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + /* * For fast devices, IO may have already completed. If it has, add * it to the front so we find it first. We can't add to the poll_done * list as that's unlocked from the completion side. */ - const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); - struct io_ring_ctx *ctx = kiocb->ki_ctx; - spin_lock(&ctx->poll_lock); if (front) list_add(&kiocb->ki_list, &ctx->poll_submitted); @@ -503,6 +526,33 @@ static void io_iopoll_kiocb_issued(struct io_kiocb *kiocb) spin_unlock(&ctx->poll_lock); } +static void io_iopoll_iocb_add_state(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags)) + list_add(&kiocb->ki_list, &state->req_list); + else + list_add_tail(&kiocb->ki_list, &state->req_list); + + if (++state->req_count >= IO_IOPOLL_BATCH) + io_flush_state_reqs(state->ctx, state); +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. + */ +static void io_iopoll_kiocb_issued(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (!state || !IS_ENABLED(CONFIG_BLOCK)) + io_iopoll_iocb_add_list(kiocb); + else + io_iopoll_iocb_add_state(state, kiocb); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -624,7 +674,8 @@ static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, return 0; } -static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, + struct io_submit_state *state) { const struct io_uring_sqe *sqe = s->sqe; struct io_kiocb *req; @@ -673,7 +724,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) ret = -EAGAIN; goto out_put_req; } - io_iopoll_kiocb_issued(req); + io_iopoll_kiocb_issued(state, req); } return 0; out_put_req: @@ -681,6 +732,43 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) return ret; } +#ifdef CONFIG_BLOCK +static void io_state_unplug(struct blk_plug_cb *cb, bool from_schedule) +{ + struct io_submit_state *state; + + state = container_of(cb, struct io_submit_state, plug_cb); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} +#endif + +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} + +/* + * Start submission side cache. 
+ */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx) +{ + state->ctx = ctx; + INIT_LIST_HEAD(&state->req_list); + state->req_count = 0; +#ifdef CONFIG_BLOCK + state->plug_cb.callback = io_state_unplug; + blk_start_plug(&state->plug); + list_add(&state->plug_cb.list, &state->plug.cb_list); +#endif +} + static void io_inc_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring; @@ -715,11 +803,13 @@ static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx); + statep = &state; + } for (i = 0; i < to_submit; i++) { struct sqe_submit s; @@ -727,7 +817,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!io_peek_sqring(ctx, &s)) break; - ret = io_submit_sqe(ctx, &s); + ret = io_submit_sqe(ctx, &s, statep); if (ret) break; @@ -735,8 +825,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) io_inc_sqring(ctx); } - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? submit : ret; } -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 08/15] fs: add fget_many() and fput_many() Date: Wed, 9 Jan 2019 19:43:57 -0700 Message-ID: <20190110024404.25372-9-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. 
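For reference, a rough sketch of the intended calling pattern (not part of the patch; the helper below and its name are made up, only fget_many()/fput_many() themselves come from this change):

#include <linux/file.h>
#include <linux/fs.h>

/*
 * Hypothetical submitter that knows it will issue up to 'nr' IOs against
 * the same fd: take all the references with one atomic op, hand one
 * reference to each IO that actually gets submitted, and return the
 * leftovers with one atomic op as well.
 */
static int submit_batch_to_one_fd(unsigned int fd, unsigned int nr)
{
	struct file *file;
	unsigned int used = 0;

	file = fget_many(fd, nr);		/* one atomic add of 'nr' */
	if (!file)
		return -EBADF;

	/*
	 * ... submission loop elided: each successfully submitted IO
	 * owns one reference and drops it with a plain fput() from its
	 * completion path, incrementing 'used' here ...
	 */

	if (used < nr)
		fput_many(file, nr - used);	/* one atomic sub */
	return 0;
}

Each IO still drops exactly one reference on completion; only the acquisition and the return of unused references are batched.
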
Signed-off-by: Jens Axboe --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..e0d7ce70e860 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index ccb0b7a63aa5..acaad78b6781 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count) -- 2.17.1 -- To unsubscribe, 
send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 09/15] io_uring: use fget/fput_many() for file references Date: Wed, 9 Jan 2019 19:43:58 -0700 Message-ID: <20190110024404.25372-10-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On the submission side, add file reference batching to the io_submit_state. We get as many references as the number of iocbs we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). Signed-off-by: Jens Axboe --- fs/io_uring.c | 105 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 92 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index f7938156552f..cd2dfc153338 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -126,6 +126,15 @@ struct io_submit_state { */ struct list_head req_list; unsigned int req_count; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; }; static struct kmem_cache *kiocb_cachep; @@ -234,7 +243,8 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) { void *iocbs[IO_IOPOLL_BATCH]; struct io_kiocb *iocb, *n; - int to_free = 0; + int file_count, to_free = 0; + struct file *file = NULL; list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) @@ -245,10 +255,27 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) list_del(&iocb->ki_list); iocbs[to_free++] = iocb; - fput(iocb->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable. + */ + if (!file) { + file = iocb->rw.ki_filp; + file_count = 1; + } else if (file == iocb->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = iocb->rw.ki_filp; + file_count = 1; + } + (*nr_events)++; } + if (file) + fput_many(file, file_count); + if (to_free) io_free_kiocb_many(ctx, iocbs, &to_free); } @@ -428,13 +455,60 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) } } -static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. 
+ */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (!state->file) { +get_file: + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; + } + + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + + io_file_put(state, NULL); + goto get_file; +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + struct io_submit_state *state) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; - req->ki_filp = fget(sqe->fd); + req->ki_filp = io_file_get(state, sqe->fd); if (unlikely(!req->ki_filp)) return -EBADF; req->ki_pos = sqe->off; @@ -470,7 +544,7 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) } return 0; out_fput: - fput(req->ki_filp); + io_file_put(state, req->ki_filp); return ret; } @@ -553,7 +627,8 @@ static void io_iopoll_kiocb_issued(struct io_submit_state *state, io_iopoll_iocb_add_state(state, kiocb); } -static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; void __user *buf = (void __user *) (uintptr_t) sqe->addr; @@ -562,7 +637,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe); + ret = io_prep_rw(kiocb, sqe, state); if (ret) return ret; file = req->ki_filp; @@ -588,7 +663,8 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) return ret; } -static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; void __user *buf = (void __user *) (uintptr_t) sqe->addr; @@ -597,7 +673,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe); + ret = io_prep_rw(kiocb, sqe, state); if (ret) return ret; file = req->ki_filp; @@ -697,10 +773,10 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, ret = -EINVAL; switch (sqe->opcode) { case IORING_OP_READV: - ret = io_read(req, sqe); + ret = io_read(req, sqe, state); break; case IORING_OP_WRITEV: - ret = io_write(req, sqe); + ret = io_write(req, sqe, state); break; case IORING_OP_FSYNC: ret = io_fsync(req, sqe, false); @@ -751,17 +827,20 @@ static void io_submit_state_end(struct io_submit_state *state) blk_finish_plug(&state->plug); if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); + io_file_put(state, NULL); } /* * Start submission side cache. 
*/ static void io_submit_state_start(struct io_submit_state *state, - struct io_ring_ctx *ctx) + struct io_ring_ctx *ctx, unsigned max_ios) { state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->file = NULL; + state->ios_left = max_ios; #ifdef CONFIG_BLOCK state->plug_cb.callback = io_state_unplug; blk_start_plug(&state->plug); @@ -807,7 +886,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) int i, ret = 0, submit = 0; if (to_submit > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx); + io_submit_state_start(&state, ctx, to_submit); statep = &state; } -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 11/15] block: implement bio helper to add iter bvec pages to bio Date: Wed, 9 Jan 2019 19:44:00 -0700 Message-ID: <20190110024404.25372-12-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. This requires that the caller doesn't releases the pages on IO completion, we add a BIO_HOLD_PAGES flag for that. The current two callers of bio_iov_iter_get_pages() are updated to check if they need to release pages on completion. This makes them work with bvecs that contain kernel mapped pages already. Signed-off-by: Jens Axboe --- block/bio.c | 59 ++++++++++++++++++++++++++++++++------- fs/block_dev.c | 5 ++-- fs/iomap.c | 5 ++-- include/linux/blk_types.h | 1 + 4 files changed, 56 insertions(+), 14 deletions(-) diff --git a/block/bio.c b/block/bio.c index 4db1008309ed..7af4f45d2ed6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + const struct bio_vec *bv = iter->bvec; + unsigned int len; + size_t size; + + len = min_t(size_t, bv->bv_len, iter->count); + size = bio_add_page(bio, bv->bv_page, len, + bv->bv_offset + iter->iov_offset); + if (size == len) { + iov_iter_advance(iter, size); + return 0; + } + + return -EINVAL; +} + #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) /** @@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } /** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * bio_iov_iter_get_pages - add user or kernel pages to a bio * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped + * @iter: iov iterator describing the region to be added + * + * This takes either an iterator pointing to user memory, or one pointing to + * kernel pages (BVEC iterator). If we're adding user pages, we pin them and + * map them into the kernel. On IO completion, the caller should put those + * pages. If we're adding kernel pages, we just have to add the pages to the + * bio directly. 
We don't grab an extra reference to those pages (the user + * should already have that), and we don't put the page on IO completion. + * The caller needs to check if the bio is flagged BIO_HOLD_PAGES on IO + * completion. If it isn't, then pages should be released. * - * Pins pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. * The function tries, but does not guarantee, to pin as many pages as - * fit into the bio, or are requested in *iter, whatever is smaller. - * If MM encounters an error pinning the requested pages, it stops. - * Error is returned only if 0 pages could be pinned. + * fit into the bio, or are requested in *iter, whatever is smaller. If + * MM encounters an error pinning the requested pages, it stops. Error + * is returned only if 0 pages could be pinned. */ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { + const bool is_bvec = iov_iter_is_bvec(iter); unsigned short orig_vcnt = bio->bi_vcnt; + /* + * If this is a BVEC iter, then the pages are kernel pages. Don't + * release them on IO completion. + */ + if (is_bvec) + bio_set_flag(bio, BIO_HOLD_PAGES); + do { - int ret = __bio_iov_iter_get_pages(bio, iter); + int ret; + + if (is_bvec) + ret = __bio_iov_bvec_add_pages(bio, iter); + else + ret = __bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) return bio->bi_vcnt > orig_vcnt ? 0 : ret; @@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/fs/block_dev.c b/fs/block_dev.c index 2ebd2a0d7789..b7742014c9de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -324,8 +324,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/fs/iomap.c b/fs/iomap.c index 4ee50b76b4a1..0a64c9c51203 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5c7e7f859a24..97e206855cd3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,6 +215,7 @@ struct bio { /* * bio flags */ +#define BIO_HOLD_PAGES 0 /* don't put O_DIRECT pages */ #define BIO_SEG_VALID 1 /* bi_phys_segments valid */ #define BIO_CLONED 2 /* doesn't own data */ #define BIO_BOUNCED 3 /* bio is a bounce bio */ -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
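To make the intended use of the ITER_BVEC path above a bit more concrete: a caller that already holds a pinned page array, described by a bio_vec table built at setup time, could feed it to bio_iov_iter_get_pages() roughly as below. This is only a sketch; the function name and the pre-built bvec table are hypothetical, while the bio/iov_iter calls are existing kernel APIs plus the BIO_HOLD_PAGES flag from this patch.

#include <linux/bio.h>
#include <linux/uio.h>
#include <linux/err.h>

/*
 * Hypothetical caller: 'bvec' describes pages that were pinned earlier
 * (e.g. when a context was set up), so bio_iov_iter_get_pages() adds
 * them directly and marks the bio BIO_HOLD_PAGES. The completion
 * handler must then skip the put_page() loop for this bio.
 */
static struct bio *build_bvec_read_bio(struct block_device *bdev,
				       struct bio_vec *bvec,
				       unsigned int nr_bvecs, size_t len,
				       sector_t sector)
{
	struct iov_iter iter;
	struct bio *bio;
	int ret;

	iov_iter_bvec(&iter, READ, bvec, nr_bvecs, len);

	bio = bio_alloc(GFP_KERNEL, nr_bvecs);
	if (!bio)
		return ERR_PTR(-ENOMEM);
	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_READ;

	ret = bio_iov_iter_get_pages(bio, &iter);
	if (ret) {
		bio_put(bio);
		return ERR_PTR(ret);
	}
	return bio;
}

No get_user_pages() happens on this path, which is what keeps the pre-mapped buffer support elsewhere in this series cheap on a per-IO basis.
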
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 10/15] io_uring: batch io_kiocb allocation Date: Wed, 9 Jan 2019 19:43:59 -0700 Message-ID: <20190110024404.25372-11-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. Signed-off-by: Jens Axboe --- fs/io_uring.c | 71 +++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 52 insertions(+), 19 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index cd2dfc153338..b5233786b5a8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -127,6 +127,13 @@ struct io_submit_state { struct list_head req_list; unsigned int req_count; + /* + * io_kiocb alloc cache + */ + void *kiocbs[IO_IOPOLL_BATCH]; + unsigned int free_kiocbs; + unsigned int cur_kiocb; + /* * File reference cache */ @@ -196,36 +203,58 @@ static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) return &ring->cqes[tail & ctx->cq_mask]; } -static struct io_kiocb *io_get_kiocb(struct io_ring_ctx *ctx) +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static struct io_kiocb *io_get_kiocb(struct io_ring_ctx *ctx, + struct io_submit_state *state) { struct io_kiocb *req; if (!percpu_ref_tryget(&ctx->refs)) return NULL; - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); - if (!req) - return NULL; - - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; - return req; -} + if (!state) + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + else if (!state->free_kiocbs) { + size_t sz; + int ret; + + sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->kiocbs)); + ret = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, sz, + state->kiocbs); + if (ret <= 0) + goto out; + state->free_kiocbs = ret - 1; + state->cur_kiocb = 1; + req = state->kiocbs[0]; + } else { + req = state->kiocbs[state->cur_kiocb]; + state->free_kiocbs--; + state->cur_kiocb++; + } -static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) -{ - percpu_ref_put_many(&ctx->refs, refs); + if (req) { + req->ki_ctx = ctx; + req->ki_flags = 0; + return req; + } - if (waitqueue_active(&ctx->wait)) - wake_up(&ctx->wait); +out: + io_ring_drop_ctx_refs(ctx, 1); + return NULL; } static void io_free_kiocb_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) { if (*nr) { kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs); - io_ring_drop_ctx_ref(ctx, *nr); + io_ring_drop_ctx_refs(ctx, *nr); *nr = 0; } } @@ -233,7 +262,7 @@ static void io_free_kiocb_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) static void io_free_kiocb(struct io_kiocb *iocb) { kmem_cache_free(kiocb_cachep, iocb); - io_ring_drop_ctx_ref(iocb->ki_ctx, 1); + io_ring_drop_ctx_refs(iocb->ki_ctx, 1); } /* @@ -761,7 +790,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(sqe->flags)) return -EINVAL; - req = io_get_kiocb(ctx); + req = io_get_kiocb(ctx, state); if 
(unlikely(!req)) return -EAGAIN; @@ -828,6 +857,9 @@ static void io_submit_state_end(struct io_submit_state *state) if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); io_file_put(state, NULL); + if (state->free_kiocbs) + kmem_cache_free_bulk(kiocb_cachep, state->free_kiocbs, + &state->kiocbs[state->cur_kiocb]); } /* @@ -839,6 +871,7 @@ static void io_submit_state_start(struct io_submit_state *state, state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->free_kiocbs = 0; state->file = NULL; state->ios_left = max_ios; #ifdef CONFIG_BLOCK @@ -1071,7 +1104,7 @@ SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, ret = __io_uring_enter(ctx, to_submit, min_complete, flags); mutex_unlock(&ctx->uring_lock); } - io_ring_drop_ctx_ref(ctx, 1); + io_ring_drop_ctx_refs(ctx, 1); out_fput: fdput(f); return ret; -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 12/15] io_uring: add support for pre-mapped user IO buffers Date: Wed, 9 Jan 2019 19:44:01 -0700 Message-ID: <20190110024404.25372-13-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org If we have fixed user buffers, we can map them into the kernel when we setup the io_context. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must pass in an array of iovecs that contain the desired buffer addresses and lengths. These buffers can then be mapped into the kernel for the life time of the io_uring, as opposed to just the duration of the each single IO. The application can then use the IORING_OP_{READ,WRITE}_FIXED to perform IO to these fixed locations. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. A limit of 4M is imposed as the largest buffer we currently support. There's nothing preventing us from going larger, but we need some cap, and 4M seemed like it would definitely be big enough. RLIMIT_MEMLOCK is used to cap the total amount of memory pinned. 
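For illustration, the registration step could look roughly like this from userspace. The helper name is made up and a thin io_uring_setup() wrapper around the new three-argument system call is assumed; in this version the kernel expects one iovec per SQ entry, pinned for the lifetime of the ring.

#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>	/* uapi header added by this series */

/* assumed thin wrapper around the io_uring_setup(2) system call */
extern int io_uring_setup(unsigned entries, struct iovec *iovecs,
			  struct io_uring_params *p);

#define BUF_SIZE	(64 * 1024)	/* anything up to the 4MB cap */

/*
 * Hypothetical helper: allocate one fixed buffer per SQ entry and
 * register them all at ring creation, so the kernel pins the pages once
 * instead of once per IO. 'iovecs' must have room for 'entries' entries.
 */
static int setup_ring_with_fixed_bufs(unsigned entries,
				      struct io_uring_params *p,
				      struct iovec *iovecs)
{
	unsigned i;

	memset(p, 0, sizeof(*p));
	for (i = 0; i < entries; i++) {
		if (posix_memalign(&iovecs[i].iov_base, 4096, BUF_SIZE))
			return -1;
		iovecs[i].iov_len = BUF_SIZE;
	}
	/* returns the ring fd; buffers stay mapped until the ring goes away */
	return io_uring_setup(entries, iovecs, p);
}

An IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED sqe placed in slot i then points sqe->addr/sqe->len at any sub-range of iovecs[i], matching the range check done in io_import_fixed().
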
Signed-off-by: Jens Axboe --- fs/io_uring.c | 202 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 2 + 2 files changed, 196 insertions(+), 8 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index b5233786b5a8..7ab20258e39b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include #include @@ -53,6 +55,13 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -69,6 +78,9 @@ struct io_ring_ctx { unsigned cq_entries; unsigned cq_mask; + /* if used, fixed mapped user buffers */ + struct io_mapped_ubuf *user_bufs; + struct completion ctx_done; /* iopoll submission state */ @@ -656,11 +668,42 @@ static void io_iopoll_kiocb_issued(struct io_submit_state *state, io_iopoll_iocb_add_state(state, kiocb); } +static int io_import_fixed(int rw, struct io_kiocb *kiocb, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + struct io_ring_ctx *ctx = kiocb->ki_ctx; + struct io_mapped_ubuf *imu; + size_t len = sqe->len; + size_t offset; + int index; + + /* attempt to use fixed buffers without having provided iovecs */ + if (!ctx->user_bufs) + return -EFAULT; + + /* io_submit_sqe() already validated the index */ + index = array_index_nospec(kiocb->ki_index, ctx->sq_entries); + imu = &ctx->user_bufs[index]; + if ((unsigned long) sqe->addr < imu->ubuf || + (unsigned long) sqe->addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. + */ + offset = (unsigned long) sqe->addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - void __user *buf = (void __user *) (uintptr_t) sqe->addr; struct kiocb *req = &kiocb->rw; struct iov_iter iter; struct file *file; @@ -678,7 +721,15 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, if (unlikely(!file->f_op->read_iter)) goto out_fput; - ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (sqe->opcode == IORING_OP_READV) { + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, + &iter); + } else { + ret = io_import_fixed(READ, kiocb, sqe, &iter); + iovec = NULL; + } if (ret) goto out_fput; @@ -696,7 +747,6 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - void __user *buf = (void __user *) (uintptr_t) sqe->addr; struct kiocb *req = &kiocb->rw; struct iov_iter iter; struct file *file; @@ -714,7 +764,14 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, if (unlikely(!file->f_op->write_iter)) goto out_fput; - ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (sqe->opcode == IORING_OP_WRITEV) { + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + } else { + ret = io_import_fixed(WRITE, kiocb, sqe, &iter); + iovec = NULL; + } if (ret) goto out_fput; @@ -802,9 +859,11 @@ static int 
io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, ret = -EINVAL; switch (sqe->opcode) { case IORING_OP_READV: + case IORING_OP_READ_FIXED: ret = io_read(req, sqe, state); break; case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: ret = io_write(req, sqe, state); break; case IORING_OP_FSYNC: @@ -1007,6 +1066,127 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } +static void io_sqe_buffer_unmap(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return; + + for (i = 0; i < ctx->sq_entries; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; +} + +static int io_sqe_buffer_map(struct io_ring_ctx *ctx, + struct iovec __user *iovecs) +{ + unsigned long total_pages, page_limit; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + ctx->user_bufs = kcalloc(ctx->sq_entries, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + /* Don't allow more pages than we can safely lock */ + total_pages = 0; + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + for (i = 0; i < ctx->sq_entries; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = -EFAULT; + if (copy_from_user(&iov, &iovecs[i], sizeof(iov))) + goto err; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. + */ + ret = -EFAULT; + if (!iov.iov_base) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_4M) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = -ENOMEM; + if (total_pages + nr_pages > page_limit) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(pages); + pages = kmalloc(nr_pages * sizeof(struct page *), + GFP_KERNEL); + if (!pages) + goto err; + got_pages = nr_pages; + } + + imu->bvec = kmalloc(nr_pages * sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) + goto err; + + down_write(¤t->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, 1, pages, NULL); + up_write(¤t->mm->mmap_sem); + + if (pret < nr_pages) { + if (pret < 0) + ret = pret; + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + total_pages += nr_pages; + } + kfree(pages); + return 0; +err: + kfree(pages); + io_sqe_buffer_unmap(ctx); + return ret; +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring) { @@ -1027,6 +1207,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_sqe_buffer_unmap(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); } @@ -1185,7 +1366,8 @@ static void io_fill_offsets(struct io_uring_params *p) p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); } -static int io_uring_create(unsigned entries, 
struct io_uring_params *p) +static int io_uring_create(unsigned entries, struct io_uring_params *p, + struct iovec __user *iovecs) { struct io_ring_ctx *ctx; int ret; @@ -1207,6 +1389,12 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret) goto err; + if (iovecs) { + ret = io_sqe_buffer_map(ctx, iovecs); + if (ret) + goto err; + } + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); if (ret < 0) @@ -1240,10 +1428,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; - if (iovecs) - return -EINVAL; - ret = io_uring_create(entries, &p); + ret = io_uring_create(entries, &p, iovecs); if (ret < 0) return ret; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ba9e5b851f73..80d1a8224b9c 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -40,6 +40,8 @@ struct io_uring_sqe { #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 #define IORING_OP_FDSYNC 4 +#define IORING_OP_READ_FIXED 5 +#define IORING_OP_WRITE_FIXED 6 /* * IO completion data structure (Completion Queue Entry) -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 13/15] io_uring: support kernel side submission Date: Wed, 9 Jan 2019 19:44:02 -0700 Message-ID: <20190110024404.25372-14-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Add support for backing the io_uring fd with either a thread, or a workqueue and letting those handle the submission for us. This can be used to reduce overhead for submission, or to always make submission async. The latter is particularly useful for buffered aio, which is now fully async with this feature. For polled IO, we could have the kernel side thread hammer on the SQ ring and submit when it finds IO. This would mean that an application would NEVER have to enter the kernel to do IO! Didn't add this yet, but it would be trivial to add. If an application sets IORING_SETUP_SCQTHREAD, the io_uring gets a single thread backing. If used with buffered IO, this will limit the device queue depth to 1, but it will be async, IOs will simply be serialized. Or an application can set IORING_SETUP_SQWQ, in which case the urings get a work queue backing. The concurrency level is the mininum of twice the available CPUs, or the queue depth specific for the context. For this mode, we attempt to do buffered reads inline, in case they are cached. So we should only punt to a workqueue, if we would have to block to get our data. Tested with polling, no polling, fixedbufs, no fixedbufs, buffered, O_DIRECT. 
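As a minimal sketch of how an application opts into either mode: the helper below is hypothetical and a thin io_uring_setup() wrapper around the new system call is assumed; sq_thread_cpu is the params field added by this patch.

#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>	/* uapi header added by this series */

/* assumed thin wrapper around the io_uring_setup(2) system call */
extern int io_uring_setup(unsigned entries, struct iovec *iovecs,
			  struct io_uring_params *p);

/*
 * Hypothetical helper: create a ring whose submissions are driven by a
 * single kernel thread pinned to 'cpu', or by a workqueue when 'use_wq'
 * is set (better suited to buffered IO at queue depth > 1).
 */
static int setup_offloaded_ring(unsigned entries, int cpu, int use_wq)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	if (use_wq) {
		p.flags = IORING_SETUP_SQWQ;
	} else {
		p.flags = IORING_SETUP_SQTHREAD;
		p.sq_thread_cpu = cpu;
	}

	/* no fixed buffers in this sketch, hence NULL iovecs */
	return io_uring_setup(entries, NULL, &p);
}

With either flag set, io_uring_enter(2) no longer submits inline; it just wakes the thread or queues work items for the sqes the application has filled in.
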
See this sample application for how to use it: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe --- fs/io_uring.c | 378 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 +- 2 files changed, 369 insertions(+), 14 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 7ab20258e39b..da46872ecd67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -25,6 +26,8 @@ #include #include #include +#include +#include #include #include @@ -62,6 +65,14 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; }; +struct io_sq_offload { + struct task_struct *thread; /* if using a thread */ + struct workqueue_struct *wq; /* wq offload */ + struct mm_struct *mm; + struct files_struct *files; + wait_queue_head_t wait; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -71,6 +82,7 @@ struct io_ring_ctx { struct io_sq_ring *sq_ring; unsigned sq_entries; unsigned sq_mask; + unsigned sq_thread_cpu; struct io_uring_sqe *sq_sqes; /* CQ ring */ @@ -81,6 +93,9 @@ struct io_ring_ctx { /* if used, fixed mapped user buffers */ struct io_mapped_ubuf *user_bufs; + /* sq ring submitter thread, if used */ + struct io_sq_offload sq_offload; + struct completion ctx_done; /* iopoll submission state */ @@ -115,6 +130,7 @@ struct io_kiocb { unsigned long ki_flags; #define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ #define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ +#define KIOCB_F_FORCE_NONBLOCK 2 /* inline submission attempt */ }; #define IO_PLUG_THRESHOLD 2 @@ -125,6 +141,12 @@ struct sqe_submit { unsigned index; }; +struct io_work { + struct work_struct work; + struct io_ring_ctx *ctx; + struct sqe_submit submit; +}; + struct io_submit_state { struct io_ring_ctx *ctx; @@ -471,6 +493,20 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, spin_unlock_irqrestore(&ctx->completion_lock, flags); } +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + /* + * for thread offload, app could already be sleeping in io_ring_enter() + * before we get to flag the error. wake them up, if needed. 
+ */ + if (ctx->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); @@ -543,7 +579,7 @@ static struct file *io_file_get(struct io_submit_state *state, int fd) } static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; @@ -567,6 +603,10 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, ret = kiocb_set_rw_flags(req, sqe->rw_flags); if (unlikely(ret)) goto out_fput; + if (force_nonblock) { + req->ki_flags |= IOCB_NOWAIT; + set_bit(KIOCB_F_FORCE_NONBLOCK, &kiocb->ki_flags); + } if (ctx->flags & IORING_SETUP_IOPOLL) { ret = -EOPNOTSUPP; @@ -701,7 +741,7 @@ static int io_import_fixed(int rw, struct io_kiocb *kiocb, } static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, - struct io_submit_state *state) + struct io_submit_state *state, bool nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -709,7 +749,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe, state); + ret = io_prep_rw(kiocb, sqe, state, nonblock); if (ret) return ret; file = req->ki_filp; @@ -734,8 +774,18 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, goto out_fput; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); - if (!ret) - io_rw_done(req, call_read_iter(file, req, &iter)); + if (!ret) { + ssize_t ret2; + + /* + * Catch -EAGAIN return for forced non-blocking submission + */ + ret2 = call_read_iter(file, req, &iter); + if (!nonblock || ret2 != -EAGAIN) + io_rw_done(req, ret2); + else + ret = -EAGAIN; + } kfree(iovec); out_fput: if (unlikely(ret)) @@ -752,7 +802,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe, state); + ret = io_prep_rw(kiocb, sqe, state, false); if (ret) return ret; file = req->ki_filp; @@ -837,7 +887,7 @@ static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, } static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { const struct io_uring_sqe *sqe = s->sqe; struct io_kiocb *req; @@ -860,7 +910,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, switch (sqe->opcode) { case IORING_OP_READV: case IORING_OP_READ_FIXED: - ret = io_read(req, sqe, state); + ret = io_read(req, sqe, state, force_nonblock); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: @@ -988,7 +1038,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!io_peek_sqring(ctx, &s)) break; - ret = io_submit_sqe(ctx, &s, statep); + ret = io_submit_sqe(ctx, &s, statep, false); if (ret) break; @@ -1037,15 +1087,237 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) return ring->r.head == ring->r.tail ? 
ret : 0; } +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_work *iw = container_of(work, struct io_work, work); + struct io_ring_ctx *ctx = iw->ctx; + struct io_sq_offload *sqo = &ctx->sq_offload; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + old_files = current->files; + current->files = sqo->files; + + if (sqo->mm) { + if (!mmget_not_zero(sqo->mm)) { + ret = -EFAULT; + goto err; + } + use_mm(sqo->mm); + } + + set_fs(USER_DS); + + ret = io_submit_sqe(ctx, &iw->submit, NULL, false); + + set_fs(old_fs); + if (sqo->mm) { + unuse_mm(sqo->mm); + mmput(sqo->mm); + } + +err: + if (ret) + io_fill_cq_error(ctx, &iw->submit, ret); + current->files = old_files; + kfree(iw); +} + +static int io_queue_async_work(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_work *work; + + work = kmalloc(sizeof(*work), GFP_KERNEL); + if (!work) + return -ENOMEM; + + memcpy(&work->submit, s, sizeof(*s)); + INIT_WORK(&work->work, io_sq_wq_submit_work); + work->ctx = ctx; + queue_work(ctx->sq_offload.wq, &work->work); + return 0; +} + +static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, + unsigned int nr, struct mm_struct *cur_mm, + bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) + ret = -EFAULT; + else + ret = io_submit_sqe(ctx, &sqes[i], statep, false); + if (!ret) { + submitted++; + continue; + } + + io_fill_cq_error(ctx, &sqes[i], ret); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +/* + * sq thread only supports O_DIRECT or FIXEDBUFS IO + */ +static int io_sq_thread(void *data) +{ + struct sqe_submit sqes[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct io_sq_offload *sqo = &ctx->sq_offload; + struct mm_struct *cur_mm = NULL; + struct files_struct *old_files; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + + old_files = current->files; + current->files = sqo->files; + + old_fs = get_fs(); + set_fs(USER_DS); + + while (!kthread_should_stop()) { + bool mm_fault = false; + int i; + + if (!io_peek_sqring(ctx, &sqes[0])) { + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. + */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + if (!io_peek_sqring(ctx, &sqes[0])) { + if (kthread_should_park()) + kthread_parkme(); + if (kthread_should_stop()) { + finish_wait(&sqo->wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + finish_wait(&sqo->wait, &wait); + continue; + } + finish_wait(&sqo->wait, &wait); + } + + /* If ->mm is set, we're not doing FIXEDBUFS */ + if (sqo->mm && !cur_mm) { + mm_fault = !mmget_not_zero(sqo->mm); + if (!mm_fault) { + use_mm(sqo->mm); + cur_mm = sqo->mm; + } + } + + i = 0; + do { + if (i == ARRAY_SIZE(sqes)) + break; + i++; + io_inc_sqring(ctx); + } while (io_peek_sqring(ctx, &sqes[i])); + + io_submit_sqes(ctx, sqes, i, cur_mm, mm_fault); + } + current->files = old_files; + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + +/* + * If this is a read, try a cached inline read first. 
If the IO is in the + * page cache, we can satisfy it without blocking and without having to + * punt to a threaded execution. This is much faster, particularly for + * lower queue depth IO, and it's always a lot more efficient. + */ +static bool io_sq_try_inline(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + int ret; + + if (s->sqe->opcode != IORING_OP_READV && + s->sqe->opcode != IORING_OP_READ_FIXED) + return false; + + ret = io_submit_sqe(ctx, s, NULL, true); + + /* + * If we get -EAGAIN, return false to submit out-of-line. Any other + * result and we're done, caller will fill in CQ ring event. + */ + return ret != -EAGAIN; +} + +static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + struct sqe_submit s; + int ret, queued; + + ret = queued = 0; + while (io_peek_sqring(ctx, &s)) { + ret = io_sq_try_inline(ctx, &s); + if (!ret) { + ret = io_queue_async_work(ctx, &s); + if (ret) + break; + } + io_inc_sqring(ctx); + queued++; + if (queued == to_submit) + break; + } + + return queued ? queued : ret; +} + static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, unsigned min_complete, unsigned flags) { int ret = 0; if (to_submit) { - ret = io_ring_submit(ctx, to_submit); - if (ret < 0) - return ret; + /* + * Three options here: + * 1) We have an sq thread, just wake it up to do submissions + * 2) We have an sq wq, queue a work item for each sqe + * 3) Submit directly + */ + if (ctx->flags & IORING_SETUP_SQTHREAD) { + wake_up(&ctx->sq_offload.wait); + ret = to_submit; + } else if (ctx->flags & IORING_SETUP_SQWQ) { + ret = io_sq_wq_submit(ctx, to_submit); + } else { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -1187,6 +1459,77 @@ static int io_sqe_buffer_map(struct io_ring_ctx *ctx, return ret; } +static int io_sq_thread(void *); + +static int io_sq_thread_start(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + int ret; + + memset(sqo, 0, sizeof(*sqo)); + init_waitqueue_head(&sqo->wait); + + if (!ctx->user_bufs) + sqo->mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. 
+ */ + ret = -EBADF; + sqo->files = current->files; + if (!sqo->files) + goto err; + + if (ctx->flags & IORING_SETUP_SQTHREAD) { + sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, + ctx->sq_thread_cpu, + "io_uring-sq"); + if (IS_ERR(sqo->thread)) { + ret = PTR_ERR(sqo->thread); + sqo->thread = NULL; + goto err; + } + wake_up_process(sqo->thread); + } else if (ctx->flags & IORING_SETUP_SQWQ) { + int concurrency; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + concurrency = min(ctx->sq_entries - 1, 2 * num_online_cpus()); + sqo->wq = alloc_workqueue("io_ring-wq", + WQ_UNBOUND | WQ_FREEZABLE, + concurrency); + if (!sqo->wq) { + ret = -ENOMEM; + goto err; + } + } + + return 0; +err: + if (sqo->files) + sqo->files = NULL; + if (sqo->mm) + sqo->mm = NULL; + return ret; +} + +static void io_sq_thread_stop(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + + if (sqo->thread) { + kthread_park(sqo->thread); + kthread_stop(sqo->thread); + sqo->thread = NULL; + } else if (sqo->wq) { + destroy_workqueue(sqo->wq); + sqo->wq = NULL; + } +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring) { @@ -1205,6 +1548,7 @@ static void io_free_scq_urings(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { + io_sq_thread_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); io_sqe_buffer_unmap(ctx); @@ -1394,6 +1738,13 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; } + if (p->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) { + ctx->sq_thread_cpu = p->sq_thread_cpu; + + ret = io_sq_thread_start(ctx); + if (ret) + goto err; + } ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); @@ -1426,7 +1777,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags & ~IORING_SETUP_IOPOLL) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQTHREAD | + IORING_SETUP_SQWQ)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 80d1a8224b9c..79004940f7da 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -35,6 +35,8 @@ struct io_uring_sqe { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ +#define IORING_SETUP_SQTHREAD (1 << 1) /* Use SQ thread */ +#define IORING_SETUP_SQWQ (1 << 2) /* Use SQ workqueue */ #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 @@ -95,7 +97,8 @@ struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u16 resv[10]; + __u16 sq_thread_cpu; + __u16 resv[9]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; }; -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 14/15] io_uring: add submission polling Date: Wed, 9 Jan 2019 19:44:03 -0700 Message-ID: <20190110024404.25372-15-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org This enables an application to do IO, without ever entering the kernel. By using the SQ ring to fill in new events and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. For O_DIRECT, we can do this with just SQTHREAD being enabled. For buffered aio, we need the workqueue as well. If we can satisfy the buffered inline from the SQTHREAD, we do that. If not, we punt to the workqueue. This is just like buffered aio off the io_uring_enter(2) system call. Proof of concept. If the thread has been idle for 1 second, it will set sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard it's io_uring_enter(2) call with: barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, to_submit, 0, 0); instead of calling it unconditionally. Improvements: 1) Maybe have smarter backoff. Busy loop for X time, then go to monitor/mwait, finally the schedule we have now after an idle second. Might not be worth the complexity. 2) Probably want the application to pass in the appropriate grace period, not hard code it at 1 second. Signed-off-by: Jens Axboe --- fs/io_uring.c | 102 +++++++++++++++++++++++++++++++--- include/uapi/linux/io_uring.h | 3 + 2 files changed, 97 insertions(+), 8 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index da46872ecd67..6c62329b00ec 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -67,6 +67,7 @@ struct io_mapped_ubuf { struct io_sq_offload { struct task_struct *thread; /* if using a thread */ + bool thread_poll; struct workqueue_struct *wq; /* wq offload */ struct mm_struct *mm; struct files_struct *files; @@ -1145,17 +1146,35 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, { struct io_submit_state state, *statep = NULL; int ret, i, submitted = 0; + bool nonblock; if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr); statep = &state; } + /* + * Having both a thread and a workqueue only makes sense for buffered + * IO, where we can't submit in an async fashion. Use the NOWAIT + * trick from the SQ thread, and punt to the workqueue if we can't + * satisfy this iocb without blocking. This is only necessary + * for buffered IO with sqthread polled submission. 
+ */ + nonblock = (ctx->flags & IORING_SETUP_SQWQ) != 0; + for (i = 0; i < nr; i++) { - if (unlikely(mm_fault)) + if (unlikely(mm_fault)) { ret = -EFAULT; - else - ret = io_submit_sqe(ctx, &sqes[i], statep, false); + } else { + ret = io_submit_sqe(ctx, &sqes[i], statep, nonblock); + /* nogo, submit to workqueue */ + if (nonblock && ret == -EAGAIN) + ret = io_queue_async_work(ctx, &sqes[i]); + if (!ret) { + submitted++; + continue; + } + } if (!ret) { submitted++; continue; @@ -1171,7 +1190,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, } /* - * sq thread only supports O_DIRECT or FIXEDBUFS IO + * SQ thread is woken if the app asked for offloaded submission. This can + * be either O_DIRECT, in which case we do submissions directly, or it can + * be buffered IO, in which case we do them inline if we can do so without + * blocking. If we can't, then we punt to a workqueue. */ static int io_sq_thread(void *data) { @@ -1182,6 +1204,8 @@ static int io_sq_thread(void *data) struct files_struct *old_files; mm_segment_t old_fs; DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; old_files = current->files; current->files = sqo->files; @@ -1189,11 +1213,49 @@ static int io_sq_thread(void *data) old_fs = get_fs(); set_fs(USER_DS); + timeout = inflight = 0; while (!kthread_should_stop()) { bool mm_fault = false; int i; + if (sqo->thread_poll && inflight) { + unsigned int nr_events = 0; + + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. + */ + if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * App should not use IORING_ENTER_GETEVENTS + * with thread polling, but if it does, then + * ensure we are mutually exclusive. + */ + if (mutex_trylock(&ctx->uring_lock)) { + io_iopoll_check(ctx, &nr_events, 0); + mutex_unlock(&ctx->uring_lock); + } + } else { + nr_events = inflight; + } + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + HZ; + } + if (!io_peek_sqring(ctx, &sqes[0])) { + /* + * If we're polling, let us spin for a second without + * work before going to sleep. + */ + if (sqo->thread_poll) { + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + } + /* * Drop cur_mm before scheduling, we can't hold it for * long periods (or over schedule()). 
Do this before @@ -1207,6 +1269,13 @@ static int io_sq_thread(void *data) } prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + if (sqo->thread_poll) { + ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + if (!io_peek_sqring(ctx, &sqes[0])) { if (kthread_should_park()) kthread_parkme(); @@ -1218,6 +1287,13 @@ static int io_sq_thread(void *data) flush_signals(current); schedule(); finish_wait(&sqo->wait, &wait); + + if (sqo->thread_poll) { + struct io_sq_ring *ring; + + ring = ctx->sq_ring; + ring->flags &= ~IORING_SQ_NEED_WAKEUP; + } continue; } finish_wait(&sqo->wait, &wait); @@ -1240,7 +1316,7 @@ static int io_sq_thread(void *data) io_inc_sqring(ctx); } while (io_peek_sqring(ctx, &sqes[i])); - io_submit_sqes(ctx, sqes, i, cur_mm, mm_fault); + inflight += io_submit_sqes(ctx, sqes, i, cur_mm, mm_fault); } current->files = old_files; set_fs(old_fs); @@ -1483,6 +1559,9 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) if (!sqo->files) goto err; + if (ctx->flags & IORING_SETUP_SQPOLL) + sqo->thread_poll = true; + if (ctx->flags & IORING_SETUP_SQTHREAD) { sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, ctx->sq_thread_cpu, @@ -1493,7 +1572,8 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) goto err; } wake_up_process(sqo->thread); - } else if (ctx->flags & IORING_SETUP_SQWQ) { + } + if (ctx->flags & IORING_SETUP_SQWQ) { int concurrency; /* Do QD, or 2 * CPUS, whatever is smallest */ @@ -1524,7 +1604,8 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx) kthread_park(sqo->thread); kthread_stop(sqo->thread); sqo->thread = NULL; - } else if (sqo->wq) { + } + if (sqo->wq) { destroy_workqueue(sqo->wq); sqo->wq = NULL; } @@ -1738,6 +1819,11 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; } + if ((p->flags & IORING_SETUP_SQPOLL) && + !(p->flags & IORING_SETUP_SQTHREAD)) { + ret = -EINVAL; + goto err; + } if (p->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) { ctx->sq_thread_cpu = p->sq_thread_cpu; @@ -1778,7 +1864,7 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, } if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQTHREAD | - IORING_SETUP_SQWQ)) + IORING_SETUP_SQWQ | IORING_SETUP_SQPOLL)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 79004940f7da..9321eb97479d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -37,6 +37,7 @@ struct io_uring_sqe { #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ #define IORING_SETUP_SQTHREAD (1 << 1) /* Use SQ thread */ #define IORING_SETUP_SQWQ (1 << 2) /* Use SQ workqueue */ +#define IORING_SETUP_SQPOLL (1 << 3) /* SQ thread polls */ #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 @@ -75,6 +76,8 @@ struct io_sqring_offsets { __u32 resv[3]; }; +#define IORING_SQ_NEED_WAKEUP (1 << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
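Spelled out slightly more fully than the two-line guard above, the userspace side of the wakeup handshake could look like the sketch below. The struct and helper names are made up, io_uring_enter() is an assumed thin wrapper around the new system call, and only the flags test against IORING_SQ_NEED_WAKEUP is prescribed by the patch.

#include <linux/io_uring.h>	/* uapi header added by this series */

/* assumed thin wrapper around the io_uring_enter(2) system call */
extern int io_uring_enter(int fd, unsigned to_submit, unsigned min_complete,
			  unsigned flags);

/* hypothetical app-side view of the mmap'ed SQ ring */
struct app_sq {
	volatile unsigned *flags;	/* points at sq_ring->flags */
	/* head/tail/ring array pointers would live here as well */
};

/*
 * Hypothetical submit path for IORING_SETUP_SQPOLL: the kernel thread
 * normally consumes new sqes on its own, so the syscall is only needed
 * once the thread has gone idle and set IORING_SQ_NEED_WAKEUP.
 */
static void app_submit(struct app_sq *sq, int ring_fd, unsigned to_submit)
{
	/* ... fill 'to_submit' sqes and publish the new SQ tail here ... */

	/* make sure the flags load cannot be ordered before the tail update */
	__sync_synchronize();

	if (*sq->flags & IORING_SQ_NEED_WAKEUP)
		io_uring_enter(ring_fd, to_submit, 0, 0);
}

If the ring is kept busy the branch is never taken, which is what makes the fully syscall-free submission path possible.
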
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: [PATCH 15/15] io_uring: add io_uring_event cache hit information Date: Wed, 9 Jan 2019 19:44:04 -0700 Message-ID: <20190110024404.25372-16-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Add hint on whether a read was served out of the page cache, or if it hit media. This is useful for buffered async IO, O_DIRECT reads would never have this set (for obvious reasons). If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT set. Signed-off-by: Jens Axboe --- fs/io_uring.c | 7 ++++++- include/uapi/linux/io_uring.h | 5 +++++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c62329b00ec..f2a603c447ba 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -511,11 +511,16 @@ static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + unsigned ev_flags = 0; kiocb_end_write(kiocb); fput(kiocb->ki_filp); - io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); + + if (res > 0 && test_bit(KIOCB_F_FORCE_NONBLOCK, &iocb->ki_flags)) + ev_flags = IOCQE_FLAG_CACHEHIT; + + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, ev_flags); io_free_kiocb(iocb); } diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 9321eb97479d..20e4c22e040d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -55,6 +55,11 @@ struct io_uring_cqe { __u32 flags; }; +/* + * io_uring_event->flags + */ +#define IOCQE_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + /* * Magic offsets for the application to mmap the data it needs */ -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff Moyer Subject: Re: [PATCH 15/15] io_uring: add io_uring_event cache hit information Date: Thu, 10 Jan 2019 18:12:49 -0500 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-16-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, avi@scylladb.com To: Jens Axboe Return-path: In-Reply-To: <20190110024404.25372-16-axboe@kernel.dk> (Jens Axboe's message of "Wed, 9 Jan 2019 19:44:04 -0700") Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Jens Axboe writes: > Add hint on whether a read was served out of the page cache, or if it > hit media. This is useful for buffered async IO, O_DIRECT reads would > never have this set (for obvious reasons). > > If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT > set. We may want to hold off on this one until the whole mincore/RWF_NOWAIT debate is sorted. 
[1] Cheers, Jeff [1] https://lore.kernel.org/lkml/20190109022430.GE27534@dastard/ -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: Re: [PATCH 15/15] io_uring: add io_uring_event cache hit information Date: Thu, 10 Jan 2019 16:47:28 -0700 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-16-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, avi@scylladb.com To: Jeff Moyer Return-path: In-Reply-To: Content-Language: en-US Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 1/10/19 4:12 PM, Jeff Moyer wrote: > Jens Axboe writes: > >> Add hint on whether a read was served out of the page cache, or if it >> hit media. This is useful for buffered async IO, O_DIRECT reads would >> never have this set (for obvious reasons). >> >> If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT >> set. > > We may want to hold off on this one until the whole mincore/RWF_NOWAIT > debate is sorted. [1] Definitely, it's why it's separate and at the end of the series. But in reality, this doesn't leak anything that timing doesn't already tell you. So it's kind of a moot point. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Roman Penyaev Subject: Re: [PATCHSET v2] io_uring IO interface Date: Fri, 11 Jan 2019 10:46:01 +0100 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org To: Jens Axboe Return-path: In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Hi Jens, That is interesting. Recently I sent an rfc related to epoll uring: https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de which can be mapped to userspace and all ready events can be consumed from it directly. I am wondering, is it possible to make some common API for all kind of ready events / urings, or it doesn't make any sense? -- Roman On 2019-01-10 03:43, Jens Axboe wrote: > Here's v2 of the io_uring interface. See the v1 posting for some more > info: > > https://lore.kernel.org/linux-block/20190108165645.19311-1-axboe@kernel.dk/ > > The data structures changed, to improve the symmetry of the submission > and completion side. The io_uring_iocb is now io_uring_sqe, but it > otherwise remains the same as before. Ditto on the completion side, > where io_uring_event is now io_uring_cqe. > > I've updated the fio io_uring test app, and the io_uring engine. The > liburing git repo has also been adapted to the various changes since > the > v1 posting. As a reminder, the liburing git repo contains some helpers > for doing IO without having to muck with the ring directly, setting up > an io_uring context, etc. 
Clone that here: > > git://git.kernel.dk/liburing > > In terms of usage, there's also a small test app here: > > http://git.kernel.dk/cgit/fio/plain/t/io_uring.c > > and the liburing repo has a few test apps in test/ as well. > > Patches are aginst 5.0-rc1, but can also be found here: > > git://git.kernel.dk/linux-block io_uring > > Changes since v1: > > - Fail IORING_OP_{READ,WRITE}_FIXED if not configured > - Fix ctx drop ref issue on failure to close ring_fd when sq thread/wq > are in use > - Move to separate Kconfig entry (CONFIG_IO_URING) > - Add SPDX headers > - Drop gcc ism of zero sized arrays > - Rename io_uring_iocb -> io_uring_sqe > - Rename io_uring_event -> io_uring_cqe > - Drop needless io_event_ring and io_iocb_ring structures > - Drop ctx->max_reqs, use ->sq_entries > - Drop unused ->ring_lock > - Drop io_ring_ctx slab cache > - Fix state batched kiocb alloc failure to put ctx > - Fix missing write ordering barrier when filling in the cqe > - Drop io_req_init() > - Various renames > - Fix a few lines that were too long > - Address other minor review comments > - Fix IORING_SETUP_SQPOLL being set without IORING_SETUP_SQTHREAD > - Drop IORING_SETUP_FIXEDBUFS, iovecs being non-NULL is enough > - Fix error handling free of ctx in setup path > - Change standard read/write commands to be iov based READV/WRITEV > - Pass in struct sqe_submit instead of separate sqe/index everywhere > - Fix reap of polled events on fops->release() > - Lock uring for sq thread polling > - Don't grab ->completion_lock for polled IO cqe filling > - Fix ev_flags vs flags typo > - Consolidate parts of the io_ring_ctx alignment > > Documentation/filesystems/vfs.txt | 3 + > arch/x86/entry/syscalls/syscall_64.tbl | 2 + > block/bio.c | 59 +- > fs/Makefile | 1 + > fs/block_dev.c | 19 +- > fs/file.c | 15 +- > fs/file_table.c | 9 +- > fs/gfs2/file.c | 2 + > fs/io_uring.c | 1890 ++++++++++++++++++++++++ > fs/iomap.c | 48 +- > fs/xfs/xfs_file.c | 1 + > include/linux/bio.h | 14 + > include/linux/blk_types.h | 1 + > include/linux/file.h | 2 + > include/linux/fs.h | 6 +- > include/linux/iomap.h | 1 + > include/linux/syscalls.h | 5 + > include/uapi/linux/io_uring.h | 114 ++ > init/Kconfig | 8 + > kernel/sys_ni.c | 2 + > 20 files changed, 2163 insertions(+), 39 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Ilya Dryomov Subject: Re: [PATCHSET v2] io_uring IO interface Date: Fri, 11 Jan 2019 17:11:57 +0100 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Cc: Jens Axboe , linux-fsdevel , linux-aio@kvack.org, linux-block , linux-arch@vger.kernel.org, Christoph Hellwig , jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org To: Roman Penyaev Return-path: In-Reply-To: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Jan 11, 2019 at 10:51 AM Roman Penyaev wrote: > > Hi Jens, > > That is interesting. Recently I sent an rfc related to epoll uring: > > https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de > > which can be mapped to userspace and all ready events can be consumed > from it directly. I am wondering, is it possible to make some common > API for all kind of ready events / urings, or it doesn't make any > sense? 
I think you can use the new IOCB_CMD_POLL from Christoph and avoid epoll_wait() in favor of aio/io_uring interface, at least in new high performance applications. Reaping events entirely in userspace (i.e. performing io_getevents() without entering the kernel) has been possible for a long time even with the existing aio interface. Thanks, Ilya -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Christoph Hellwig Subject: Re: [PATCHSET v2] io_uring IO interface Date: Fri, 11 Jan 2019 17:21:43 +0100 Message-ID: <20190111162143.GA14914@lst.de> References: <20190110024404.25372-1-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Roman Penyaev , Jens Axboe , linux-fsdevel , linux-aio@kvack.org, linux-block , linux-arch@vger.kernel.org, Christoph Hellwig , jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org To: Ilya Dryomov Return-path: Content-Disposition: inline In-Reply-To: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On Fri, Jan 11, 2019 at 05:11:57PM +0100, Ilya Dryomov wrote: > I think you can use the new IOCB_CMD_POLL from Christoph and avoid > epoll_wait() in favor of aio/io_uring interface, at least in new high > performance applications. Reaping events entirely in userspace (i.e. > performing io_getevents() without entering the kernel) has been > possible for a long time even with the existing aio interface. For io_uring we can reuse the IOCB_CMD_POLL concept, but we'd have to add a new cancel command, as the uring right now doesn't support cancelation. But I'd rather make that command a new opcode instead of a separate syscall, which would lead to a nicer design. A prototype for this should be fairly easy, I'd just want someone to actually use it for real life testing, like ScyllaDB does for IOCB_CMD_POLL. -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Roman Penyaev Subject: Re: [PATCHSET v2] io_uring IO interface Date: Fri, 11 Jan 2019 17:39:33 +0100 Message-ID: <43189f1be5f03697f750631d31ffaed5@suse.de> References: <20190110024404.25372-1-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Cc: Jens Axboe , linux-fsdevel , linux-aio@kvack.org, linux-block , linux-arch@vger.kernel.org, Christoph Hellwig , jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org To: Ilya Dryomov Return-path: In-Reply-To: Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 2019-01-11 17:11, Ilya Dryomov wrote: > On Fri, Jan 11, 2019 at 10:51 AM Roman Penyaev > wrote: >> >> Hi Jens, >> >> That is interesting. Recently I sent an rfc related to epoll uring: >> >> https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de >> >> which can be mapped to userspace and all ready events can be consumed >> from it directly. I am wondering, is it possible to make some common >> API for all kind of ready events / urings, or it doesn't make any >> sense? > > I think you can use the new IOCB_CMD_POLL from Christoph and avoid > epoll_wait() in favor of aio/io_uring interface, at least in new high > performance applications. 
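
The userspace reaping mentioned here maps directly onto the mmap'ed CQ ring in this series: completions can be consumed with plain loads and stores plus a barrier, no syscall required. A sketch of a single-consumer reap, assuming pointers obtained from the io_cqring_offsets mapping (the helper name and parameter set are illustrative, not from the posted code):

#include <stdatomic.h>
#include "io_uring.h"	/* uapi header from this series, for struct io_uring_cqe */

/* khead/ktail point at the shared CQ head/tail words in the mapping. */
static int reap_one_cqe(_Atomic unsigned *khead, const _Atomic unsigned *ktail,
			unsigned ring_mask, const struct io_uring_cqe *cqes,
			struct io_uring_cqe *out)
{
	unsigned head = atomic_load_explicit(khead, memory_order_relaxed);

	/* Acquire pairs with the kernel's write barrier when it fills a cqe. */
	if (head == atomic_load_explicit(ktail, memory_order_acquire))
		return 0;			/* ring empty */

	*out = cqes[head & ring_mask];
	/* Publish the new head so the kernel can reuse the slot. */
	atomic_store_explicit(khead, head + 1, memory_order_release);
	return 1;
}

io_uring_enter() with IORING_ENTER_GETEVENTS is then only needed when the application actually wants to block for completions.
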
Yeah, I saw this extension for aio from Christoph. I was motivated to extend epoll with uring to avoid constant recharging of file descriptors, i.e. once you inserted descriptor to epoll you just consume events from uring (of course in that particular case only edge triggered events are supported). Also recently for epoll I fixed contention on event callback making hot path completely lockless, i.e. with uring epoll can become a nice thingy in terms of performance. But can any descriptor (on vfs layer) be extended to have a uring? To have some common API? Then if event source (say socket) has a uring (do not know how, just thoughts) and event destination (aio, epoll) has a uring, then reading on userside can be a matter of traversing urings. > Reaping events entirely in userspace (i.e. > performing io_getevents() without entering the kernel) has been > possible for a long time even with the existing aio interface. True. -- Roman -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: Re: [PATCHSET v2] io_uring IO interface Date: Fri, 11 Jan 2019 11:05:35 -0700 Message-ID: <4a4a23c7-7842-0a12-2d46-c892cf2082bd@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org To: Roman Penyaev Return-path: In-Reply-To: Content-Language: en-US Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 1/11/19 2:46 AM, Roman Penyaev wrote: > Hi Jens, > > That is interesting. Recently I sent an rfc related to epoll uring: > > https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de > > which can be mapped to userspace and all ready events can be consumed > from it directly. I am wondering, is it possible to make some common > API for all kind of ready events / urings, or it doesn't make any > sense? Not sure that's easily possible, even out of the two rings in io_uring, the sq and cq rings aren't the same. The latter is sequentially written, as completions come in. The former ring is actually indexes into the array, so you can submit things out of order when needed. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Martin K. 
Petersen" Subject: Re: [PATCH 05/15] Add io_uring IO interface Date: Fri, 11 Jan 2019 13:19:35 -0500 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com To: Jens Axboe Return-path: In-Reply-To: <20190110024404.25372-6-axboe@kernel.dk> (Jens Axboe's message of "Wed, 9 Jan 2019 19:43:54 -0700") Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Jens, > +struct io_uring_sqe { > + __u8 opcode; > + __u8 flags; > + __u16 ioprio; > + __s32 fd; > + __u64 off; > + union { > + void *addr; > + __u64 __pad; > + }; > + __u32 len; > + union { > + __kernel_rwf_t rw_flags; > + __u32 __resv; > + }; > +}; A bit tongue in cheek and yet somewhat serious: While I'm super excited about the 4 x 64 bitness of the sqe, where does the integrity buffer go? Or the 128-bit KV store key. How do we extend this interface beyond the flags? -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: Re: [PATCH 05/15] Add io_uring IO interface Date: Fri, 11 Jan 2019 11:34:40 -0700 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com To: "Martin K. Petersen" Return-path: In-Reply-To: Content-Language: en-US Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 1/11/19 11:19 AM, Martin K. Petersen wrote: > > Jens, > >> +struct io_uring_sqe { >> + __u8 opcode; >> + __u8 flags; >> + __u16 ioprio; >> + __s32 fd; >> + __u64 off; >> + union { >> + void *addr; >> + __u64 __pad; >> + }; >> + __u32 len; >> + union { >> + __kernel_rwf_t rw_flags; >> + __u32 __resv; >> + }; >> +}; > > A bit tongue in cheek and yet somewhat serious: While I'm super excited > about the 4 x 64 bitness of the sqe, where does the integrity buffer go? > Or the 128-bit KV store key. How do we extend this interface beyond the > flags? For integrity buffers, how about we stash them on the side? The newer series has an extra system call, io_uring_register(), which is currently used for registering files and buffers for IO on the side. You could trivially tie an integrity buffer to an sqe through that. For KV, I thint that's an actually interesting use case (sorry, integrity), and we might just want to bite the bullet and extend the sqe to full 64 bytes. We're currently at 48 bytes, which leaves us with 16 bytes of space for KV, and other use cases. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. 
For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe Subject: Re: [PATCH 05/15] Add io_uring IO interface Date: Sun, 13 Jan 2019 09:22:16 -0700 Message-ID: <54976aac-bef2-880f-dc10-e3030189a08a@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com To: "Martin K. Petersen" Return-path: In-Reply-To: Content-Language: en-US Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org On 1/11/19 11:34 AM, Jens Axboe wrote: > On 1/11/19 11:19 AM, Martin K. Petersen wrote: >> >> Jens, >> >>> +struct io_uring_sqe { >>> + __u8 opcode; >>> + __u8 flags; >>> + __u16 ioprio; >>> + __s32 fd; >>> + __u64 off; >>> + union { >>> + void *addr; >>> + __u64 __pad; >>> + }; >>> + __u32 len; >>> + union { >>> + __kernel_rwf_t rw_flags; >>> + __u32 __resv; >>> + }; >>> +}; >> >> A bit tongue in cheek and yet somewhat serious: While I'm super excited >> about the 4 x 64 bitness of the sqe, where does the integrity buffer go? >> Or the 128-bit KV store key. How do we extend this interface beyond the >> flags? > > For integrity buffers, how about we stash them on the side? The newer > series has an extra system call, io_uring_register(), which is currently > used for registering files and buffers for IO on the side. You could > trivially tie an integrity buffer to an sqe through that. > > For KV, I thint that's an actually interesting use case (sorry, > integrity), and we might just want to bite the bullet and extend the sqe > to full 64 bytes. We're currently at 48 bytes, which leaves us with 16 > bytes of space for KV, and other use cases. I bit the bullet and bumped the size. 64 bytes is a nicer size in terms of cachelines anyway, and I really doubt that 48 vs 64 bytes makes a size consumption problem for anyone. The buf_index is only used for the fixed buffers, which means that we have 16 bytes / 128 bits that we can grab for things like KV. -- Jens Axboe -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Martin K. Petersen" Subject: Re: [PATCH 05/15] Add io_uring IO interface Date: Tue, 15 Jan 2019 12:31:59 -0500 Message-ID: References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> <54976aac-bef2-880f-dc10-e3030189a08a@kernel.dk> Mime-Version: 1.0 Content-Type: text/plain Cc: "Martin K. Petersen" , linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com To: Jens Axboe Return-path: In-Reply-To: <54976aac-bef2-880f-dc10-e3030189a08a@kernel.dk> (Jens Axboe's message of "Sun, 13 Jan 2019 09:22:16 -0700") Sender: owner-linux-aio@kvack.org List-Id: linux-fsdevel.vger.kernel.org Jens, > I bit the bullet and bumped the size. 64 bytes is a nicer size in terms > of cachelines anyway, and I really doubt that 48 vs 64 bytes makes a > size consumption problem for anyone. > > The buf_index is only used for the fixed buffers, which means that we have > 16 bytes / 128 bits that we can grab for things like KV. Great! 
-- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 03/15] block: add bio_set_polled() helper Date: Wed, 9 Jan 2019 19:43:52 -0700 Message-Id: <20190110024404.25372-4-axboe@kernel.dk> In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk>

For the upcoming async polled IO, we can't sleep allocating requests. If we do, then we introduce a deadlock where the submitter already has async polled IO in-flight, but can't wait for them to complete since polled requests must be active found and reaped. Utilize the helper in the blockdev DIRECT_IO code.
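
The deadlock is worth spelling out: with polled IO the submitter is the only one that can complete its own in-flight requests, so it must never sleep waiting for a request it could only free by polling. A sketch of the resulting submit loop, with two stand-in helpers that are not functions from this patch:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/errno.h>

/* Stand-ins for the real submission path and the ->iopoll() reap path. */
extern int submit_polled_bio(struct bio *bio);		/* 0 or -EAGAIN */
extern void reap_polled_completions(struct request_queue *q);

static int submit_or_reap(struct request_queue *q, struct bio *bio)
{
	int ret;

	/*
	 * bio_set_polled() adds REQ_NOWAIT for async submitters, so request
	 * allocation fails with -EAGAIN/-EWOULDBLOCK instead of sleeping;
	 * the submitter reaps its own completions and retries, which is
	 * what breaks the deadlock described above.
	 */
	while ((ret = submit_polled_bio(bio)) == -EAGAIN)
		reap_polled_completions(q);

	return ret;
}
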
Signed-off-by: Jens Axboe --- fs/block_dev.c | 4 ++-- include/linux/bio.h | 14 ++++++++++++++ 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 5415579f3e14..2ebd2a0d7789 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, task_io_account_write(ret); } if (iocb->ki_flags & IOCB_HIPRI) - bio.bi_opf |= REQ_HIPRI; + bio_set_polled(&bio, iocb); qc = submit_bio(&bio); for (;;) { @@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); if (!nr_pages) { if (iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; + bio_set_polled(bio, iocb); qc = submit_bio(bio); WRITE_ONCE(iocb->ki_cookie, qc); diff --git a/include/linux/bio.h b/include/linux/bio.h index 7380b094dcca..f6f0a2b3cbc8 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page, #endif /* CONFIG_BLK_DEV_INTEGRITY */ +/* + * Mark a bio as polled. Note that for async polled IO, the caller must + * expect -EWOULDBLOCK if we cannot allocate a request (or other resources). + * We cannot block waiting for requests on polled IO, as those completions + * must be found by the caller. This is different than IRQ driven IO, where + * it's safe to wait for IO to complete. + */ +static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb) +{ + bio->bi_opf |= REQ_HIPRI; + if (!is_sync_kiocb(kiocb)) + bio->bi_opf |= REQ_NOWAIT; +} + #endif /* CONFIG_BLOCK */ #endif /* __LINUX_BIO_H */ -- 2.17.1

From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 04/15] iomap: wire up the iopoll method Date: Wed, 9 Jan 2019 19:43:53 -0700 Message-Id: <20190110024404.25372-5-axboe@kernel.dk> In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk>

From: Christoph Hellwig

Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig Modified to use bio_set_polled().
Signed-off-by: Jens Axboe --- fs/gfs2/file.c | 2 ++ fs/iomap.c | 43 ++++++++++++++++++++++++++++--------------- fs/xfs/xfs_file.c | 1 + include/linux/iomap.h | 1 + 4 files changed, 32 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index a2dea5bc0427..58a768e59712 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, @@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, diff --git a/fs/iomap.c b/fs/iomap.c index a3088fae567b..4ee50b76b4a1 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1454,6 +1454,28 @@ struct iomap_dio { }; }; +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin) +{ + struct request_queue *q = READ_ONCE(kiocb->private); + + if (!q) + return 0; + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin); +} +EXPORT_SYMBOL_GPL(iomap_dio_iopoll); + +static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap, + struct bio *bio) +{ + atomic_inc(&dio->ref); + + if (dio->iocb->ki_flags & IOCB_HIPRI) + bio_set_polled(bio, dio->iocb); + + dio->submit.last_queue = bdev_get_queue(iomap->bdev); + dio->submit.cookie = submit_bio(bio); +} + static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; @@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio) } } -static blk_qc_t +static void iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, unsigned len) { @@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - if (dio->iocb->ki_flags & IOCB_HIPRI) - flags |= REQ_HIPRI; - get_page(page); __bio_add_page(bio, page, len, 0); bio_set_op_attrs(bio, REQ_OP_WRITE, flags); - - atomic_inc(&dio->ref); - return submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } static loff_t @@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio_set_pages_dirty(bio); } - if (dio->iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; - iov_iter_advance(dio->submit.iter, n); dio->size += n; @@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, copied += n; nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES); - - atomic_inc(&dio->ref); - - dio->submit.last_queue = bdev_get_queue(iomap->bdev); - dio->submit.cookie = submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } while (nr_pages); /* @@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (dio->flags & IOMAP_DIO_WRITE_FUA) dio->flags &= ~IOMAP_DIO_NEED_SYNC; + WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); + WRITE_ONCE(iocb->private, dio->submit.last_queue); + if (!atomic_dec_and_test(&dio->ref)) { if (!dio->wait_for_completion) return -EIOCBQUEUED; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e47425071e65..60c2da41f0fc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = { .write_iter = xfs_file_write_iter, .splice_read = generic_file_splice_read, .splice_write = iter_file_splice_write, + .iopoll 
= iomap_dio_iopoll, .unlocked_ioctl = xfs_file_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = xfs_file_compat_ioctl, diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 9a4258154b25..0fefb5455bda 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret, unsigned flags); ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops, iomap_dio_end_io_t end_io); +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin); #ifdef CONFIG_SWAP struct file; -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3906FC43612 for ; Thu, 10 Jan 2019 02:44:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 074E321738 for ; Thu, 10 Jan 2019 02:44:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="RGeW4wm3" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727079AbfAJCob (ORCPT ); Wed, 9 Jan 2019 21:44:31 -0500 Received: from mail-pf1-f196.google.com ([209.85.210.196]:41281 "EHLO mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727083AbfAJCoa (ORCPT ); Wed, 9 Jan 2019 21:44:30 -0500 Received: by mail-pf1-f196.google.com with SMTP id b7so4596956pfi.8 for ; Wed, 09 Jan 2019 18:44:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=z1SsFfT+rKseQTAv05MoMGn60EiztrQ0b2unMjq80nI=; b=RGeW4wm3qyNM79DQyvFKsSxQbpCZmyPgrHtrvzNiNnVLgLAj9dn1yoFxOVsJgz+4h1 HyVhT08vaBbVtuceNRwohADbk5NE0TyMNkRHve3+5bwY9JdipowKWaM+8BNHjV2X/zxt kWZrdxJAz1xI6GNxqVl7XtPeQ+RJW0lrQMMlq+HseKfADQRG/T91R/eb+XPrVScTlHzl uPhYmHQuZ/VRYj5+cwMlFv/PcJ+pLuAFspMojgaDWKLlCzHiqv9+9BDpAeQ14VzC/5b0 KK3g3nEaYjBT0VvWzJMUh6LzjfHb1zBed3BfJR5JBy1i4iD8k+jNq+z4pvMhXO24DPLM 57UQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=z1SsFfT+rKseQTAv05MoMGn60EiztrQ0b2unMjq80nI=; b=VJ7o6UmFA/q/QxzT4tKbwex6L3AzwLfDjBXorN6RqznRLCg/go01cYSLAo6dx07CbU f0/p461My9zPZ4Y8559v8UyKTe9OYo110CEwolTyUrElu8aR4ih7pFeyjf5zAs/7zxYv LyNwA5vhsnKxQIIiahV4sbgTm9mQYHDrHUXUfXptQ2ovODL6SeDo7drMdzBpnfS0MLNI CrI/FCfpZzdsuoyyBQc3XThM4DNeBcpWNKHDxJ+yUg84Vkm2AJeiBAZ0UWYcfUPb3NtE GwipNT7y0cHsYNEp+CcohXRRoBpRf9SNhoEhfMpDF2XRaI05fqqVU5h+05YNk35OAl43 b/6g== X-Gm-Message-State: AJcUukfBkmuY0Pcf1WMxnsx2fRqVlsCSGBI42g5gI0Nv/UJweP/Z55qP YJtElC0FAHGy+O97j297QQzRpZwUA6lU7w== X-Google-Smtp-Source: ALg8bN7RohdV9aia09hn3He4iCefNA+UX6pcC+mFoDB9CNacPgp+FMbkyFgTMIwvsHluefD/bHkyPQ== X-Received: by 2002:a62:4641:: with SMTP id t62mr8347977pfa.141.1547088268338; Wed, 09 Jan 2019 18:44:28 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. 
[66.29.188.166]) by smtp.gmail.com with ESMTPSA id v15sm105799631pfn.94.2019.01.09.18.44.26 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jan 2019 18:44:27 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 06/15] io_uring: support for IO polling Date: Wed, 9 Jan 2019 19:43:55 -0700 Message-Id: <20190110024404.25372-7-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190110024355.JW0shKBsdpSxEBPZrIEfbIQTOLXZyHXFlLv9zrLkCbo@z> Add support for polled read and write commands. These act like their non-polled counterparts, except we expect to poll for completion of them. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Signed-off-by: Jens Axboe --- fs/io_uring.c | 247 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 + 2 files changed, 239 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 0bad563f3486..c872bfb32a03 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -71,13 +71,17 @@ struct io_ring_ctx { struct completion ctx_done; + /* iopoll submission state */ struct { - struct mutex uring_lock; - wait_queue_head_t wait; + spinlock_t poll_lock; + struct list_head poll_submitted; } ____cacheline_aligned_in_smp; struct { + struct list_head poll_completing; spinlock_t completion_lock; + struct mutex uring_lock; + wait_queue_head_t wait; } ____cacheline_aligned_in_smp; }; @@ -97,9 +101,12 @@ struct io_kiocb { unsigned long ki_index; struct list_head ki_list; unsigned long ki_flags; +#define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ +#define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 struct sqe_submit { const struct io_uring_sqe *sqe; @@ -136,6 +143,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); init_waitqueue_head(&ctx->wait); + spin_lock_init(&ctx->poll_lock); + INIT_LIST_HEAD(&ctx->poll_submitted); + INIT_LIST_HEAD(&ctx->poll_completing); mutex_init(&ctx->uring_lock); return ctx; @@ -187,12 +197,151 @@ static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) wake_up(&ctx->wait); } +static void io_free_kiocb_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) +{ + if (*nr) { + kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs); + io_ring_drop_ctx_ref(ctx, *nr); + *nr = 0; + } +} + static void io_free_kiocb(struct io_kiocb *iocb) { kmem_cache_free(kiocb_cachep, iocb); io_ring_drop_ctx_ref(iocb->ki_ctx, 1); } +/* + * Find and free completed poll iocbs + */ +static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) +{ + void *iocbs[IO_IOPOLL_BATCH]; + struct io_kiocb *iocb, *n; + int to_free = 0; + + list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { + if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + continue; + if (to_free == ARRAY_SIZE(iocbs)) + io_free_kiocb_many(ctx, iocbs, &to_free); + + list_del(&iocb->ki_list); + iocbs[to_free++] = iocb; + + fput(iocb->rw.ki_filp); + (*nr_events)++; + } + + if 
(to_free) + io_free_kiocb_many(ctx, iocbs, &to_free); +} + +/* + * Poll for a mininum of 'min' events, and a maximum of 'max'. Note that if + * min == 0 we consider that a non-spinning poll check - we'll still enter + * the driver poll loop, but only as a non-spinning completion check. + */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + struct io_kiocb *iocb; + int found, polled, ret; + + /* + * Check if we already have done events that satisfy what we need + */ + if (!list_empty(&ctx->poll_completing)) { + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + } + + /* + * Take in a new working set from the submitted list, if possible. + */ + if (!list_empty_careful(&ctx->poll_submitted)) { + spin_lock(&ctx->poll_lock); + list_splice_init(&ctx->poll_submitted, &ctx->poll_completing); + spin_unlock(&ctx->poll_lock); + } + + if (list_empty(&ctx->poll_completing)) + return 0; + + /* + * Check again now that we have a new batch. + */ + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + + polled = found = 0; + list_for_each_entry(iocb, &ctx->poll_completing, ki_list) { + /* + * Poll for needed events with spin == true, anything after + * that we just check if we have more, up to max. + */ + bool spin = !polled || *nr_events < min; + struct kiocb *kiocb = &iocb->rw; + + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + break; + + found++; + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + return ret; + + polled += ret; + } + + io_iopoll_reap(ctx, nr_events); + if (*nr_events >= min) + return 0; + return found; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them. + */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + mutex_lock(&ctx->uring_lock); + while (!list_empty_careful(&ctx->poll_submitted) || + !list_empty(&ctx->poll_completing)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } + mutex_unlock(&ctx->uring_lock); +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + while (!*nr_events || !need_resched()) { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -208,18 +357,16 @@ static void kiocb_end_write(struct kiocb *kiocb) } } -static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, - long res, unsigned ev_flags) +static void __io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) { struct io_uring_cqe *cqe; - unsigned long flags; /* * If we can't get a cq entry, userspace overflowed the * submission (by quite a lot). Increment the overflow count in * the ring. 
*/ - spin_lock_irqsave(&ctx->completion_lock, flags); cqe = io_peek_cqring(ctx); if (cqe) { cqe->index = ki_index; @@ -229,6 +376,15 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, io_inc_cqring(ctx); } else ctx->cq_ring->overflow++; +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + __io_cqring_fill_event(ctx, ki_index, res, ev_flags); spin_unlock_irqrestore(&ctx->completion_lock, flags); } @@ -243,8 +399,23 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) io_free_kiocb(iocb); } +static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + if (unlikely(res == -EAGAIN)) { + set_bit(KIOCB_F_IOPOLL_EAGAIN, &iocb->ki_flags); + } else { + __io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); + set_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags); + } +} + static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) { + struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; @@ -266,12 +437,22 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) ret = kiocb_set_rw_flags(req, sqe->rw_flags); if (unlikely(ret)) goto out_fput; - if (req->ki_flags & IOCB_HIPRI) { - ret = -EINVAL; - goto out_fput; - } - req->ki_complete = io_complete_scqring_rw; + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(req->ki_flags & IOCB_DIRECT) || + !req->ki_filp->f_op->iopoll) + goto out_fput; + + req->ki_flags |= IOCB_HIPRI; + req->ki_complete = io_complete_scqring_iopoll; + } else { + if (req->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + req->ki_complete = io_complete_scqring_rw; + } return 0; out_fput: fput(req->ki_filp); @@ -298,6 +479,30 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } } +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. + */ +static void io_iopoll_kiocb_issued(struct io_kiocb *kiocb) +{ + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. We can't add to the poll_done + * list as that's unlocked from the completion side. 
+ */ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + + spin_lock(&ctx->poll_lock); + if (front) + list_add(&kiocb->ki_list, &ctx->poll_submitted); + else + list_add_tail(&kiocb->ki_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -400,6 +605,8 @@ static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, { struct fsync_iocb *req = &kiocb->fsync; + if (kiocb->ki_ctx->flags & IORING_SETUP_IOPOLL) + return -EINVAL; if (unlikely(sqe->addr || sqe->off || sqe->len || sqe->__resv)) return -EINVAL; @@ -461,6 +668,13 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) */ if (ret) goto out_put_req; + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (test_bit(KIOCB_F_IOPOLL_EAGAIN, &req->ki_flags)) { + ret = -EAGAIN; + goto out_put_req; + } + io_iopoll_kiocb_issued(req); + } return 0; out_put_req: io_free_kiocb(req); @@ -573,12 +787,17 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; int get_ret; if (!ret && to_submit) min_complete = 0; - get_ret = io_cqring_wait(ctx, min_complete); + if (ctx->flags & IORING_SETUP_IOPOLL) + get_ret = io_iopoll_check(ctx, &nr_events, + min_complete); + else + get_ret = io_cqring_wait(ctx, min_complete); if (get_ret < 0 && !ret) ret = get_ret; } @@ -604,6 +823,7 @@ static void io_free_scq_urings(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { + io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); @@ -612,6 +832,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { percpu_ref_kill(&ctx->refs); + io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); } @@ -815,7 +1036,7 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; if (iovecs) return -EINVAL; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ae30ed41965f..ba9e5b851f73 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -31,6 +31,11 @@ struct io_uring_sqe { }; }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ + #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0B040C43444 for ; Thu, 10 Jan 2019 02:44:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id AEA5A20665 for ; Thu, 10 Jan 2019 02:44:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 05/15] Add io_uring IO interface
Date: Wed, 9 Jan 2019 19:43:54 -0700
Message-Id: <20190110024404.25372-6-axboe@kernel.dk>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
Message-ID: <20190110024354.hjfJxxmw6BYlejOkpMv4lLWw9PQsWzvv7meCKjdkCfw@z>

The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an array of indices into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered and can point to any io_uring_sqe.
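[Editorial aside, not part of the patch.] A rough user-space sketch of the indexing scheme described above, using the io_uring_sqe/io_uring_cqe layouts introduced by this patch: the application fills in an sqe anywhere in the sqe array and publishes its index through the SQ ring, while completions are read straight out of the CQ ring. The app_sq/app_cq wrappers are hypothetical stand-ins for pointers obtained by mmap()ing the ring regions (offsets are described further down), and the required memory barriers are reduced to comments.

#include <string.h>
#include <sys/uio.h>
#include <linux/io_uring.h>	/* the uapi header added by this patch */

struct app_sq {
	unsigned *head, *tail, *ring_mask;	/* fields inside the mmap'ed SQ ring */
	unsigned *array;			/* SQ ring: indices into the sqe array */
	struct io_uring_sqe *sqes;		/* separate mmap of the sqe array */
};

struct app_cq {
	unsigned *head, *tail, *ring_mask;
	struct io_uring_cqe *cqes;		/* cqes live directly in the CQ ring */
};

static void queue_readv(struct app_sq *sq, int fd, struct iovec *iov, unsigned nr_vecs)
{
	unsigned tail = *sq->tail;
	unsigned index = tail & *sq->ring_mask;		/* reuse the tail slot; any free sqe would do */
	struct io_uring_sqe *sqe = &sq->sqes[index];

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->fd = fd;
	sqe->addr = iov;				/* iovec array */
	sqe->len = nr_vecs;				/* number of iovecs */

	sq->array[tail & *sq->ring_mask] = index;	/* SQ ring entry holds the sqe index */
	/* a write barrier belongs here before publishing the new tail */
	*sq->tail = tail + 1;
}

static int reap_one(struct app_cq *cq)
{
	unsigned head = *cq->head;
	struct io_uring_cqe *cqe;

	/* a read barrier belongs here before inspecting the tail */
	if (head == *cq->tail)
		return -1;				/* nothing completed yet */

	cqe = &cq->cqes[head & *cq->ring_mask];
	/* cqe->index identifies the originating sqe, cqe->res is the result */
	*cq->head = head + 1;
	return cqe->res;
}

Submission is then kicked off with io_uring_enter(), described next.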
Two new system calls are added for this:

io_uring_setup(entries, iovecs, params) Sets up a context for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available.

With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 838 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 96 +++ init/Kconfig | 8 + kernel/sys_ni.c | 2 + 7 files changed, 952 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..0bad563f3486 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,838 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO.
+ * + * Copyright (C) 2019 Jens Axboe + */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + struct io_uring_sqe *sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct fsync_iocb { + struct work_struct work; + struct file *file; + bool datasync; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct fsync_iocb fsync; + }; + + struct io_ring_ctx *ki_ctx; + unsigned long ki_index; + struct list_head ki_list; + unsigned long ki_flags; +}; + +#define IO_PLUG_THRESHOLD 2 + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +static struct kmem_cache *kiocb_cachep; + +static const struct file_operations io_scqring_fops; + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + + init_completion(&ctx->ctx_done); + + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->cqes[tail & ctx->cq_mask]; +} + +static struct io_kiocb *io_get_kiocb(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + if (!req) + return NULL; + + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_kiocb(struct io_kiocb *iocb) +{ + kmem_cache_free(kiocb_cachep, iocb); + io_ring_drop_ctx_ref(iocb->ki_ctx, 1); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection 
from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + spin_lock_irqsave(&ctx->completion_lock, flags); + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->index = ki_index; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); + io_free_kiocb(iocb); +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +{ + struct kiocb *req = &kiocb->rw; + int ret; + + req->ki_filp = fget(sqe->fd); + if (unlikely(!req->ki_filp)) + return -EBADF; + req->ki_pos = sqe->off; + req->ki_flags = iocb_flags(req->ki_filp); + req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + req->ki_ioprio = sqe->ioprio; + } else + req->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(req, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (req->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + req->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(req->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. 
+ */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, sqe); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) + io_rw_done(req, call_read_iter(file, req, &iter)); + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, sqe); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. 
+ */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + req->ki_flags |= IOCB_WRITE; + io_rw_done(req, call_write_iter(file, req, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static void io_fsync_work(struct work_struct *work) +{ + struct fsync_iocb *req = container_of(work, struct fsync_iocb, work); + struct io_kiocb *iocb = container_of(req, struct io_kiocb, fsync); + int ret; + + ret = vfs_fsync(req->file, req->datasync); + fput(req->file); + + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, ret, 0); + io_free_kiocb(iocb); +} + +static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + bool datasync) +{ + struct fsync_iocb *req = &kiocb->fsync; + + if (unlikely(sqe->addr || sqe->off || sqe->len || sqe->__resv)) + return -EINVAL; + + req->file = fget(sqe->fd); + if (unlikely(!req->file)) + return -EBADF; + if (unlikely(!req->file->f_op->fsync)) { + fput(req->file); + return -EINVAL; + } + + req->datasync = datasync; + INIT_WORK(&req->work, io_fsync_work); + schedule_work(&req->work); + return 0; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + const struct io_uring_sqe *sqe = s->sqe; + struct io_kiocb *req; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags)) + return -EINVAL; + + req = io_get_kiocb(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = -EINVAL; + if (s->index >= ctx->sq_entries) + goto out_put_req; + req->ki_index = s->index; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_READV: + ret = io_read(req, sqe); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe); + break; + case IORING_OP_FSYNC: + ret = io_fsync(req, sqe, false); + break; + case IORING_OP_FDSYNC: + ret = io_fsync(req, sqe, true); + break; + default: + ret = -EINVAL; + break; + } + + /* + * If ret is 0, ->ki_complete() has either been called, or will get + * called later on. Anything else, we need to free the req. + */ + if (ret) + goto out_put_req; + return 0; +out_put_req: + io_free_kiocb(req); + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. 
+ */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_ref(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int 
io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. + */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications asks for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. 
+ */ +SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + if (iovecs) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + kiocb_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..6d40939f65cd 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, struct iovec __user *iov, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..ae30ed41965f --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,96 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include +#include + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; + __u8 flags; + __u16 ioprio; + __s32 fd; + __u64 off; + union { + void *addr; + __u64 __pad; + }; + __u32 len; + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; +}; + +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 +#define IORING_OP_FSYNC 3 +#define IORING_OP_FDSYNC 4 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 index; /* what sqe this event came from */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). 
Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..6fbb2f40e912 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,14 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and completion IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9BACEC43387 for ; Thu, 10 Jan 2019 02:44:35 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 606D621773 for ; Thu, 10 Jan 2019 02:44:35 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="c5Gua/V/" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727114AbfAJCod (ORCPT ); Wed, 9 Jan 2019 21:44:33 -0500 Received: from mail-pg1-f193.google.com ([209.85.215.193]:36566 "EHLO mail-pg1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727083AbfAJCoc (ORCPT ); Wed, 9 Jan 2019 21:44:32 -0500 Received: by mail-pg1-f193.google.com with SMTP id n2so4178492pgm.3 for ; Wed, 09 Jan 2019 18:44:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=9NmQ4UO3QmH+HkLFK/h1DeE1y2XnZFdARIN74g9Q9Wo=; b=c5Gua/V/gtKGXZ9kRMX7Xafn5M3Qp7Hz3PRl2E6cqurqAmfD+yotRxTPOp/kohM//S rm4pjXrsxcf05e+THTtXAZGwMgpB6FLlppV6KAWvTuxRqKvJ/9b+diliXmt9MDbDHUsv WpxmuC/pMZvksySZqB/fXHMqRJr+0rg9Xr+CgsTWm/aw+ALMIQgoECBcoo/Sdg/MED2O UgnWFyW5P56JF0DGcf5gIyn/np4eU3IAIMWeJ54LfI8TutcAnLTCRA1V3tcJjfUcmokA zO+lRGd/Af0HSEaeOH/DL4E9voOzxipSbbR96ACi1+VBLeKX2ANRzVH3GW23GhlLqGKc 3CdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=9NmQ4UO3QmH+HkLFK/h1DeE1y2XnZFdARIN74g9Q9Wo=; b=VP9Paw2v+wc19rR6L89F4w6gI1axK4m3JzbLGYlCobmxH8b2Gs9TUCutH4Pa4ilSIa nHURD4Bww+QopddM0ClLDI3GqEvarQyReiuue46ccF0I0Ng0r4LNiAIym3vwzpri2rWM 
XcurodFj8ka1CDTHROa5qh/qo4V6wG0PZRVAsH/esRVDIKx3tXT8QxlkqLLOCWm/2WBw mE8+BLLd1vb3tnegbyNj9uIaBzWyfrOVWMSbdQjvpNdvyj/kyiAocRv3NYIYqv3+n9ou QKrQP5pTduE4mBD6sRPKHDCDLbemxAgy8uARvxa1hEhX7tcGX6ARQ5SpuL6AcIcnSyiu ix0g== X-Gm-Message-State: AJcUukcKuCCVgZEGBjTA0DpaTSNxcyvUsk+lDJkO+DUAdR6Q8VZzItSV 59CYbSSMYhtbyrC7VKk3EDuVtK0K5Zxhzw== X-Google-Smtp-Source: ALg8bN6iTZsowVLpkNrD7zlSN6tCUaG/Dvy1z2s5qRc+rgWc8WVsNHGm1runHbrONHLo9OCsCmmgyg== X-Received: by 2002:a63:4c5:: with SMTP id 188mr7839493pge.391.1547088270620; Wed, 09 Jan 2019 18:44:30 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id v15sm105799631pfn.94.2019.01.09.18.44.28 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jan 2019 18:44:29 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 07/15] io_uring: add submission side request cache Date: Wed, 9 Jan 2019 19:43:56 -0700 Message-Id: <20190110024404.25372-8-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190110024356.5A9uZncfQ1DldPoVV9QSJjaLJwM-mJcASdum2aI8bAA@z> We have to add each submitted polled request to the io_ring_ctx poll_submitted list, which means we have to grab the poll_lock. We already use the block plug to batch submissions if we're doing a batch of IO submissions, extend that to cover the poll requests internally as well. Signed-off-by: Jens Axboe --- fs/io_uring.c | 122 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 106 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index c872bfb32a03..f7938156552f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -113,6 +113,21 @@ struct sqe_submit { unsigned index; }; +struct io_submit_state { + struct io_ring_ctx *ctx; + + struct blk_plug plug; +#ifdef CONFIG_BLOCK + struct blk_plug_cb plug_cb; +#endif + + /* + * Polled iocbs that have been submitted, but not added to the ctx yet + */ + struct list_head req_list; + unsigned int req_count; +}; + static struct kmem_cache *kiocb_cachep; static const struct file_operations io_scqring_fops; @@ -480,21 +495,29 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } /* - * After the iocb has been issued, it's safe to be found on the poll list. - * Adding the kiocb to the list AFTER submission ensures that we don't - * find it from a io_getevents() thread before the issuer is done accessing - * the kiocb cookie. + * Called either at the end of IO submission, or through a plug callback + * because we're going to schedule. Moves out local batch of requests to + * the ctx poll list, so they can be found for polling + reaping. 
*/ -static void io_iopoll_kiocb_issued(struct io_kiocb *kiocb) +static void io_flush_state_reqs(struct io_ring_ctx *ctx, + struct io_submit_state *state) { + spin_lock(&ctx->poll_lock); + list_splice_tail_init(&state->req_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + state->req_count = 0; +} + +static void io_iopoll_iocb_add_list(struct io_kiocb *kiocb) +{ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + /* * For fast devices, IO may have already completed. If it has, add * it to the front so we find it first. We can't add to the poll_done * list as that's unlocked from the completion side. */ - const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); - struct io_ring_ctx *ctx = kiocb->ki_ctx; - spin_lock(&ctx->poll_lock); if (front) list_add(&kiocb->ki_list, &ctx->poll_submitted); @@ -503,6 +526,33 @@ static void io_iopoll_kiocb_issued(struct io_kiocb *kiocb) spin_unlock(&ctx->poll_lock); } +static void io_iopoll_iocb_add_state(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags)) + list_add(&kiocb->ki_list, &state->req_list); + else + list_add_tail(&kiocb->ki_list, &state->req_list); + + if (++state->req_count >= IO_IOPOLL_BATCH) + io_flush_state_reqs(state->ctx, state); +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. + */ +static void io_iopoll_kiocb_issued(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (!state || !IS_ENABLED(CONFIG_BLOCK)) + io_iopoll_iocb_add_list(kiocb); + else + io_iopoll_iocb_add_state(state, kiocb); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -624,7 +674,8 @@ static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, return 0; } -static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, + struct io_submit_state *state) { const struct io_uring_sqe *sqe = s->sqe; struct io_kiocb *req; @@ -673,7 +724,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) ret = -EAGAIN; goto out_put_req; } - io_iopoll_kiocb_issued(req); + io_iopoll_kiocb_issued(state, req); } return 0; out_put_req: @@ -681,6 +732,43 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) return ret; } +#ifdef CONFIG_BLOCK +static void io_state_unplug(struct blk_plug_cb *cb, bool from_schedule) +{ + struct io_submit_state *state; + + state = container_of(cb, struct io_submit_state, plug_cb); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} +#endif + +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} + +/* + * Start submission side cache. 
+ */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx) +{ + state->ctx = ctx; + INIT_LIST_HEAD(&state->req_list); + state->req_count = 0; +#ifdef CONFIG_BLOCK + state->plug_cb.callback = io_state_unplug; + blk_start_plug(&state->plug); + list_add(&state->plug_cb.list, &state->plug.cb_list); +#endif +} + static void io_inc_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring; @@ -715,11 +803,13 @@ static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx); + statep = &state; + } for (i = 0; i < to_submit; i++) { struct sqe_submit s; @@ -727,7 +817,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!io_peek_sqring(ctx, &s)) break; - ret = io_submit_sqe(ctx, &s); + ret = io_submit_sqe(ctx, &s, statep); if (ret) break; @@ -735,8 +825,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) io_inc_sqring(ctx); } - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? submit : ret; } -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3525BC43612 for ; Thu, 10 Jan 2019 02:44:36 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 0434F21738 for ; Thu, 10 Jan 2019 02:44:36 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="m09SaBnX" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727091AbfAJCoe (ORCPT ); Wed, 9 Jan 2019 21:44:34 -0500 Received: from mail-pg1-f193.google.com ([209.85.215.193]:34597 "EHLO mail-pg1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727123AbfAJCoe (ORCPT ); Wed, 9 Jan 2019 21:44:34 -0500 Received: by mail-pg1-f193.google.com with SMTP id j10so4189145pga.1 for ; Wed, 09 Jan 2019 18:44:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=scNBCeOhysecepgCupm97WAYtsMfIFMGuolJ4I0NyhI=; b=m09SaBnXPPXpKqlJalzUG+AWJI25+5/1GStXjtPIDayefmC2h4kRYImd+thX9wCm2Q bBDaCBq2ucv0KDXXbkLBo4lYXTt/CFHWNusBH1ahUKu/h50HKriF8/QBdzyYAS8pvBCO 1fysW9LyFO76ffHO89qMamsH3E4yCQPsCKU+Fu4FUJagiutYNsMFGjZv12xH6c+tVHTv 2z2iGC8NgsCVDMM5c+Ao7lRP5uR2fHcUSht9yOIIBAnqJlgiQZuBb6xXtxak89yjTZ7/ CvEaUqA9fG+027wePc8xcFqbBtmNKJOIND26i1NtSxAAAmNvwU9Z1rsEwg0nsqHKGidD SPOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 08/15] fs: add fget_many() and fput_many()
Date: Wed, 9 Jan 2019 19:43:57 -0700
Message-Id: <20190110024404.25372-9-axboe@kernel.dk>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
Message-ID: <20190110024357.e687ptOQvEWF8fF9dP62j_bdhhMUc9Kiad85MGuRYkQ@z>

Some use cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at a time. As each of these entails an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file.
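[Editorial aside, not part of the patch.] A minimal sketch of the batching pattern this enables for a caller queueing several requests against the same fd; submit_one() and struct my_req are hypothetical, the point is the single reference grab up front and the bulk put of whatever goes unused:

static int submit_batch(unsigned int fd, struct my_req *reqs, unsigned int nr)
{
	struct file *file;
	unsigned int i, done = 0;

	file = fget_many(fd, nr);		/* one atomic add covers nr references */
	if (!file)
		return -EBADF;

	for (i = 0; i < nr; i++) {
		if (submit_one(file, &reqs[i]))	/* each submitted request owns one ref */
			break;
		done++;
	}

	if (done != nr)
		fput_many(file, nr - done);	/* return the references we didn't use */

	return done ? done : -EAGAIN;
}

Each successfully submitted request later drops its own reference with a plain fput() when it completes, so the reference count balances.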
Signed-off-by: Jens Axboe --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..e0d7ce70e860 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index ccb0b7a63aa5..acaad78b6781 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count) -- 2.17.1 From mboxrd@z Thu 
Jan 1 00:00:00 1970
[66.29.188.166]) by smtp.gmail.com with ESMTPSA id v15sm105799631pfn.94.2019.01.09.18.44.33 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jan 2019 18:44:34 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 09/15] io_uring: use fget/fput_many() for file references Date: Wed, 9 Jan 2019 19:43:58 -0700 Message-Id: <20190110024404.25372-10-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190110024358.ejAB3ID1GB73yEyZ57X_kqNF_F67ymo2rD6XZib3nPo@z> On the submission side, add file reference batching to the io_submit_state. We get as many references as the number of iocbs we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). Signed-off-by: Jens Axboe --- fs/io_uring.c | 105 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 92 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index f7938156552f..cd2dfc153338 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -126,6 +126,15 @@ struct io_submit_state { */ struct list_head req_list; unsigned int req_count; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; }; static struct kmem_cache *kiocb_cachep; @@ -234,7 +243,8 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) { void *iocbs[IO_IOPOLL_BATCH]; struct io_kiocb *iocb, *n; - int to_free = 0; + int file_count, to_free = 0; + struct file *file = NULL; list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) @@ -245,10 +255,27 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) list_del(&iocb->ki_list); iocbs[to_free++] = iocb; - fput(iocb->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable. 
+ */ + if (!file) { + file = iocb->rw.ki_filp; + file_count = 1; + } else if (file == iocb->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = iocb->rw.ki_filp; + file_count = 1; + } + (*nr_events)++; } + if (file) + fput_many(file, file_count); + if (to_free) io_free_kiocb_many(ctx, iocbs, &to_free); } @@ -428,13 +455,60 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) } } -static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. + */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (!state->file) { +get_file: + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; + } + + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + + io_file_put(state, NULL); + goto get_file; +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + struct io_submit_state *state) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; - req->ki_filp = fget(sqe->fd); + req->ki_filp = io_file_get(state, sqe->fd); if (unlikely(!req->ki_filp)) return -EBADF; req->ki_pos = sqe->off; @@ -470,7 +544,7 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) } return 0; out_fput: - fput(req->ki_filp); + io_file_put(state, req->ki_filp); return ret; } @@ -553,7 +627,8 @@ static void io_iopoll_kiocb_issued(struct io_submit_state *state, io_iopoll_iocb_add_state(state, kiocb); } -static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; void __user *buf = (void __user *) (uintptr_t) sqe->addr; @@ -562,7 +637,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe); + ret = io_prep_rw(kiocb, sqe, state); if (ret) return ret; file = req->ki_filp; @@ -588,7 +663,8 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) return ret; } -static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) +static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; void __user *buf = (void __user *) (uintptr_t) sqe->addr; @@ -597,7 +673,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe) struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe); + ret = io_prep_rw(kiocb, sqe, state); if (ret) return ret; file = req->ki_filp; @@ -697,10 +773,10 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, ret = -EINVAL; switch (sqe->opcode) { case IORING_OP_READV: - ret = io_read(req, sqe); 
+ ret = io_read(req, sqe, state); break; case IORING_OP_WRITEV: - ret = io_write(req, sqe); + ret = io_write(req, sqe, state); break; case IORING_OP_FSYNC: ret = io_fsync(req, sqe, false); @@ -751,17 +827,20 @@ static void io_submit_state_end(struct io_submit_state *state) blk_finish_plug(&state->plug); if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); + io_file_put(state, NULL); } /* * Start submission side cache. */ static void io_submit_state_start(struct io_submit_state *state, - struct io_ring_ctx *ctx) + struct io_ring_ctx *ctx, unsigned max_ios) { state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->file = NULL; + state->ios_left = max_ios; #ifdef CONFIG_BLOCK state->plug_cb.callback = io_state_unplug; blk_start_plug(&state->plug); @@ -807,7 +886,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) int i, ret = 0, submit = 0; if (to_submit > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx); + io_submit_state_start(&state, ctx, to_submit); statep = &state; } -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 622F1C43387 for ; Thu, 10 Jan 2019 02:44:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3203221738 for ; Thu, 10 Jan 2019 02:44:40 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="YoeJSUYM" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727138AbfAJCoj (ORCPT ); Wed, 9 Jan 2019 21:44:39 -0500 Received: from mail-pl1-f194.google.com ([209.85.214.194]:34789 "EHLO mail-pl1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727130AbfAJCoi (ORCPT ); Wed, 9 Jan 2019 21:44:38 -0500 Received: by mail-pl1-f194.google.com with SMTP id w4so4506804plz.1 for ; Wed, 09 Jan 2019 18:44:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=pIhU8eaLRf7z3l+B8J2n3fyMzAo9urqPpN7Nm7QZZV8=; b=YoeJSUYMkoHTdYD46jvivXBRx+ZRy6kirpF0c+sCukfhm+OyXMfjPQWHvaXngD/jQZ bPJD8dCk39AnBJZodShXRk7SXV04YwjyimD1/vV0aTQAZCqFsV7o9ICEVB9Mhf+uom+F ESqvu7FVNh3YcNDE3aiZCRRWOXQeRfo/onOqvVvEt3unBIqc7S5HGgNINUofxyuJcVOH tGsOwu4bKpP/zp1Ifkam18TG59swiZF23QyQg1HpFcENSvhLb0XgarCU6NszZ+zYycAx Xgp+HFEdDzVCbosu3FoDDqrbxmp5dRHtzm6583BHjvmIPhtaHAYam8XXFJBNspoF1k0k eHgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=pIhU8eaLRf7z3l+B8J2n3fyMzAo9urqPpN7Nm7QZZV8=; b=opavYLXuK9+rFGnOT83lzHZ8nz+vncaRZFSAEdHksO7xrNUxm0arqO9WUSoFgcfqBw xYBS9vFuarookhK1gkHGqXECMDXAoAkWW2t/aovJbb0yXHMSlX7oWpfdvyI+oW/LPnEz y452cN1y00AcSdQEtRsyHz4CVjEyiY21EenJPltto1+lgI3zlSQDXJVWe8qMQExaUKIJ siksAGy91bJKPzOn2DqylfoK7gmDu+f0GbfDHvM0s7OTD8LS9B2vaCTPzB2+irrhTl5Y 
0ieAFjW9R9WD0fgJSjDpULCgmmfS4EaQmojKE68ujuD0N8b2E8OS8zumPDLYslCOeXj5 8vog== X-Gm-Message-State: AJcUukfkIKOO18FLAdk3wdUIlinPfZRi3BFJlkVTrW+2y5Vyn6XlhYjy 66Yp74X+j91te+LRv4Ae1dr0j/qqTO7Ehg== X-Google-Smtp-Source: ALg8bN5Y/ZU0JQk4+nEbRBqJyXjMWKBNi8LzwpeimUqnMyVD/fuTvue8Q8znTZcOPX8P38tVb7oDfQ== X-Received: by 2002:a17:902:9a04:: with SMTP id v4mr8751782plp.34.1547088276902; Wed, 09 Jan 2019 18:44:36 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id v15sm105799631pfn.94.2019.01.09.18.44.34 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jan 2019 18:44:35 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 10/15] io_uring: batch io_kiocb allocation Date: Wed, 9 Jan 2019 19:43:59 -0700 Message-Id: <20190110024404.25372-11-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190110024359.bl5mHeomLfjjZkPAJO4dfqoucRMUmkdfMe1QnT8Edjc@z> Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. Signed-off-by: Jens Axboe --- fs/io_uring.c | 71 +++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 52 insertions(+), 19 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index cd2dfc153338..b5233786b5a8 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -127,6 +127,13 @@ struct io_submit_state { struct list_head req_list; unsigned int req_count; + /* + * io_kiocb alloc cache + */ + void *kiocbs[IO_IOPOLL_BATCH]; + unsigned int free_kiocbs; + unsigned int cur_kiocb; + /* * File reference cache */ @@ -196,36 +203,58 @@ static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) return &ring->cqes[tail & ctx->cq_mask]; } -static struct io_kiocb *io_get_kiocb(struct io_ring_ctx *ctx) +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static struct io_kiocb *io_get_kiocb(struct io_ring_ctx *ctx, + struct io_submit_state *state) { struct io_kiocb *req; if (!percpu_ref_tryget(&ctx->refs)) return NULL; - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); - if (!req) - return NULL; - - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; - return req; -} + if (!state) + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + else if (!state->free_kiocbs) { + size_t sz; + int ret; + + sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->kiocbs)); + ret = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, sz, + state->kiocbs); + if (ret <= 0) + goto out; + state->free_kiocbs = ret - 1; + state->cur_kiocb = 1; + req = state->kiocbs[0]; + } else { + req = state->kiocbs[state->cur_kiocb]; + state->free_kiocbs--; + state->cur_kiocb++; + } -static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) -{ - percpu_ref_put_many(&ctx->refs, refs); + if (req) { + req->ki_ctx = ctx; + req->ki_flags = 0; + return req; + } - if (waitqueue_active(&ctx->wait)) - wake_up(&ctx->wait); +out: + io_ring_drop_ctx_refs(ctx, 1); + return NULL; } static void 
io_free_kiocb_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) { if (*nr) { kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs); - io_ring_drop_ctx_ref(ctx, *nr); + io_ring_drop_ctx_refs(ctx, *nr); *nr = 0; } } @@ -233,7 +262,7 @@ static void io_free_kiocb_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) static void io_free_kiocb(struct io_kiocb *iocb) { kmem_cache_free(kiocb_cachep, iocb); - io_ring_drop_ctx_ref(iocb->ki_ctx, 1); + io_ring_drop_ctx_refs(iocb->ki_ctx, 1); } /* @@ -761,7 +790,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(sqe->flags)) return -EINVAL; - req = io_get_kiocb(ctx); + req = io_get_kiocb(ctx, state); if (unlikely(!req)) return -EAGAIN; @@ -828,6 +857,9 @@ static void io_submit_state_end(struct io_submit_state *state) if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); io_file_put(state, NULL); + if (state->free_kiocbs) + kmem_cache_free_bulk(kiocb_cachep, state->free_kiocbs, + &state->kiocbs[state->cur_kiocb]); } /* @@ -839,6 +871,7 @@ static void io_submit_state_start(struct io_submit_state *state, state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->free_kiocbs = 0; state->file = NULL; state->ios_left = max_ios; #ifdef CONFIG_BLOCK @@ -1071,7 +1104,7 @@ SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, ret = __io_uring_enter(ctx, to_submit, min_complete, flags); mutex_unlock(&ctx->uring_lock); } - io_ring_drop_ctx_ref(ctx, 1); + io_ring_drop_ctx_refs(ctx, 1); out_fput: fdput(f); return ret; -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 10FCCC43387 for ; Thu, 10 Jan 2019 02:44:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id D600920665 for ; Thu, 10 Jan 2019 02:44:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="d9sWb0mu" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727141AbfAJCol (ORCPT ); Wed, 9 Jan 2019 21:44:41 -0500 Received: from mail-pl1-f196.google.com ([209.85.214.196]:42460 "EHLO mail-pl1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727142AbfAJCok (ORCPT ); Wed, 9 Jan 2019 21:44:40 -0500 Received: by mail-pl1-f196.google.com with SMTP id y1so4490274plp.9 for ; Wed, 09 Jan 2019 18:44:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=Lz/3jHz1x36txQZ/D99mDx/1sUz8ZTGETsX8P4ZemoU=; b=d9sWb0muIRwsYjCKOuKFwxHjb9LUUk3z63TSzxGh7t1xKWU2JieA/r5CA5Go3jGnki lWfTsXXQq+gzHO7BAwk7aDp3DmwWt8ODHUlIj1cICqDKVNtECKBFkvTxPDHghdKlej+Z pwlNiYLXoH+RuQJgyp5XNdcrAc/i4mlZRu0t6yYgLQ2L4Ut26EFLmDtrM+fVyRiAKqws yBcJxAbuYMr6S/dZoWcIbJIUEaT6uDWd2F8imbvVmBvkwRuNXG3m3zontBVK5oCRgZ0x DbguXDtSq3R08v23Cz+UVVPD76/fuRNkmPgFYdUKbwKxKUupBbJZ3QHYCgq8x1nx72d7 J27g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 11/15] block: implement bio helper to add iter bvec pages to bio
Date: Wed, 9 Jan 2019 19:44:00 -0700
Message-Id: <20190110024404.25372-12-axboe@kernel.dk>
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>

For an ITER_BVEC, we can just iterate the iov and add the pages to the bio
directly. This requires that the caller doesn't release the pages on IO
completion; we add a BIO_HOLD_PAGES flag for that.

The current two callers of bio_iov_iter_get_pages() are updated to check if
they need to release pages on completion. This makes them work with bvecs
that contain kernel-mapped pages already.
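For illustration, a hedged sketch of the submission side this enables (not
from this patch; bdev, bvecs, nr_segs and total_len are assumed to be set up
by the caller, which also owns the page references):

	struct iov_iter iter;
	struct bio *bio;
	int ret;

	bio = bio_alloc(GFP_KERNEL, nr_segs);
	bio_set_dev(bio, bdev);
	bio->bi_opf = REQ_OP_READ;

	/* pages are already pinned/owned by the caller, no get_user_pages() */
	iov_iter_bvec(&iter, READ, bvecs, nr_segs, total_len);

	/* flags the bio BIO_HOLD_PAGES; no extra page references are taken */
	ret = bio_iov_iter_get_pages(bio, &iter);

On completion, the end_io handler must then skip its put_page() loop when
BIO_HOLD_PAGES is set, as done for the two existing callers below.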
Signed-off-by: Jens Axboe --- block/bio.c | 59 ++++++++++++++++++++++++++++++++------- fs/block_dev.c | 5 ++-- fs/iomap.c | 5 ++-- include/linux/blk_types.h | 1 + 4 files changed, 56 insertions(+), 14 deletions(-) diff --git a/block/bio.c b/block/bio.c index 4db1008309ed..7af4f45d2ed6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + const struct bio_vec *bv = iter->bvec; + unsigned int len; + size_t size; + + len = min_t(size_t, bv->bv_len, iter->count); + size = bio_add_page(bio, bv->bv_page, len, + bv->bv_offset + iter->iov_offset); + if (size == len) { + iov_iter_advance(iter, size); + return 0; + } + + return -EINVAL; +} + #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) /** @@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } /** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * bio_iov_iter_get_pages - add user or kernel pages to a bio * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped + * @iter: iov iterator describing the region to be added + * + * This takes either an iterator pointing to user memory, or one pointing to + * kernel pages (BVEC iterator). If we're adding user pages, we pin them and + * map them into the kernel. On IO completion, the caller should put those + * pages. If we're adding kernel pages, we just have to add the pages to the + * bio directly. We don't grab an extra reference to those pages (the user + * should already have that), and we don't put the page on IO completion. + * The caller needs to check if the bio is flagged BIO_HOLD_PAGES on IO + * completion. If it isn't, then pages should be released. * - * Pins pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. * The function tries, but does not guarantee, to pin as many pages as - * fit into the bio, or are requested in *iter, whatever is smaller. - * If MM encounters an error pinning the requested pages, it stops. - * Error is returned only if 0 pages could be pinned. + * fit into the bio, or are requested in *iter, whatever is smaller. If + * MM encounters an error pinning the requested pages, it stops. Error + * is returned only if 0 pages could be pinned. */ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { + const bool is_bvec = iov_iter_is_bvec(iter); unsigned short orig_vcnt = bio->bi_vcnt; + /* + * If this is a BVEC iter, then the pages are kernel pages. Don't + * release them on IO completion. + */ + if (is_bvec) + bio_set_flag(bio, BIO_HOLD_PAGES); + do { - int ret = __bio_iov_iter_get_pages(bio, iter); + int ret; + + if (is_bvec) + ret = __bio_iov_bvec_add_pages(bio, iter); + else + ret = __bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) return bio->bi_vcnt > orig_vcnt ? 
0 : ret; @@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/fs/block_dev.c b/fs/block_dev.c index 2ebd2a0d7789..b7742014c9de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -324,8 +324,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/fs/iomap.c b/fs/iomap.c index 4ee50b76b4a1..0a64c9c51203 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5c7e7f859a24..97e206855cd3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,6 +215,7 @@ struct bio { /* * bio flags */ +#define BIO_HOLD_PAGES 0 /* don't put O_DIRECT pages */ #define BIO_SEG_VALID 1 /* bi_phys_segments valid */ #define BIO_CLONED 2 /* doesn't own data */ #define BIO_BOUNCED 3 /* bio is a bounce bio */ -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EDCD8C43444 for ; Thu, 10 Jan 2019 02:44:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B698520665 for ; Thu, 10 Jan 2019 02:44:44 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="jKci7gK8" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727161AbfAJCon (ORCPT ); Wed, 9 Jan 2019 21:44:43 -0500 Received: from mail-pf1-f196.google.com ([209.85.210.196]:40767 "EHLO mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727153AbfAJCon (ORCPT ); Wed, 9 Jan 2019 21:44:43 -0500 Received: by mail-pf1-f196.google.com with SMTP id i12so4598400pfo.7 for ; Wed, 09 Jan 2019 18:44:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=E+uVN/Gf1zNysNQvdyWHCD5E6ExnZZsJtsButMXmMw8=; b=jKci7gK8zdD/lOjcNroUZ+pROMiGPmlR6xmyFb4cuM0dpbBoPgQCGwu2E94pLVmwZy fntP9LicNkCEkQFloia/yPhz3qWXq3Ksa+zh4mclTPTBNY/VUkt9KLxJPok8AXcSCRbF hHBLNX7Mh93jt/hQl8q9NE8OX1+Y96hgucTVez+ELokdLzaMOBrJhcTj62lhkYDHybm5 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 12/15] io_uring: add support for pre-mapped user IO buffers
Date: Wed, 9 Jan 2019 19:44:01 -0700
Message-Id: <20190110024404.25372-13-axboe@kernel.dk>
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>

If we have fixed user buffers, we can map them into the kernel when we set up
the io_uring context. That avoids the need to do get_user_pages() for each
and every IO.

To utilize this feature, the application must pass in an array of iovecs that
contain the desired buffer addresses and lengths. These buffers can then be
mapped into the kernel for the lifetime of the io_uring, as opposed to just
the duration of each single IO. The application can then use
IORING_OP_{READ,WRITE}_FIXED to perform IO to these fixed locations.

It's perfectly valid to set up a larger buffer, and then sometimes only use
parts of it for an IO. As long as the range is within the originally mapped
region, it will work just fine.

A limit of 4M is imposed as the largest buffer we currently support. There's
nothing preventing us from going larger, but we need some cap, and 4M seemed
like it would definitely be big enough. RLIMIT_MEMLOCK is used to cap the
total amount of memory pinned.
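For illustration, a minimal userspace sketch of the setup side (hedged:
__NR_io_uring_setup refers to the syscall table entry added earlier in the
series, QD is an assumed queue depth, and error handling plus the ring/sqe
mmap details are omitted). Note that io_sqe_buffer_map() expects one iovec
per SQ entry, and io_import_fixed() picks the buffer by the sqe's index:

	struct io_uring_params p = { };
	struct iovec iovecs[QD];
	int i, ring_fd;

	for (i = 0; i < QD; i++) {
		if (posix_memalign(&iovecs[i].iov_base, 4096, 65536))
			return 1;
		iovecs[i].iov_len = 65536;
	}

	/* map the buffers for the lifetime of the ring */
	ring_fd = syscall(__NR_io_uring_setup, QD, iovecs, &p);

	/* later: the sqe at index i does fixed IO within buffer i */
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = file_fd;
	sqe->off = 0;
	sqe->addr = (unsigned long) iovecs[i].iov_base;
	sqe->len = 4096;	/* must stay inside the registered iovec */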
Signed-off-by: Jens Axboe --- fs/io_uring.c | 202 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 2 + 2 files changed, 196 insertions(+), 8 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index b5233786b5a8..7ab20258e39b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -23,6 +23,8 @@ #include #include #include +#include +#include #include #include @@ -53,6 +55,13 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -69,6 +78,9 @@ struct io_ring_ctx { unsigned cq_entries; unsigned cq_mask; + /* if used, fixed mapped user buffers */ + struct io_mapped_ubuf *user_bufs; + struct completion ctx_done; /* iopoll submission state */ @@ -656,11 +668,42 @@ static void io_iopoll_kiocb_issued(struct io_submit_state *state, io_iopoll_iocb_add_state(state, kiocb); } +static int io_import_fixed(int rw, struct io_kiocb *kiocb, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + struct io_ring_ctx *ctx = kiocb->ki_ctx; + struct io_mapped_ubuf *imu; + size_t len = sqe->len; + size_t offset; + int index; + + /* attempt to use fixed buffers without having provided iovecs */ + if (!ctx->user_bufs) + return -EFAULT; + + /* io_submit_sqe() already validated the index */ + index = array_index_nospec(kiocb->ki_index, ctx->sq_entries); + imu = &ctx->user_bufs[index]; + if ((unsigned long) sqe->addr < imu->ubuf || + (unsigned long) sqe->addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. + */ + offset = (unsigned long) sqe->addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - void __user *buf = (void __user *) (uintptr_t) sqe->addr; struct kiocb *req = &kiocb->rw; struct iov_iter iter; struct file *file; @@ -678,7 +721,15 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, if (unlikely(!file->f_op->read_iter)) goto out_fput; - ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (sqe->opcode == IORING_OP_READV) { + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, + &iter); + } else { + ret = io_import_fixed(READ, kiocb, sqe, &iter); + iovec = NULL; + } if (ret) goto out_fput; @@ -696,7 +747,6 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - void __user *buf = (void __user *) (uintptr_t) sqe->addr; struct kiocb *req = &kiocb->rw; struct iov_iter iter; struct file *file; @@ -714,7 +764,14 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, if (unlikely(!file->f_op->write_iter)) goto out_fput; - ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (sqe->opcode == IORING_OP_WRITEV) { + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + } else { + ret = io_import_fixed(WRITE, kiocb, sqe, &iter); + iovec = NULL; + } if (ret) goto out_fput; @@ -802,9 +859,11 @@ static int 
io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, ret = -EINVAL; switch (sqe->opcode) { case IORING_OP_READV: + case IORING_OP_READ_FIXED: ret = io_read(req, sqe, state); break; case IORING_OP_WRITEV: + case IORING_OP_WRITE_FIXED: ret = io_write(req, sqe, state); break; case IORING_OP_FSYNC: @@ -1007,6 +1066,127 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } +static void io_sqe_buffer_unmap(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return; + + for (i = 0; i < ctx->sq_entries; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; +} + +static int io_sqe_buffer_map(struct io_ring_ctx *ctx, + struct iovec __user *iovecs) +{ + unsigned long total_pages, page_limit; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + ctx->user_bufs = kcalloc(ctx->sq_entries, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + /* Don't allow more pages than we can safely lock */ + total_pages = 0; + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + for (i = 0; i < ctx->sq_entries; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = -EFAULT; + if (copy_from_user(&iov, &iovecs[i], sizeof(iov))) + goto err; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. + */ + ret = -EFAULT; + if (!iov.iov_base) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_4M) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = -ENOMEM; + if (total_pages + nr_pages > page_limit) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(pages); + pages = kmalloc(nr_pages * sizeof(struct page *), + GFP_KERNEL); + if (!pages) + goto err; + got_pages = nr_pages; + } + + imu->bvec = kmalloc(nr_pages * sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) + goto err; + + down_write(¤t->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, 1, pages, NULL); + up_write(¤t->mm->mmap_sem); + + if (pret < nr_pages) { + if (pret < 0) + ret = pret; + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + total_pages += nr_pages; + } + kfree(pages); + return 0; +err: + kfree(pages); + io_sqe_buffer_unmap(ctx); + return ret; +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring) { @@ -1027,6 +1207,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_sqe_buffer_unmap(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); } @@ -1185,7 +1366,8 @@ static void io_fill_offsets(struct io_uring_params *p) p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); } -static int io_uring_create(unsigned entries, 
struct io_uring_params *p) +static int io_uring_create(unsigned entries, struct io_uring_params *p, + struct iovec __user *iovecs) { struct io_ring_ctx *ctx; int ret; @@ -1207,6 +1389,12 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret) goto err; + if (iovecs) { + ret = io_sqe_buffer_map(ctx, iovecs); + if (ret) + goto err; + } + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); if (ret < 0) @@ -1240,10 +1428,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; - if (iovecs) - return -EINVAL; - ret = io_uring_create(entries, &p); + ret = io_uring_create(entries, &p, iovecs); if (ret < 0) return ret; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ba9e5b851f73..80d1a8224b9c 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -40,6 +40,8 @@ struct io_uring_sqe { #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 #define IORING_OP_FDSYNC 4 +#define IORING_OP_READ_FIXED 5 +#define IORING_OP_WRITE_FIXED 6 /* * IO completion data structure (Completion Queue Entry) -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5D0EFC43387 for ; Thu, 10 Jan 2019 02:44:48 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 17E542173B for ; Thu, 10 Jan 2019 02:44:48 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="f21rJ7K1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727189AbfAJCoq (ORCPT ); Wed, 9 Jan 2019 21:44:46 -0500 Received: from mail-pf1-f195.google.com ([209.85.210.195]:44434 "EHLO mail-pf1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727162AbfAJCop (ORCPT ); Wed, 9 Jan 2019 21:44:45 -0500 Received: by mail-pf1-f195.google.com with SMTP id u6so4583621pfh.11 for ; Wed, 09 Jan 2019 18:44:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=/dpaO2R3HJe56fTX4cXfsKpL2XQCbaiSgmrj4ZFB8cQ=; b=f21rJ7K13uzes1BFhnk47YebtdIOwYKg2Sdu+X8/DTAiExPuGzVcieogR/NdvuoWKy Jdv/KLdrQqafjN+tsyT1Uma+84aSkFtfkeAiiXyOCNDqkXw1HALXweokFvxASb3UgH4l OxqRFQJ3fIdemMNZQtJRyjQPibYefW+42pQFUKQLlunBefsn6LKscqIzsznhAZbAm/+k 95/1eZunmfmxL9fe3I7rFpKwjnNliBKv4dD0ZjbuS7QfdXwlSvv9TEQ2pnbnCKRHTeRT vmXoglO5ss8xntcEJiVgPejffO1J1QyW2qFb+UN/nFE+rOrJZiOMB7XGaf6sjxIPGyuy GFlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=/dpaO2R3HJe56fTX4cXfsKpL2XQCbaiSgmrj4ZFB8cQ=; b=aQtVb56/QDbvViE9G5bF6ygYdkN9LP7aivCL34Qzwpn+UBWAZCQ6erau3j82KMB6eb 8PjivwAQvPO3Qi4pVrHHqDyqdiHEkwZLvOR3VRcQ11ZsrK4DUMntgK8ctXJuk4JPPsTT 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 13/15] io_uring: support kernel side submission
Date: Wed, 9 Jan 2019 19:44:02 -0700
Message-Id: <20190110024404.25372-14-axboe@kernel.dk>
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>

Add support for backing the io_uring fd with either a thread or a workqueue,
and letting those handle the submission for us. This can be used to reduce
overhead for submission, or to always make submission async. The latter is
particularly useful for buffered aio, which is now fully async with this
feature.

For polled IO, we could have the kernel side thread hammer on the SQ ring and
submit when it finds IO. This would mean that an application would NEVER have
to enter the kernel to do IO! Didn't add this yet, but it would be trivial to
add.

If an application sets IORING_SETUP_SQTHREAD, the io_uring gets a single
thread backing. If used with buffered IO, this will limit the device queue
depth to 1, but it will be async; IOs will simply be serialized.

Or an application can set IORING_SETUP_SQWQ, in which case the urings get a
workqueue backing. The concurrency level is the minimum of twice the
available CPUs, or the queue depth specific to the context. For this mode, we
attempt to do buffered reads inline, in case they are cached. So we only punt
to a workqueue if we would have to block to get our data.

Tested with polling, no polling, fixedbufs, no fixedbufs, buffered, O_DIRECT.
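A rough sketch of how an application opts in to the offload (hedged: the
__NR_* constants refer to the syscall table entries added in this series,
and filling the SQ ring itself is omitted):

	struct io_uring_params p = {
		.flags = IORING_SETUP_SQTHREAD,	/* or IORING_SETUP_SQWQ */
		.sq_thread_cpu = 0,		/* pin the sq thread to CPU 0 */
	};
	unsigned int to_submit = 0;
	int ring_fd;

	ring_fd = syscall(__NR_io_uring_setup, 128, NULL, &p);

	/* fill sqes into the SQ ring, bumping to_submit (omitted), then kick */
	syscall(__NR_io_uring_enter, ring_fd, to_submit, 0, 0);

With IORING_SETUP_SQTHREAD, io_uring_enter() just wakes the submission
thread; with IORING_SETUP_SQWQ, each sqe is either issued inline (for reads
that can complete without blocking) or punted to the workqueue.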
See this sample application for how to use it: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe --- fs/io_uring.c | 378 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 +- 2 files changed, 369 insertions(+), 14 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 7ab20258e39b..da46872ecd67 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -25,6 +26,8 @@ #include #include #include +#include +#include #include #include @@ -62,6 +65,14 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; }; +struct io_sq_offload { + struct task_struct *thread; /* if using a thread */ + struct workqueue_struct *wq; /* wq offload */ + struct mm_struct *mm; + struct files_struct *files; + wait_queue_head_t wait; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -71,6 +82,7 @@ struct io_ring_ctx { struct io_sq_ring *sq_ring; unsigned sq_entries; unsigned sq_mask; + unsigned sq_thread_cpu; struct io_uring_sqe *sq_sqes; /* CQ ring */ @@ -81,6 +93,9 @@ struct io_ring_ctx { /* if used, fixed mapped user buffers */ struct io_mapped_ubuf *user_bufs; + /* sq ring submitter thread, if used */ + struct io_sq_offload sq_offload; + struct completion ctx_done; /* iopoll submission state */ @@ -115,6 +130,7 @@ struct io_kiocb { unsigned long ki_flags; #define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ #define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ +#define KIOCB_F_FORCE_NONBLOCK 2 /* inline submission attempt */ }; #define IO_PLUG_THRESHOLD 2 @@ -125,6 +141,12 @@ struct sqe_submit { unsigned index; }; +struct io_work { + struct work_struct work; + struct io_ring_ctx *ctx; + struct sqe_submit submit; +}; + struct io_submit_state { struct io_ring_ctx *ctx; @@ -471,6 +493,20 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, spin_unlock_irqrestore(&ctx->completion_lock, flags); } +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + /* + * for thread offload, app could already be sleeping in io_ring_enter() + * before we get to flag the error. wake them up, if needed. 
+ */ + if (ctx->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); @@ -543,7 +579,7 @@ static struct file *io_file_get(struct io_submit_state *state, int fd) } static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; @@ -567,6 +603,10 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, ret = kiocb_set_rw_flags(req, sqe->rw_flags); if (unlikely(ret)) goto out_fput; + if (force_nonblock) { + req->ki_flags |= IOCB_NOWAIT; + set_bit(KIOCB_F_FORCE_NONBLOCK, &kiocb->ki_flags); + } if (ctx->flags & IORING_SETUP_IOPOLL) { ret = -EOPNOTSUPP; @@ -701,7 +741,7 @@ static int io_import_fixed(int rw, struct io_kiocb *kiocb, } static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, - struct io_submit_state *state) + struct io_submit_state *state, bool nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -709,7 +749,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe, state); + ret = io_prep_rw(kiocb, sqe, state, nonblock); if (ret) return ret; file = req->ki_filp; @@ -734,8 +774,18 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, goto out_fput; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); - if (!ret) - io_rw_done(req, call_read_iter(file, req, &iter)); + if (!ret) { + ssize_t ret2; + + /* + * Catch -EAGAIN return for forced non-blocking submission + */ + ret2 = call_read_iter(file, req, &iter); + if (!nonblock || ret2 != -EAGAIN) + io_rw_done(req, ret2); + else + ret = -EAGAIN; + } kfree(iovec); out_fput: if (unlikely(ret)) @@ -752,7 +802,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, sqe, state); + ret = io_prep_rw(kiocb, sqe, state, false); if (ret) return ret; file = req->ki_filp; @@ -837,7 +887,7 @@ static int io_fsync(struct io_kiocb *kiocb, const struct io_uring_sqe *sqe, } static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { const struct io_uring_sqe *sqe = s->sqe; struct io_kiocb *req; @@ -860,7 +910,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, switch (sqe->opcode) { case IORING_OP_READV: case IORING_OP_READ_FIXED: - ret = io_read(req, sqe, state); + ret = io_read(req, sqe, state, force_nonblock); break; case IORING_OP_WRITEV: case IORING_OP_WRITE_FIXED: @@ -988,7 +1038,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!io_peek_sqring(ctx, &s)) break; - ret = io_submit_sqe(ctx, &s, statep); + ret = io_submit_sqe(ctx, &s, statep, false); if (ret) break; @@ -1037,15 +1087,237 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) return ring->r.head == ring->r.tail ? 
ret : 0; } +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_work *iw = container_of(work, struct io_work, work); + struct io_ring_ctx *ctx = iw->ctx; + struct io_sq_offload *sqo = &ctx->sq_offload; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + old_files = current->files; + current->files = sqo->files; + + if (sqo->mm) { + if (!mmget_not_zero(sqo->mm)) { + ret = -EFAULT; + goto err; + } + use_mm(sqo->mm); + } + + set_fs(USER_DS); + + ret = io_submit_sqe(ctx, &iw->submit, NULL, false); + + set_fs(old_fs); + if (sqo->mm) { + unuse_mm(sqo->mm); + mmput(sqo->mm); + } + +err: + if (ret) + io_fill_cq_error(ctx, &iw->submit, ret); + current->files = old_files; + kfree(iw); +} + +static int io_queue_async_work(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_work *work; + + work = kmalloc(sizeof(*work), GFP_KERNEL); + if (!work) + return -ENOMEM; + + memcpy(&work->submit, s, sizeof(*s)); + INIT_WORK(&work->work, io_sq_wq_submit_work); + work->ctx = ctx; + queue_work(ctx->sq_offload.wq, &work->work); + return 0; +} + +static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, + unsigned int nr, struct mm_struct *cur_mm, + bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) + ret = -EFAULT; + else + ret = io_submit_sqe(ctx, &sqes[i], statep, false); + if (!ret) { + submitted++; + continue; + } + + io_fill_cq_error(ctx, &sqes[i], ret); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +/* + * sq thread only supports O_DIRECT or FIXEDBUFS IO + */ +static int io_sq_thread(void *data) +{ + struct sqe_submit sqes[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct io_sq_offload *sqo = &ctx->sq_offload; + struct mm_struct *cur_mm = NULL; + struct files_struct *old_files; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + + old_files = current->files; + current->files = sqo->files; + + old_fs = get_fs(); + set_fs(USER_DS); + + while (!kthread_should_stop()) { + bool mm_fault = false; + int i; + + if (!io_peek_sqring(ctx, &sqes[0])) { + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. + */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + if (!io_peek_sqring(ctx, &sqes[0])) { + if (kthread_should_park()) + kthread_parkme(); + if (kthread_should_stop()) { + finish_wait(&sqo->wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + finish_wait(&sqo->wait, &wait); + continue; + } + finish_wait(&sqo->wait, &wait); + } + + /* If ->mm is set, we're not doing FIXEDBUFS */ + if (sqo->mm && !cur_mm) { + mm_fault = !mmget_not_zero(sqo->mm); + if (!mm_fault) { + use_mm(sqo->mm); + cur_mm = sqo->mm; + } + } + + i = 0; + do { + if (i == ARRAY_SIZE(sqes)) + break; + i++; + io_inc_sqring(ctx); + } while (io_peek_sqring(ctx, &sqes[i])); + + io_submit_sqes(ctx, sqes, i, cur_mm, mm_fault); + } + current->files = old_files; + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + +/* + * If this is a read, try a cached inline read first. 
If the IO is in the + * page cache, we can satisfy it without blocking and without having to + * punt to a threaded execution. This is much faster, particularly for + * lower queue depth IO, and it's always a lot more efficient. + */ +static bool io_sq_try_inline(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + int ret; + + if (s->sqe->opcode != IORING_OP_READV && + s->sqe->opcode != IORING_OP_READ_FIXED) + return false; + + ret = io_submit_sqe(ctx, s, NULL, true); + + /* + * If we get -EAGAIN, return false to submit out-of-line. Any other + * result and we're done, caller will fill in CQ ring event. + */ + return ret != -EAGAIN; +} + +static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + struct sqe_submit s; + int ret, queued; + + ret = queued = 0; + while (io_peek_sqring(ctx, &s)) { + ret = io_sq_try_inline(ctx, &s); + if (!ret) { + ret = io_queue_async_work(ctx, &s); + if (ret) + break; + } + io_inc_sqring(ctx); + queued++; + if (queued == to_submit) + break; + } + + return queued ? queued : ret; +} + static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, unsigned min_complete, unsigned flags) { int ret = 0; if (to_submit) { - ret = io_ring_submit(ctx, to_submit); - if (ret < 0) - return ret; + /* + * Three options here: + * 1) We have an sq thread, just wake it up to do submissions + * 2) We have an sq wq, queue a work item for each sqe + * 3) Submit directly + */ + if (ctx->flags & IORING_SETUP_SQTHREAD) { + wake_up(&ctx->sq_offload.wait); + ret = to_submit; + } else if (ctx->flags & IORING_SETUP_SQWQ) { + ret = io_sq_wq_submit(ctx, to_submit); + } else { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -1187,6 +1459,77 @@ static int io_sqe_buffer_map(struct io_ring_ctx *ctx, return ret; } +static int io_sq_thread(void *); + +static int io_sq_thread_start(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + int ret; + + memset(sqo, 0, sizeof(*sqo)); + init_waitqueue_head(&sqo->wait); + + if (!ctx->user_bufs) + sqo->mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. 
+ */ + ret = -EBADF; + sqo->files = current->files; + if (!sqo->files) + goto err; + + if (ctx->flags & IORING_SETUP_SQTHREAD) { + sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, + ctx->sq_thread_cpu, + "io_uring-sq"); + if (IS_ERR(sqo->thread)) { + ret = PTR_ERR(sqo->thread); + sqo->thread = NULL; + goto err; + } + wake_up_process(sqo->thread); + } else if (ctx->flags & IORING_SETUP_SQWQ) { + int concurrency; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + concurrency = min(ctx->sq_entries - 1, 2 * num_online_cpus()); + sqo->wq = alloc_workqueue("io_ring-wq", + WQ_UNBOUND | WQ_FREEZABLE, + concurrency); + if (!sqo->wq) { + ret = -ENOMEM; + goto err; + } + } + + return 0; +err: + if (sqo->files) + sqo->files = NULL; + if (sqo->mm) + sqo->mm = NULL; + return ret; +} + +static void io_sq_thread_stop(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + + if (sqo->thread) { + kthread_park(sqo->thread); + kthread_stop(sqo->thread); + sqo->thread = NULL; + } else if (sqo->wq) { + destroy_workqueue(sqo->wq); + sqo->wq = NULL; + } +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring) { @@ -1205,6 +1548,7 @@ static void io_free_scq_urings(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { + io_sq_thread_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); io_sqe_buffer_unmap(ctx); @@ -1394,6 +1738,13 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; } + if (p->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) { + ctx->sq_thread_cpu = p->sq_thread_cpu; + + ret = io_sq_thread_start(ctx); + if (ret) + goto err; + } ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); @@ -1426,7 +1777,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags & ~IORING_SETUP_IOPOLL) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQTHREAD | + IORING_SETUP_SQWQ)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 80d1a8224b9c..79004940f7da 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -35,6 +35,8 @@ struct io_uring_sqe { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ +#define IORING_SETUP_SQTHREAD (1 << 1) /* Use SQ thread */ +#define IORING_SETUP_SQWQ (1 << 2) /* Use SQ workqueue */ #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 @@ -95,7 +97,8 @@ struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u16 resv[10]; + __u16 sq_thread_cpu; + __u16 resv[9]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; }; -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A6FD1C43612 for ; Thu, 10 Jan 2019 02:44:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 6D1D921738 for ; Thu, 10 Jan 2019 02:44:49 +0000 (UTC) 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 14/15] io_uring: add submission polling
Date: Wed, 9 Jan 2019 19:44:03 -0700
Message-Id: <20190110024404.25372-15-axboe@kernel.dk>
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new events and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.

For O_DIRECT, we can do this with just SQTHREAD being enabled. For
buffered aio, we need the workqueue as well. If we can satisfy the
buffered inline from the SQTHREAD, we do that.
If not, we punt to the workqueue. This is just like buffered aio off the io_uring_enter(2) system call. Proof of concept. If the thread has been idle for 1 second, it will set sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard it's io_uring_enter(2) call with: barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, to_submit, 0, 0); instead of calling it unconditionally. Improvements: 1) Maybe have smarter backoff. Busy loop for X time, then go to monitor/mwait, finally the schedule we have now after an idle second. Might not be worth the complexity. 2) Probably want the application to pass in the appropriate grace period, not hard code it at 1 second. Signed-off-by: Jens Axboe --- fs/io_uring.c | 102 +++++++++++++++++++++++++++++++--- include/uapi/linux/io_uring.h | 3 + 2 files changed, 97 insertions(+), 8 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index da46872ecd67..6c62329b00ec 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -67,6 +67,7 @@ struct io_mapped_ubuf { struct io_sq_offload { struct task_struct *thread; /* if using a thread */ + bool thread_poll; struct workqueue_struct *wq; /* wq offload */ struct mm_struct *mm; struct files_struct *files; @@ -1145,17 +1146,35 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, { struct io_submit_state state, *statep = NULL; int ret, i, submitted = 0; + bool nonblock; if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr); statep = &state; } + /* + * Having both a thread and a workqueue only makes sense for buffered + * IO, where we can't submit in an async fashion. Use the NOWAIT + * trick from the SQ thread, and punt to the workqueue if we can't + * satisfy this iocb without blocking. This is only necessary + * for buffered IO with sqthread polled submission. + */ + nonblock = (ctx->flags & IORING_SETUP_SQWQ) != 0; + for (i = 0; i < nr; i++) { - if (unlikely(mm_fault)) + if (unlikely(mm_fault)) { ret = -EFAULT; - else - ret = io_submit_sqe(ctx, &sqes[i], statep, false); + } else { + ret = io_submit_sqe(ctx, &sqes[i], statep, nonblock); + /* nogo, submit to workqueue */ + if (nonblock && ret == -EAGAIN) + ret = io_queue_async_work(ctx, &sqes[i]); + if (!ret) { + submitted++; + continue; + } + } if (!ret) { submitted++; continue; @@ -1171,7 +1190,10 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, } /* - * sq thread only supports O_DIRECT or FIXEDBUFS IO + * SQ thread is woken if the app asked for offloaded submission. This can + * be either O_DIRECT, in which case we do submissions directly, or it can + * be buffered IO, in which case we do them inline if we can do so without + * blocking. If we can't, then we punt to a workqueue. */ static int io_sq_thread(void *data) { @@ -1182,6 +1204,8 @@ static int io_sq_thread(void *data) struct files_struct *old_files; mm_segment_t old_fs; DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; old_files = current->files; current->files = sqo->files; @@ -1189,11 +1213,49 @@ static int io_sq_thread(void *data) old_fs = get_fs(); set_fs(USER_DS); + timeout = inflight = 0; while (!kthread_should_stop()) { bool mm_fault = false; int i; + if (sqo->thread_poll && inflight) { + unsigned int nr_events = 0; + + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. 
+ */ + if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * App should not use IORING_ENTER_GETEVENTS + * with thread polling, but if it does, then + * ensure we are mutually exclusive. + */ + if (mutex_trylock(&ctx->uring_lock)) { + io_iopoll_check(ctx, &nr_events, 0); + mutex_unlock(&ctx->uring_lock); + } + } else { + nr_events = inflight; + } + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + HZ; + } + if (!io_peek_sqring(ctx, &sqes[0])) { + /* + * If we're polling, let us spin for a second without + * work before going to sleep. + */ + if (sqo->thread_poll) { + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + } + /* * Drop cur_mm before scheduling, we can't hold it for * long periods (or over schedule()). Do this before @@ -1207,6 +1269,13 @@ static int io_sq_thread(void *data) } prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + if (sqo->thread_poll) { + ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + if (!io_peek_sqring(ctx, &sqes[0])) { if (kthread_should_park()) kthread_parkme(); @@ -1218,6 +1287,13 @@ static int io_sq_thread(void *data) flush_signals(current); schedule(); finish_wait(&sqo->wait, &wait); + + if (sqo->thread_poll) { + struct io_sq_ring *ring; + + ring = ctx->sq_ring; + ring->flags &= ~IORING_SQ_NEED_WAKEUP; + } continue; } finish_wait(&sqo->wait, &wait); @@ -1240,7 +1316,7 @@ static int io_sq_thread(void *data) io_inc_sqring(ctx); } while (io_peek_sqring(ctx, &sqes[i])); - io_submit_sqes(ctx, sqes, i, cur_mm, mm_fault); + inflight += io_submit_sqes(ctx, sqes, i, cur_mm, mm_fault); } current->files = old_files; set_fs(old_fs); @@ -1483,6 +1559,9 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) if (!sqo->files) goto err; + if (ctx->flags & IORING_SETUP_SQPOLL) + sqo->thread_poll = true; + if (ctx->flags & IORING_SETUP_SQTHREAD) { sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, ctx->sq_thread_cpu, @@ -1493,7 +1572,8 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) goto err; } wake_up_process(sqo->thread); - } else if (ctx->flags & IORING_SETUP_SQWQ) { + } + if (ctx->flags & IORING_SETUP_SQWQ) { int concurrency; /* Do QD, or 2 * CPUS, whatever is smallest */ @@ -1524,7 +1604,8 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx) kthread_park(sqo->thread); kthread_stop(sqo->thread); sqo->thread = NULL; - } else if (sqo->wq) { + } + if (sqo->wq) { destroy_workqueue(sqo->wq); sqo->wq = NULL; } @@ -1738,6 +1819,11 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; } + if ((p->flags & IORING_SETUP_SQPOLL) && + !(p->flags & IORING_SETUP_SQTHREAD)) { + ret = -EINVAL; + goto err; + } if (p->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) { ctx->sq_thread_cpu = p->sq_thread_cpu; @@ -1778,7 +1864,7 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, } if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQTHREAD | - IORING_SETUP_SQWQ)) + IORING_SETUP_SQWQ | IORING_SETUP_SQPOLL)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 79004940f7da..9321eb97479d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -37,6 +37,7 @@ struct io_uring_sqe { #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ #define IORING_SETUP_SQTHREAD (1 << 1) /* Use SQ thread */ #define IORING_SETUP_SQWQ (1 << 2) /* Use SQ workqueue */ +#define 
IORING_SETUP_SQPOLL (1 << 3) /* SQ thread polls */ #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 @@ -75,6 +76,8 @@ struct io_sqring_offsets { __u32 resv[3]; }; +#define IORING_SQ_NEED_WAKEUP (1 << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1E632C43612 for ; Thu, 10 Jan 2019 02:44:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id E36BC21738 for ; Thu, 10 Jan 2019 02:44:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="mjbJtvrf" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727199AbfAJCot (ORCPT ); Wed, 9 Jan 2019 21:44:49 -0500 Received: from mail-pg1-f196.google.com ([209.85.215.196]:45686 "EHLO mail-pg1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727190AbfAJCot (ORCPT ); Wed, 9 Jan 2019 21:44:49 -0500 Received: by mail-pg1-f196.google.com with SMTP id y4so4160992pgc.12 for ; Wed, 09 Jan 2019 18:44:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=NA7bxp3FsqdCNDD3mLzXDoV5LzMApWnvX0Tn1Xg1AHs=; b=mjbJtvrf70R1VXZEXb5lfZMvLIy7Esqml9CkARVFQEJX3Q3U84rUlbQqLQreaXMzfj VNhpHP6qwqxzLVB9oUnDRycDUNDSybEzsZWFevhaGh4/jUahmTLa2EuREKgk4PNHmc8M F9lcYfcf0s+xBlqfcl12Jw1+C1/fEAm5PuNB8GRLu1/ty+6+D3k74ucUJfXvSa/50dwm kgwuhvSsojfE31F/IlpgP2t5gw+ea+EhQvDyRu8sCqtzOBsWzoHBfsjdNxqzihC+9V3v 642fk7p+xp2ADY0vBf5mkwFsm0C0rr6uP2RSpMOMfm+nYtfalxZ9IGo0ZkdJsFZr3uHu p7Kg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=NA7bxp3FsqdCNDD3mLzXDoV5LzMApWnvX0Tn1Xg1AHs=; b=uL9zCWRi+eJZynLR5GW+nVRGhKtGXP/ox9RqCDyyp69unP+LGlUNP6ZPMfbY9cZRuB TKSFimgBijKoAYIov0bYkhHGcIBWy3GiTkrQNHbuhecaDV4JF3qu24eaVWm3qnzOC4Ca TsAldhwOiWjDzp0nZloBiEDY0zFsvQGjgKIpG4RaxpCzb8bGNYJkAoR5s9VM0AEe5fhT 5Y2tvXY/jD0JLZ6leBNablbZrEnprbKrPWGHTisnrIVjX3YKOG//FZMG9xN2g5wFHfYd BGraXnIqgFEHwvaWgjVSFW+lV7B87tkT4GPzqHfEARiJeEm312gU6FwbiiTK5cUVrnzX h6cA== X-Gm-Message-State: AJcUukc8bSGT+/pHnPsa14rRo4Ghm3/jwwByVRGdzmu6XlBo0krahCuW /jy+/v0md3q77DOLbhvfYBHNbCRwv49nLg== X-Google-Smtp-Source: ALg8bN6UBaTSzG4vSHOJFEwqJkNmGUkFXzeP9UV+qoqmAkpoxyExfL8X8/yVHySLhsODbBAlAXs4Lg== X-Received: by 2002:a65:4904:: with SMTP id p4mr7812461pgs.384.1547088287516; Wed, 09 Jan 2019 18:44:47 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. 
[66.29.188.166]) by smtp.gmail.com with ESMTPSA id v15sm105799631pfn.94.2019.01.09.18.44.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 09 Jan 2019 18:44:46 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 15/15] io_uring: add io_uring_event cache hit information Date: Wed, 9 Jan 2019 19:44:04 -0700 Message-Id: <20190110024404.25372-16-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk> References: <20190110024404.25372-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190110024404.gpNT8aEGRHsG39BHB_I_6IIox6viNdo5w82OqH3zC8Q@z> Add hint on whether a read was served out of the page cache, or if it hit media. This is useful for buffered async IO, O_DIRECT reads would never have this set (for obvious reasons). If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT set. Signed-off-by: Jens Axboe --- fs/io_uring.c | 7 ++++++- include/uapi/linux/io_uring.h | 5 +++++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c62329b00ec..f2a603c447ba 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -511,11 +511,16 @@ static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + unsigned ev_flags = 0; kiocb_end_write(kiocb); fput(kiocb->ki_filp); - io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); + + if (res > 0 && test_bit(KIOCB_F_FORCE_NONBLOCK, &iocb->ki_flags)) + ev_flags = IOCQE_FLAG_CACHEHIT; + + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, ev_flags); io_free_kiocb(iocb); } diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 9321eb97479d..20e4c22e040d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -55,6 +55,11 @@ struct io_uring_cqe { __u32 flags; }; +/* + * io_uring_event->flags + */ +#define IOCQE_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + /* * Magic offsets for the application to mmap the data it needs */ -- 2.17.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 26004C43444 for ; Thu, 10 Jan 2019 23:12:52 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id F2FF7213F2 for ; Thu, 10 Jan 2019 23:12:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728292AbfAJXMv (ORCPT ); Thu, 10 Jan 2019 18:12:51 -0500 Received: from mx1.redhat.com ([209.132.183.28]:38968 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726149AbfAJXMu (ORCPT ); Thu, 10 Jan 2019 18:12:50 -0500 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA 
(256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id AB8368E66F; Thu, 10 Jan 2019 23:12:50 +0000 (UTC) Received: from segfault.boston.devel.redhat.com (segfault.boston.devel.redhat.com [10.19.60.26]) by smtp.corp.redhat.com (Postfix) with ESMTPS id E1D1760C67; Thu, 10 Jan 2019 23:12:49 +0000 (UTC) From: Jeff Moyer To: Jens Axboe Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, avi@scylladb.com Subject: Re: [PATCH 15/15] io_uring: add io_uring_event cache hit information References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-16-axboe@kernel.dk> X-PGP-KeyID: 1F78E1B4 X-PGP-CertKey: F6FE 280D 8293 F72C 65FD 5A58 1FF8 A7CA 1F78 E1B4 Date: Thu, 10 Jan 2019 18:12:49 -0500 In-Reply-To: <20190110024404.25372-16-axboe@kernel.dk> (Jens Axboe's message of "Wed, 9 Jan 2019 19:44:04 -0700") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Thu, 10 Jan 2019 23:12:50 +0000 (UTC) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190110231249.AChwqkbcujgJddSVpuBY4-hzjv1Eva5gnRSXnOsOqYE@z> Jens Axboe writes: > Add hint on whether a read was served out of the page cache, or if it > hit media. This is useful for buffered async IO, O_DIRECT reads would > never have this set (for obvious reasons). > > If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT > set. We may want to hold off on this one until the whole mincore/RWF_NOWAIT debate is sorted. 
[1]

Cheers,
Jeff

[1] https://lore.kernel.org/lkml/20190109022430.GE27534@dastard/

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 15/15] io_uring: add io_uring_event cache hit information
To: Jeff Moyer
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, avi@scylladb.com
References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-16-axboe@kernel.dk>
From: Jens Axboe
Date: Thu, 10 Jan 2019 16:47:28 -0700
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

On 1/10/19 4:12 PM, Jeff Moyer wrote:
> Jens Axboe writes:
>
>> Add hint on whether a read was served out of the page cache, or if it
>> hit media. This is useful for buffered async IO, O_DIRECT reads would
>> never have this set (for obvious reasons).
>>
>> If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
>> set.
>
> We may want to hold off on this one until the whole mincore/RWF_NOWAIT
> debate is sorted. [1]

Definitely, it's why it's separate and at the end of the series. But in
reality, this doesn't leak anything that timing doesn't already tell you.
So it's kind of a moot point.

-- 
Jens Axboe

From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 11 Jan 2019 10:46:01 +0100
From: Roman Penyaev
To: Jens Axboe
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org
Subject: Re: [PATCHSET v2] io_uring IO interface
In-Reply-To: <20190110024404.25372-1-axboe@kernel.dk>
References: <20190110024404.25372-1-axboe@kernel.dk>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: X-Mailing-List:
linux-fsdevel@vger.kernel.org Message-ID: <20190111094601.k9rjkP53CUHsoKOixKFrvad0pTjbc9cdWhnfOgJ529E@z> Hi Jens, That is interesting. Recently I sent an rfc related to epoll uring: https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de which can be mapped to userspace and all ready events can be consumed from it directly. I am wondering, is it possible to make some common API for all kind of ready events / urings, or it doesn't make any sense? -- Roman On 2019-01-10 03:43, Jens Axboe wrote: > Here's v2 of the io_uring interface. See the v1 posting for some more > info: > > https://lore.kernel.org/linux-block/20190108165645.19311-1-axboe@kernel.dk/ > > The data structures changed, to improve the symmetry of the submission > and completion side. The io_uring_iocb is now io_uring_sqe, but it > otherwise remains the same as before. Ditto on the completion side, > where io_uring_event is now io_uring_cqe. > > I've updated the fio io_uring test app, and the io_uring engine. The > liburing git repo has also been adapted to the various changes since > the > v1 posting. As a reminder, the liburing git repo contains some helpers > for doing IO without having to muck with the ring directly, setting up > an io_uring context, etc. Clone that here: > > git://git.kernel.dk/liburing > > In terms of usage, there's also a small test app here: > > http://git.kernel.dk/cgit/fio/plain/t/io_uring.c > > and the liburing repo has a few test apps in test/ as well. > > Patches are aginst 5.0-rc1, but can also be found here: > > git://git.kernel.dk/linux-block io_uring > > Changes since v1: > > - Fail IORING_OP_{READ,WRITE}_FIXED if not configured > - Fix ctx drop ref issue on failure to close ring_fd when sq thread/wq > are in use > - Move to separate Kconfig entry (CONFIG_IO_URING) > - Add SPDX headers > - Drop gcc ism of zero sized arrays > - Rename io_uring_iocb -> io_uring_sqe > - Rename io_uring_event -> io_uring_cqe > - Drop needless io_event_ring and io_iocb_ring structures > - Drop ctx->max_reqs, use ->sq_entries > - Drop unused ->ring_lock > - Drop io_ring_ctx slab cache > - Fix state batched kiocb alloc failure to put ctx > - Fix missing write ordering barrier when filling in the cqe > - Drop io_req_init() > - Various renames > - Fix a few lines that were too long > - Address other minor review comments > - Fix IORING_SETUP_SQPOLL being set without IORING_SETUP_SQTHREAD > - Drop IORING_SETUP_FIXEDBUFS, iovecs being non-NULL is enough > - Fix error handling free of ctx in setup path > - Change standard read/write commands to be iov based READV/WRITEV > - Pass in struct sqe_submit instead of separate sqe/index everywhere > - Fix reap of polled events on fops->release() > - Lock uring for sq thread polling > - Don't grab ->completion_lock for polled IO cqe filling > - Fix ev_flags vs flags typo > - Consolidate parts of the io_ring_ctx alignment > > Documentation/filesystems/vfs.txt | 3 + > arch/x86/entry/syscalls/syscall_64.tbl | 2 + > block/bio.c | 59 +- > fs/Makefile | 1 + > fs/block_dev.c | 19 +- > fs/file.c | 15 +- > fs/file_table.c | 9 +- > fs/gfs2/file.c | 2 + > fs/io_uring.c | 1890 ++++++++++++++++++++++++ > fs/iomap.c | 48 +- > fs/xfs/xfs_file.c | 1 + > include/linux/bio.h | 14 + > include/linux/blk_types.h | 1 + > include/linux/file.h | 2 + > include/linux/fs.h | 6 +- > include/linux/iomap.h | 1 + > include/linux/syscalls.h | 5 + > include/uapi/linux/io_uring.h | 114 ++ > init/Kconfig | 8 + > kernel/sys_ni.c | 2 + > 20 files changed, 2163 insertions(+), 39 deletions(-) From mboxrd@z 
Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.1 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 54E7EC43387 for ; Fri, 11 Jan 2019 16:12:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 2185020700 for ; Fri, 11 Jan 2019 16:12:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="tIGcLPO3" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387699AbfAKQMJ (ORCPT ); Fri, 11 Jan 2019 11:12:09 -0500 Received: from mail-it1-f196.google.com ([209.85.166.196]:55736 "EHLO mail-it1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728944AbfAKQMI (ORCPT ); Fri, 11 Jan 2019 11:12:08 -0500 Received: by mail-it1-f196.google.com with SMTP id m62so3260360ith.5; Fri, 11 Jan 2019 08:12:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=nlBzsEs/XxBusvnYWKzL1PBITBpfByToXfynZ72hRww=; b=tIGcLPO3tNyxvASaaMmclrPxueuO9Mqp4FzY0bFDxrFLDXXzWkBmGF0NIBAxahwWAh zdfW6NBbj3Xu5n90AtMVUb8slD1unzPb3TZiSXkcI/QBIY4G98wf0RjreGB3ebb0y7sE C9BPS1KBEpi+JFK+WsUCCW/+41k6fRUU/TkvjY5qN/OQJJCqNwO19T9ndD/NfxT/6/49 lLz3cteMGH9s9Sx6G8hkj9CACYJPkfG1b0gGpkBW4uNqGXv4Oz7NhbUdxYPVRMJ0ID3x bSzA2NBbSPQhoihuQZd58lMLy789oDYkq8n0eJyQnsgJZnxA4tXpRBWU4F58AcpQPw3f oh1g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=nlBzsEs/XxBusvnYWKzL1PBITBpfByToXfynZ72hRww=; b=RKOzDtDLPFaBh5xgiIeh+fsOMxdsMEYIHY9N00zv4LDxf9vnxY/kANaCCMPK+O9vVn IAZzRTWv9yT1w+/AJe/FbPGY5bQZhiym1mj59ngyiiDAhLAETb9em24VDAy5zF96BbHc cuyzbT0BRTrkLndqB3KbFGZti5kpANVBePNPohRONVrIF05Vg/U8DRbnBH3bZUL7oZW8 6AWKHO39bWRWVPCf5JM3/P+Asgwt+ETLGyvtM9ot/gTio+gOBZl2hsnseWCgHP7xOLmJ lJOqCz+NXPJGOhs20lrj1Ff9oM1KtIxS85TfyykrEKRP8dj+d0T1cEOK8ROu577edqy/ 4pMQ== X-Gm-Message-State: AJcUukddQAA8s/44aGhOMeWpTdVlYXZ6YSyfVJ+YvgIFlzPHfdf7WDze FvOYaL1CeBZ3I5xc6MyX5VVS4CnvGTdsLBPgF8s= X-Google-Smtp-Source: ALg8bN6KyNJEZkJomzRQWBOwyrZfdzQYREHKE/HOgkQBqA7CXQdqBzty3YXh6x4hwKucnEFX7vr3tjEXKBUDalkJkY0= X-Received: by 2002:a24:d3cb:: with SMTP id n194mr1530248itg.57.1547223128088; Fri, 11 Jan 2019 08:12:08 -0800 (PST) MIME-Version: 1.0 References: <20190110024404.25372-1-axboe@kernel.dk> In-Reply-To: From: Ilya Dryomov Date: Fri, 11 Jan 2019 17:11:57 +0100 Message-ID: Subject: Re: [PATCHSET v2] io_uring IO interface To: Roman Penyaev Cc: Jens Axboe , linux-fsdevel , linux-aio@kvack.org, linux-block , linux-arch@vger.kernel.org, Christoph Hellwig , jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190111161157.si3x3AvhQWu5dQk4GpjK-1SfMlQy3bpN08HwgwYd6kQ@z> On Fri, Jan 11, 2019 at 10:51 AM Roman Penyaev wrote: > > Hi Jens, > > That is interesting. 
Recently I sent an rfc related to epoll uring: > > https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de > > which can be mapped to userspace and all ready events can be consumed > from it directly. I am wondering, is it possible to make some common > API for all kind of ready events / urings, or it doesn't make any > sense? I think you can use the new IOCB_CMD_POLL from Christoph and avoid epoll_wait() in favor of aio/io_uring interface, at least in new high performance applications. Reaping events entirely in userspace (i.e. performing io_getevents() without entering the kernel) has been possible for a long time even with the existing aio interface. Thanks, Ilya From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.5 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,USER_AGENT_MUTT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8C483C43612 for ; Fri, 11 Jan 2019 16:21:47 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 67CF12084C for ; Fri, 11 Jan 2019 16:21:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731478AbfAKQVq (ORCPT ); Fri, 11 Jan 2019 11:21:46 -0500 Received: from verein.lst.de ([213.95.11.211]:57024 "EHLO newverein.lst.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729961AbfAKQVq (ORCPT ); Fri, 11 Jan 2019 11:21:46 -0500 Received: by newverein.lst.de (Postfix, from userid 2407) id A8E3168DD2; Fri, 11 Jan 2019 17:21:43 +0100 (CET) Date: Fri, 11 Jan 2019 17:21:43 +0100 From: Christoph Hellwig To: Ilya Dryomov Cc: Roman Penyaev , Jens Axboe , linux-fsdevel , linux-aio@kvack.org, linux-block , linux-arch@vger.kernel.org, Christoph Hellwig , jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org Subject: Re: [PATCHSET v2] io_uring IO interface Message-ID: <20190111162143.GA14914@lst.de> References: <20190110024404.25372-1-axboe@kernel.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.17 (2007-11-01) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190111162143.yQI1dJDhL0Uj4GTq-AmEtdwyOnZsYAIH4ejvO_fQeSc@z> On Fri, Jan 11, 2019 at 05:11:57PM +0100, Ilya Dryomov wrote: > I think you can use the new IOCB_CMD_POLL from Christoph and avoid > epoll_wait() in favor of aio/io_uring interface, at least in new high > performance applications. Reaping events entirely in userspace (i.e. > performing io_getevents() without entering the kernel) has been > possible for a long time even with the existing aio interface. For io_uring we can reuse the IOCB_CMD_POLL concept, but we'd have to add a new cancel command, as the uring right now doesn't support cancelation. But I'd rather make that command a new opcode instead of a separate syscall, which would lead to a nicer design. A prototype for this should be fairly easy, I'd just want someone to actually use it for real life testing, like ScyllaDB does for IOCB_CMD_POLL. 
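As a rough sketch of the shape being discussed here (poll armed and cancelled through the submission ring itself, with cancellation as just another opcode instead of a separate syscall), the userspace side might look like the snippet below. The opcode names, the key-based matching and the stand-in sqe layout are invented for illustration; this patchset defines none of them.

/*
 * Hypothetical illustration only: poll add/remove expressed as sqe
 * opcodes.  Neither opcode nor the sqe layout below exists in this
 * patchset; names and fields are made up to show the shape.
 */
#include <stdint.h>
#include <string.h>

#define SKETCH_OP_POLL_ADD	6	/* hypothetical opcode */
#define SKETCH_OP_POLL_REMOVE	7	/* hypothetical opcode */

struct sqe_sketch {			/* stand-in for struct io_uring_sqe */
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
	uint64_t off;
	uint64_t addr;			/* reused as a user-chosen match key */
	uint32_t len;
	uint32_t op_flags;		/* reused to carry the poll mask */
};

/* Arm a poll request on fd; 'key' lets a later remove find it. */
static void prep_poll_add(struct sqe_sketch *sqe, int fd, uint32_t poll_mask,
			  uint64_t key)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = SKETCH_OP_POLL_ADD;
	sqe->fd = fd;
	sqe->addr = key;
	sqe->op_flags = poll_mask;	/* e.g. POLLIN from <poll.h> */
}

/* Cancel a previously armed poll, matched by the same key. */
static void prep_poll_remove(struct sqe_sketch *sqe, uint64_t key)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = SKETCH_OP_POLL_REMOVE;
	sqe->addr = key;
}

Both entries would go into the SQ ring like any read or write, and the cancel would complete with its own cqe, which is what lets it reuse the existing submission/completion model rather than adding another syscall.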
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64AC3C43444 for ; Fri, 11 Jan 2019 16:39:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 3508920872 for ; Fri, 11 Jan 2019 16:39:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732237AbfAKQjh (ORCPT ); Fri, 11 Jan 2019 11:39:37 -0500 Received: from mx2.suse.de ([195.135.220.15]:43374 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1731998AbfAKQjg (ORCPT ); Fri, 11 Jan 2019 11:39:36 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 1B753AEF5; Fri, 11 Jan 2019 16:39:35 +0000 (UTC) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Fri, 11 Jan 2019 17:39:33 +0100 From: Roman Penyaev To: Ilya Dryomov Cc: Jens Axboe , linux-fsdevel , linux-aio@kvack.org, linux-block , linux-arch@vger.kernel.org, Christoph Hellwig , jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org Subject: Re: [PATCHSET v2] io_uring IO interface In-Reply-To: References: <20190110024404.25372-1-axboe@kernel.dk> Message-ID: <43189f1be5f03697f750631d31ffaed5@suse.de> X-Sender: rpenyaev@suse.de User-Agent: Roundcube Webmail Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190111163933.Vw7LrysmBd4nnMy9KO5j22nv3VbVAF53RZ_tSvOcSZQ@z> On 2019-01-11 17:11, Ilya Dryomov wrote: > On Fri, Jan 11, 2019 at 10:51 AM Roman Penyaev > wrote: >> >> Hi Jens, >> >> That is interesting. Recently I sent an rfc related to epoll uring: >> >> https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de >> >> which can be mapped to userspace and all ready events can be consumed >> from it directly. I am wondering, is it possible to make some common >> API for all kind of ready events / urings, or it doesn't make any >> sense? > > I think you can use the new IOCB_CMD_POLL from Christoph and avoid > epoll_wait() in favor of aio/io_uring interface, at least in new high > performance applications. Yeah, I saw this extension for aio from Christoph. I was motivated to extend epoll with uring to avoid constant recharging of file descriptors, i.e. once you inserted descriptor to epoll you just consume events from uring (of course in that particular case only edge triggered events are supported). Also recently for epoll I fixed contention on event callback making hot path completely lockless, i.e. with uring epoll can become a nice thingy in terms of performance. But can any descriptor (on vfs layer) be extended to have a uring? To have some common API? Then if event source (say socket) has a uring (do not know how, just thoughts) and event destination (aio, epoll) has a uring, then reading on userside can be a matter of traversing urings. > Reaping events entirely in userspace (i.e. 
> performing io_getevents() without entering the kernel) has been > possible for a long time even with the existing aio interface. True. -- Roman From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 413AFC43387 for ; Fri, 11 Jan 2019 18:06:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 173C920652 for ; Fri, 11 Jan 2019 18:06:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="dYoqMfGt" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389740AbfAKSGK (ORCPT ); Fri, 11 Jan 2019 13:06:10 -0500 Received: from mail-pf1-f178.google.com ([209.85.210.178]:40023 "EHLO mail-pf1-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733211AbfAKSFj (ORCPT ); Fri, 11 Jan 2019 13:05:39 -0500 Received: by mail-pf1-f178.google.com with SMTP id i12so7295342pfo.7 for ; Fri, 11 Jan 2019 10:05:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=KXI71lrceaBNGaDQEEbP1H6jV0DXjuTfPVEN31a43AM=; b=dYoqMfGtVjQcNU+3E7MlVSdFQIvgSNQg/mKJakIfBrqnW9RKFdoiPrKG7ZUKNfRE32 u1EK+uhVqSWdU/nGOQbiZsfBq0JxG35tVTlSFMYJ9eJ3w3ztKiVqXoNmR/7M7fSZOA7t VMiSFK/3kZf3fjlaBS+lZz6q0F/3FT2Tov0OdqT1z7L/OolEKex5RACWx6czaosgjcfM J9TdHCgAAJ0BIjB02r5X/CYzHvtLTp6M+jojXvJBMajpcL9QJwawgOlMyTAsN6m/nVq5 R4vPmv5Mo4Ol+MWwFadFrp026X1FE4vniUwiMOrheS7yNlguxM9S4sVJaSO2TOVduFni nB0w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=KXI71lrceaBNGaDQEEbP1H6jV0DXjuTfPVEN31a43AM=; b=Ue6eB6h2RvFtsKtOqgJerxbCdnmC87y7WrYeeXz7UYli1cOgZE+YidMafU9mXSkJ+J VJVo2hsMz6zVhw3CsFjP9xpQ/+FogY+YWHGfaU6oAlFUr65sETKnDP3BHR8wmG0ut3C6 cWafF85MBnxMsnvvHDh3WWZLPQIlQl2r8yN/7LzdrYTAv2A9Dg9v/iJ5iPmNxU6ykXCE Swudm9LT4ZpxUCfYlX2vuqdg/QkmB6HX2AKuUICyrsuYj/LEnN+EftNkJSwsp+NG2H37 GPQdVjneJmR5V84biXnm0tWrmPc7ZAXdRajTMgPoybgLatHgzCX/p2ecIthGkdzYeffn CARg== X-Gm-Message-State: AJcUukeQn/GIvlAqiLHg+cguNNpE2iKhJFMW0CIAmyvD7FvllgVHYIB7 OtBXiNPMrO47mYo5XOHZXtAjPST58zZyxw== X-Google-Smtp-Source: ALg8bN6hkP1XScA/ClUQm56dxW8M+zTYWamzzwAqnEwtjmrrliIIMYBL5VXAZnVjl7jXwT3OAKAwZQ== X-Received: by 2002:a62:220d:: with SMTP id i13mr15425956pfi.162.1547229938732; Fri, 11 Jan 2019 10:05:38 -0800 (PST) Received: from ?IPv6:2620:10d:c081:1131::1250? 
([2620:10d:c090:180::1:fab8]) by smtp.gmail.com with ESMTPSA id i184sm109108844pfc.41.2019.01.11.10.05.36 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 11 Jan 2019 10:05:37 -0800 (PST) Subject: Re: [PATCHSET v2] io_uring IO interface To: Roman Penyaev Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, linux-block-owner@vger.kernel.org References: <20190110024404.25372-1-axboe@kernel.dk> From: Jens Axboe Message-ID: <4a4a23c7-7842-0a12-2d46-c892cf2082bd@kernel.dk> Date: Fri, 11 Jan 2019 11:05:35 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190111180535.BtxzUcPp_uXpXWTclc9yyNH6_zoemOk7jJsjF-Mm8Ng@z> On 1/11/19 2:46 AM, Roman Penyaev wrote: > Hi Jens, > > That is interesting. Recently I sent an rfc related to epoll uring: > > https://lore.kernel.org/lkml/20190109164025.24554-1-rpenyaev@suse.de > > which can be mapped to userspace and all ready events can be consumed > from it directly. I am wondering, is it possible to make some common > API for all kind of ready events / urings, or it doesn't make any > sense? Not sure that's easily possible, even out of the two rings in io_uring, the sq and cq rings aren't the same. The latter is sequentially written, as completions come in. The former ring is actually indexes into the array, so you can submit things out of order when needed. -- Jens Axboe From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_PASS,UNPARSEABLE_RELAY autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CEF7CC43444 for ; Fri, 11 Jan 2019 18:19:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id A23DE205C9 for ; Fri, 11 Jan 2019 18:19:49 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="caJKE3jP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730968AbfAKSTs (ORCPT ); Fri, 11 Jan 2019 13:19:48 -0500 Received: from aserp2130.oracle.com ([141.146.126.79]:36478 "EHLO aserp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725446AbfAKSTs (ORCPT ); Fri, 11 Jan 2019 13:19:48 -0500 Received: from pps.filterd (aserp2130.oracle.com [127.0.0.1]) by aserp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id x0BI9bqT147138; Fri, 11 Jan 2019 18:19:39 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=to : cc : subject : from : references : date : in-reply-to : message-id : mime-version : content-type; s=corp-2018-07-02; bh=rDHptoK6ruzQkKoLu4moRY+mcxzk6gaCs+qYnTYfEEI=; b=caJKE3jPayo+uHEYJGTiis4V0p3Cqot0V/9dwpUIJ4ut9fSU/XPyye/e4G5QzQ50SeOk sAMiAnoiTzos8RYTZF85J6ny5raWktOQBn3oHEbwPUkhQYf0urEb39cKqYlYjKc30YiU 
To: Jens Axboe
Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com
Subject: Re: [PATCH 05/15] Add io_uring IO interface
From: "Martin K. Petersen"
Organization: Oracle Corporation
References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk>
Date: Fri, 11 Jan 2019 13:19:35 -0500
In-Reply-To: <20190110024404.25372-6-axboe@kernel.dk> (Jens Axboe's message of "Wed, 9 Jan 2019 19:43:54 -0700")
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

Jens,

> +struct io_uring_sqe {
> +	__u8	opcode;
> +	__u8	flags;
> +	__u16	ioprio;
> +	__s32	fd;
> +	__u64	off;
> +	union {
> +		void	*addr;
> +		__u64	__pad;
> +	};
> +	__u32	len;
> +	union {
> +		__kernel_rwf_t	rw_flags;
> +		__u32		__resv;
> +	};
> +};

A bit tongue in cheek and yet somewhat serious: While I'm super excited
about the 4 x 64 bitness of the sqe, where does the integrity buffer go?
Or the 128-bit KV store key. How do we extend this interface beyond the
flags?

-- 
Martin K.
Petersen
Oracle Linux Engineering

From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH 05/15] Add io_uring IO interface
To: "Martin K.
Petersen" Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> From: Jens Axboe Message-ID: Date: Fri, 11 Jan 2019 11:34:40 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190111183440.U6r93ddYp4iBHLnhNkNuX1CPjzn2Afqnhp-ii0eOiws@z> On 1/11/19 11:19 AM, Martin K. Petersen wrote: > > Jens, > >> +struct io_uring_sqe { >> + __u8 opcode; >> + __u8 flags; >> + __u16 ioprio; >> + __s32 fd; >> + __u64 off; >> + union { >> + void *addr; >> + __u64 __pad; >> + }; >> + __u32 len; >> + union { >> + __kernel_rwf_t rw_flags; >> + __u32 __resv; >> + }; >> +}; > > A bit tongue in cheek and yet somewhat serious: While I'm super excited > about the 4 x 64 bitness of the sqe, where does the integrity buffer go? > Or the 128-bit KV store key. How do we extend this interface beyond the > flags? For integrity buffers, how about we stash them on the side? The newer series has an extra system call, io_uring_register(), which is currently used for registering files and buffers for IO on the side. You could trivially tie an integrity buffer to an sqe through that. For KV, I thint that's an actually interesting use case (sorry, integrity), and we might just want to bite the bullet and extend the sqe to full 64 bytes. We're currently at 48 bytes, which leaves us with 16 bytes of space for KV, and other use cases. 
-- Jens Axboe From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9AAD0C43612 for ; Sun, 13 Jan 2019 16:22:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 6A53920836 for ; Sun, 13 Jan 2019 16:22:23 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="h9BCgz12" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726687AbfAMQWW (ORCPT ); Sun, 13 Jan 2019 11:22:22 -0500 Received: from mail-pf1-f194.google.com ([209.85.210.194]:32951 "EHLO mail-pf1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726543AbfAMQWV (ORCPT ); Sun, 13 Jan 2019 11:22:21 -0500 Received: by mail-pf1-f194.google.com with SMTP id c123so9226751pfb.0 for ; Sun, 13 Jan 2019 08:22:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=subject:from:to:cc:references:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=xGTUGaUsnkrHE3I4ytaWS6tuX9Ef328HUPpl0Zip+JA=; b=h9BCgz12uXcjyXiN19LiCb0lSGdfKtJ+KsvKDmu8XuK7K/zZsG/OSHKTGif4h9EH7c BXNm7eIQuUG7GQVQBuC1HqSmgyWtw+GNKghvU2fIvWsVza9UhYlMrT6at/SeTiCv3Pfe hOypNja0ktrzn6CVWx1MKd5cTllDBdOu5yymi1cdvCG4Z1mFsdkuqiZ7mwiblGGcT33t OiIJFu/Wt8kirJvhdpH8u+LPsIQzgi0yBiiX2GCFKkaoFDMYnqTofgE04Q1dwzJWZWp1 I6Nx2Izvh66ZpBThxiJ2HlywkbIl3CtxMeYppSxVkb8z8o1xNoUOtWMWapQB7ju4dlqA CqCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:references:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=xGTUGaUsnkrHE3I4ytaWS6tuX9Ef328HUPpl0Zip+JA=; b=Ux8OM8UlsVGSrMgPuQO28EpYxBcNWuUa+SAul/HidqM/ViMLsxrtaznswL9DhLItHA rWc36Bnk4FKQ1G+ivhV6+AqVFK/5UybJXottYqhb4rpuKZwYvRHrEuy1UhLm97hnT99u oqa8BeyJmfkffhKGhFXPcjTUoiUb/KR3rU1NBZkhRPEFV+HdhQM85NecGMmU7lxMzsVK zDHx5cL0KLBZQlXqSK4PJs2dU82gMgNHMeeDBYP9lqCFiw3lg3sPfcdBqpEJRW53k2av 8DRmLqaSy3BL+gi55kfsNND/89FMDwb7+MLA0qV3PbvHEvY8zMV4r/Lwqtg3SSKyfi6Y 6mIg== X-Gm-Message-State: AJcUukfN5VYUskD6+oSj+AgvJFkRsG1+r8DwnabxFyebv4mn5YRSQQrC dzLHpZaqBSQA++WJxH9Uzj9O6A== X-Google-Smtp-Source: ALg8bN43/u+99uzhQ/tJtqNbJtgZ7cajVfkImzFThqWf5uC+iqK6E5cQjQq3Ov1aoO3/7n98SvSVMA== X-Received: by 2002:a63:314d:: with SMTP id x74mr19887928pgx.10.1547396540496; Sun, 13 Jan 2019 08:22:20 -0800 (PST) Received: from [192.168.1.121] (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id e187sm106139906pfa.130.2019.01.13.08.22.17 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 13 Jan 2019 08:22:18 -0800 (PST) Subject: Re: [PATCH 05/15] Add io_uring IO interface From: Jens Axboe To: "Martin K. 
Petersen" Cc: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> Message-ID: <54976aac-bef2-880f-dc10-e3030189a08a@kernel.dk> Date: Sun, 13 Jan 2019 09:22:16 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190113162216.Qr-hMc-gNGaWP2QbLwVyqjr3162OTxo5ABSDN2e8nq0@z> On 1/11/19 11:34 AM, Jens Axboe wrote: > On 1/11/19 11:19 AM, Martin K. Petersen wrote: >> >> Jens, >> >>> +struct io_uring_sqe { >>> + __u8 opcode; >>> + __u8 flags; >>> + __u16 ioprio; >>> + __s32 fd; >>> + __u64 off; >>> + union { >>> + void *addr; >>> + __u64 __pad; >>> + }; >>> + __u32 len; >>> + union { >>> + __kernel_rwf_t rw_flags; >>> + __u32 __resv; >>> + }; >>> +}; >> >> A bit tongue in cheek and yet somewhat serious: While I'm super excited >> about the 4 x 64 bitness of the sqe, where does the integrity buffer go? >> Or the 128-bit KV store key. How do we extend this interface beyond the >> flags? > > For integrity buffers, how about we stash them on the side? The newer > series has an extra system call, io_uring_register(), which is currently > used for registering files and buffers for IO on the side. You could > trivially tie an integrity buffer to an sqe through that. > > For KV, I thint that's an actually interesting use case (sorry, > integrity), and we might just want to bite the bullet and extend the sqe > to full 64 bytes. We're currently at 48 bytes, which leaves us with 16 > bytes of space for KV, and other use cases. I bit the bullet and bumped the size. 64 bytes is a nicer size in terms of cachelines anyway, and I really doubt that 48 vs 64 bytes makes a size consumption problem for anyone. The buf_index is only used for the fixed buffers, which means that we have 16 bytes / 128 bits that we can grab for things like KV. 
From mboxrd@z Thu Jan 1 00:00:00 1970
To: Jens Axboe
Cc: "Martin K. Petersen" , linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org, hch@lst.de, jmoyer@redhat.com, avi@scylladb.com
Subject: Re: [PATCH 05/15] Add io_uring IO interface
Petersen" Organization: Oracle Corporation References: <20190110024404.25372-1-axboe@kernel.dk> <20190110024404.25372-6-axboe@kernel.dk> <54976aac-bef2-880f-dc10-e3030189a08a@kernel.dk> Date: Tue, 15 Jan 2019 12:31:59 -0500 In-Reply-To: <54976aac-bef2-880f-dc10-e3030189a08a@kernel.dk> (Jens Axboe's message of "Sun, 13 Jan 2019 09:22:16 -0700") Message-ID: User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=9137 signatures=668682 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=676 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1901150143 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Message-ID: <20190115173159.EgLIp37qm9owhteZF-InX6YIEioQ6vzfFSW8IT37MKg@z> Jens, > I bit the bullet and bumped the size. 64 bytes is a nicer size in terms > of cachelines anyway, and I really doubt that 48 vs 64 bytes makes a > size consumption problem for anyone. > > The buf_index is only used for the fixed buffers, which means that we have > 16 bytes / 128 bits that we can grab for things like KV. Great! -- Martin K. Petersen Oracle Linux Engineering