* [PATCHSET v1] io_uring IO interface
@ 2019-01-08 16:56 Jens Axboe
  2019-01-08 16:56 ` [PATCH 01/16] fs: add an iopoll method to struct file_operations Jens Axboe
                   ` (16 more replies)
  0 siblings, 17 replies; 62+ messages in thread
From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw)
  To: linux-fsdevel, linux-aio, linux-block, linux-arch; +Cc: hch, jmoyer, avi

After some arm twisting from Christoph, I finally caved and divorced the
aio-poll patches from aio/libaio itself. The io_uring interface itself
is useful and efficient, and after rebasing all the new goodies on top
of that, there was little reason to retain the aio connection. Hence
io_uring was born. This is what I previously called scqring for aio,
but now as a standalone entity.

Patch #5 adds the core of this interface, but in short, it has two main
data structures:

struct io_uring_iocb {
	__u8	opcode;
	__u8	flags;
	__u16	ioprio;
	__s32	fd;
	__u64	off;
	union {
		void	*addr;
		__u64	__pad;
	};
	__u32	len;
	union {
		__kernel_rwf_t	rw_flags;
		__u32		__resv;
	};
};

struct io_uring_event {
	__u64	index;		/* what iocb this event came from */
	__s32	res;		/* result code for this event */
	__u32	flags;
};

The SQ ring is an array of indexes into an array of io_uring_iocbs,
which describe the IO to be done. The CQ ring is an array of
io_uring_events, which describe a completion event. Both of these rings
are mapped into the application through mmap(2), at special magic
offsets. The application manipulates the ring directly, and then
communicates with the kernel through these two system calls:

io_uring_setup(entries, iovecs, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_iocbs.

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available.

I've started a liburing git repo for this, which contains some helpers
for doing IO without having to muck with the ring directly, setting up
an io_uring context, etc. Clone that here:

git://git.kernel.dk/liburing

In terms of usage, there's also a small test app here:

http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

and if you're into fio, there's an io_uring engine included with that as
well for test purposes.

In terms of features, this has everything that the prior aio-poll
postings did. Later patches add support for polled IO, fixed buffers,
kernel side submission and polling, buffered aio, etc. Also a number of
bug fixes in here from previous postings.

Series is against 5.0-rc1, and can also be found in my io_uring branch.
For now just x86-64 has the system calls wired up, and liburing also
only supports x86-64. The latter just needs system call numbers and
reasonable read/write barrier defines to work, however.

 Documentation/filesystems/vfs.txt      |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 block/bio.c                            |   59 +-
 fs/Makefile                            |    2 +-
 fs/block_dev.c                         |   19 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/gfs2/file.c                         |    2 +
 fs/io_uring.c                          | 1907 ++++++++++++++++++++++++
 fs/iomap.c                             |   48 +-
 fs/xfs/xfs_file.c                      |    1 +
 include/linux/bio.h                    |   14 +
 include/linux/blk_types.h              |    1 +
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |    6 +-
 include/linux/iomap.h                  |    1 +
 include/linux/syscalls.h               |    5 +
 include/uapi/linux/io_uring.h          |  115 ++
 kernel/sys_ni.c                        |    2 +
 19 files changed, 2173 insertions(+), 40 deletions(-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 62+ messages in thread
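The ring layout described in the cover letter can be modelled in user space. What follows is only a minimal sketch, not the kernel implementation: the `sketch_*` names, the fixed ring size of 8 and the single-threaded "kernel" step are all assumptions made for illustration, and real code would need memory barriers around the head/tail updates. It shows the two ideas from the text: SQ entries are indexes into a separate iocb array, and CQ events carry the index of the iocb they complete.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* assumed size; must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

/* Simplified stand-ins for io_uring_iocb and io_uring_event. */
struct sketch_iocb {
	uint8_t  opcode;
	int32_t  fd;
	uint64_t off;
	uint32_t len;
};

struct sketch_event {
	uint64_t index;   /* which iocb this event came from */
	int32_t  res;     /* result code for this event */
	uint32_t flags;
};

struct sketch_ring {
	uint32_t sq_head, sq_tail;
	uint32_t sq_array[RING_ENTRIES];          /* indexes into iocbs[] */
	struct sketch_iocb iocbs[RING_ENTRIES];
	uint32_t cq_head, cq_tail;
	struct sketch_event events[RING_ENTRIES];
};

/* Application side: queue iocbs[iocb_index]. Returns 0, or -1 if rejected. */
static int sketch_submit(struct sketch_ring *r, uint32_t iocb_index)
{
	if (iocb_index >= RING_ENTRIES)
		return -1;                        /* index out of range */
	if (r->sq_tail - r->sq_head == RING_ENTRIES)
		return -1;                        /* SQ ring is full */
	r->sq_array[r->sq_tail & RING_MASK] = iocb_index;
	r->sq_tail++;     /* real code needs a write barrier before this store */
	return 0;
}

/* Stand-in for the kernel: consume SQ entries and post one event for each. */
static unsigned sketch_kernel_run(struct sketch_ring *r)
{
	unsigned done = 0;

	while (r->sq_head != r->sq_tail &&
	       r->cq_tail - r->cq_head != RING_ENTRIES) {
		uint32_t idx = r->sq_array[r->sq_head & RING_MASK];
		struct sketch_event *ev = &r->events[r->cq_tail & RING_MASK];

		ev->index = idx;
		ev->res = (int32_t)r->iocbs[idx].len; /* pretend the IO fully succeeded */
		ev->flags = 0;
		r->sq_head++;
		r->cq_tail++;
		done++;
	}
	return done;
}

/* Application side: reap one completion. Returns 1 if *out was filled. */
static int sketch_reap(struct sketch_ring *r, struct sketch_event *out)
{
	if (r->cq_head == r->cq_tail)
		return 0;                         /* CQ ring is empty */
	*out = r->events[r->cq_head & RING_MASK];
	r->cq_head++;
	return 1;
}
```

Because the SQ ring holds indexes rather than the iocbs themselves, the application in this model can fill `iocbs[5]` and `iocbs[2]` and submit them in that order without moving any data; each reaped event then names the iocb it belongs to through `index`.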
* [PATCH 01/16] fs: add an iopoll method to struct file_operations
  2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe
@ 2019-01-08 16:56 ` Jens Axboe
  2019-01-08 16:56 ` [PATCH 02/16] block: wire up block device iopoll method Jens Axboe
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 62+ messages in thread
From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw)
  To: linux-fsdevel, linux-aio, linux-block, linux-arch
  Cc: hch, jmoyer, avi, Jens Axboe

From: Christoph Hellwig <hch@lst.de>

This new method is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct kiocb to store
the polling cookie.

TODO: we can probably union ki_cookie with the existing hint and I/O
priority fields to avoid struct kiocb growth.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
 
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..ccb0b7a63aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1786,6 +1787,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 62+ messages in thread
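To illustrate how an optional method such as ->iopoll is typically consumed, here is a small user-space sketch. It is not taken from the series: the `sketch_*` types, the `completed` field and the choice of -EOPNOTSUPP for a missing method are assumptions made for illustration. The point is only that a caller has to check whether the file type provides the method before polling through it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_kiocb {
	unsigned int ki_cookie;   /* polling cookie, named after the patch */
	int completed;            /* hypothetical: set once the IO was reaped */
};

struct sketch_file_ops {
	/* Optional: NULL when the file type cannot poll for completions. */
	int (*iopoll)(struct sketch_kiocb *kiocb, bool spin);
};

/* Hypothetical device poll: returns the number of completions found. */
static int sketch_dev_iopoll(struct sketch_kiocb *kiocb, bool spin)
{
	(void)spin;               /* a real implementation may busy-wait if set */
	if (kiocb->completed)
		return 0;         /* nothing new to report */
	kiocb->completed = 1;
	return 1;
}

/* Caller side: only poll when the method is actually provided. */
static int sketch_poll_once(const struct sketch_file_ops *ops,
			    struct sketch_kiocb *kiocb, bool spin)
{
	if (!ops->iopoll)
		return -EOPNOTSUPP;   /* assumed error code for this sketch */
	return ops->iopoll(kiocb, spin);
}
```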
* [PATCH 02/16] block: wire up block device iopoll method 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe 2019-01-08 16:56 ` [PATCH 01/16] fs: add an iopoll method to struct file_operations Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 03/16] block: add bio_set_polled() helper Jens Axboe ` (14 subsequent siblings) 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe From: Christoph Hellwig <hch@lst.de> Just call blk_poll on the iocb cookie, we can derive the block device from the inode trivially. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/block_dev.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/block_dev.c b/fs/block_dev.c index c546cdce77e6..5415579f3e14 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -279,6 +279,14 @@ struct blkdev_dio { static struct bio_set blkdev_dio_pool; +static int blkdev_iopoll(struct kiocb *kiocb, bool wait) +{ + struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host); + struct request_queue *q = bdev_get_queue(bdev); + + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait); +} + static void blkdev_bio_end_io(struct bio *bio) { struct blkdev_dio *dio = bio->bi_private; @@ -396,6 +404,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_opf |= REQ_HIPRI; qc = submit_bio(bio); + WRITE_ONCE(iocb->ki_cookie, qc); break; } @@ -2068,6 +2077,7 @@ const struct file_operations def_blk_fops = { .llseek = block_llseek, .read_iter = blkdev_read_iter, .write_iter = blkdev_write_iter, + .iopoll = blkdev_iopoll, .mmap = generic_file_mmap, .fsync = blkdev_fsync, .unlocked_ioctl = block_ioctl, -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH 03/16] block: add bio_set_polled() helper
  2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe
  2019-01-08 16:56 ` [PATCH 01/16] fs: add an iopoll method to struct file_operations Jens Axboe
  2019-01-08 16:56 ` [PATCH 02/16] block: wire up block device iopoll method Jens Axboe
@ 2019-01-08 16:56 ` Jens Axboe
  2019-01-10  9:43   ` Ming Lei
  2019-01-08 16:56 ` [PATCH 04/16] iomap: wire up the iopoll method Jens Axboe
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw)
  To: linux-fsdevel, linux-aio, linux-block, linux-arch
  Cc: hch, jmoyer, avi, Jens Axboe

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already
has async polled IO in-flight, but can't wait for them to complete
since polled requests must be actively found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5415579f3e14..2ebd2a0d7789 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);
 
 	qc = submit_bio(&bio);
 	for (;;) {
@@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);
 
 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 62+ messages in thread
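The flag logic of the helper above can be exercised outside the kernel. This is a minimal sketch under stated assumptions: the `SKETCH_REQ_*` bit values and the `sketch_bio` type are made up for illustration (the kernel's REQ_* flags are defined in its own headers), and the sync/async distinction is passed in as a plain boolean instead of being derived from a kiocb. It shows that every polled bio gets the high-priority flag, while only async submissions are additionally marked as non-blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SKETCH_REQ_HIPRI  (1u << 0)
#define SKETCH_REQ_NOWAIT (1u << 1)

struct sketch_bio {
	unsigned int bi_opf;
};

/*
 * Model of the helper: always mark the bio as polled, and forbid
 * blocking only when the submission is async, since an async polled
 * submitter has to come back and reap its own completions.
 */
static void sketch_bio_set_polled(struct sketch_bio *bio, bool is_sync)
{
	bio->bi_opf |= SKETCH_REQ_HIPRI;
	if (!is_sync)
		bio->bi_opf |= SKETCH_REQ_NOWAIT;
}
```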
* Re: [PATCH 03/16] block: add bio_set_polled() helper 2019-01-08 16:56 ` [PATCH 03/16] block: add bio_set_polled() helper Jens Axboe @ 2019-01-10 9:43 ` Ming Lei 2019-01-10 9:43 ` Ming Lei 2019-01-10 16:05 ` Jens Axboe 0 siblings, 2 replies; 62+ messages in thread From: Ming Lei @ 2019-01-10 9:43 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On Tue, Jan 08, 2019 at 09:56:32AM -0700, Jens Axboe wrote: > For the upcoming async polled IO, we can't sleep allocating requests. > If we do, then we introduce a deadlock where the submitter already > has async polled IO in-flight, but can't wait for them to complete > since polled requests must be active found and reaped. > > Utilize the helper in the blockdev DIRECT_IO code. > > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > fs/block_dev.c | 4 ++-- > include/linux/bio.h | 14 ++++++++++++++ > 2 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 5415579f3e14..2ebd2a0d7789 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, > task_io_account_write(ret); > } > if (iocb->ki_flags & IOCB_HIPRI) > - bio.bi_opf |= REQ_HIPRI; > + bio_set_polled(&bio, iocb); > > qc = submit_bio(&bio); > for (;;) { > @@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) > nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); > if (!nr_pages) { > if (iocb->ki_flags & IOCB_HIPRI) > - bio->bi_opf |= REQ_HIPRI; > + bio_set_polled(bio, iocb); > > qc = submit_bio(bio); > WRITE_ONCE(iocb->ki_cookie, qc); > diff --git a/include/linux/bio.h b/include/linux/bio.h > index 7380b094dcca..f6f0a2b3cbc8 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page, > > #endif /* CONFIG_BLK_DEV_INTEGRITY */ > > +/* > + * 
Mark a bio as polled. Note that for async polled IO, the caller must
> + * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
> + * We cannot block waiting for requests on polled IO, as those completions
> + * must be found by the caller. This is different than IRQ driven IO, where
> + * it's safe to wait for IO to complete.
> + */
> +static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
> +{
> +	bio->bi_opf |= REQ_HIPRI;
> +	if (!is_sync_kiocb(kiocb))
> +		bio->bi_opf |= REQ_NOWAIT;
> +}
> +

REQ_NOWAIT doesn't cover allocating split bio, is that an issue?

BTW, could you explain a bit about the deadlock in case of sleep from
request allocation?

Thanks,
Ming

^ permalink raw reply	[flat|nested] 62+ messages in thread
* Re: [PATCH 03/16] block: add bio_set_polled() helper 2019-01-10 9:43 ` Ming Lei 2019-01-10 9:43 ` Ming Lei @ 2019-01-10 16:05 ` Jens Axboe 2019-01-10 16:05 ` Jens Axboe 1 sibling, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-10 16:05 UTC (permalink / raw) To: Ming Lei Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On 1/10/19 2:43 AM, Ming Lei wrote: > On Tue, Jan 08, 2019 at 09:56:32AM -0700, Jens Axboe wrote: >> For the upcoming async polled IO, we can't sleep allocating requests. >> If we do, then we introduce a deadlock where the submitter already >> has async polled IO in-flight, but can't wait for them to complete >> since polled requests must be active found and reaped. >> >> Utilize the helper in the blockdev DIRECT_IO code. >> >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> --- >> fs/block_dev.c | 4 ++-- >> include/linux/bio.h | 14 ++++++++++++++ >> 2 files changed, 16 insertions(+), 2 deletions(-) >> >> diff --git a/fs/block_dev.c b/fs/block_dev.c >> index 5415579f3e14..2ebd2a0d7789 100644 >> --- a/fs/block_dev.c >> +++ b/fs/block_dev.c >> @@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, >> task_io_account_write(ret); >> } >> if (iocb->ki_flags & IOCB_HIPRI) >> - bio.bi_opf |= REQ_HIPRI; >> + bio_set_polled(&bio, iocb); >> >> qc = submit_bio(&bio); >> for (;;) { >> @@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) >> nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); >> if (!nr_pages) { >> if (iocb->ki_flags & IOCB_HIPRI) >> - bio->bi_opf |= REQ_HIPRI; >> + bio_set_polled(bio, iocb); >> >> qc = submit_bio(bio); >> WRITE_ONCE(iocb->ki_cookie, qc); >> diff --git a/include/linux/bio.h b/include/linux/bio.h >> index 7380b094dcca..f6f0a2b3cbc8 100644 >> --- a/include/linux/bio.h >> +++ b/include/linux/bio.h >> @@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page, >> >> #endif /* 
CONFIG_BLK_DEV_INTEGRITY */
>>
>> +/*
>> + * Mark a bio as polled. Note that for async polled IO, the caller must
>> + * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
>> + * We cannot block waiting for requests on polled IO, as those completions
>> + * must be found by the caller. This is different than IRQ driven IO, where
>> + * it's safe to wait for IO to complete.
>> + */
>> +static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
>> +{
>> +	bio->bi_opf |= REQ_HIPRI;
>> +	if (!is_sync_kiocb(kiocb))
>> +		bio->bi_opf |= REQ_NOWAIT;
>> +}
>> +
>
> REQ_NOWAIT doesn't cover allocating split bio, is that a issue?

Yes, that might be an issue. I'll look into what we should do about that,
for now it's not a huge problem.

> BTW, could you explain a bit about the deadlock in case of sleep from
> request allocation?

It's more a livelock I guess, but the issue is that with polled IO, we
don't get an IRQ. For normal IO, if you run out, you can just sleep and
wait for an IRQ to come in, trigger a completion (or multiple), which
will then wake you up. For polled IO, you have to find those completions.
Hence if you just go to sleep, nobody is going to find those completions
for you. You'll then be waiting forever for an event that will never
trigger.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 62+ messages in thread
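The livelock described in this reply can be modelled with a tiny request pool. The code below is a sketch under stated assumptions, not kernel code: `sketch_queue`, the pool size of 2 and the instant "completion" in `sketch_poll()` are all invented for illustration. It shows why allocation for polled IO has to fail with -EAGAIN instead of sleeping: the submitter is the only one who can reap completions, so it must poll and retry itself.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_NR_TAGS 2   /* assumed tiny request pool, to force exhaustion */

struct sketch_queue {
	int in_flight;     /* polled requests submitted but not yet reaped */
	int reaped;        /* completions found so far */
};

/* Non-blocking allocation, in the spirit of REQ_NOWAIT. */
static int sketch_submit_nowait(struct sketch_queue *q)
{
	if (q->in_flight == SKETCH_NR_TAGS)
		return -EAGAIN;    /* sleeping here would never be woken up */
	q->in_flight++;
	return 0;
}

/* Polling is the only way completions are found; no IRQ does it for us. */
static int sketch_poll(struct sketch_queue *q)
{
	int found = q->in_flight;

	q->reaped += found;
	q->in_flight = 0;
	return found;
}

/*
 * Submit nr requests, reaping by polling whenever the pool runs dry.
 * Returns how many times the submitter had to poll before it could go on.
 */
static int sketch_submit_all(struct sketch_queue *q, int nr)
{
	int polls = 0;

	while (nr > 0) {
		if (sketch_submit_nowait(q) == -EAGAIN) {
			sketch_poll(q);
			polls++;
			continue;
		}
		nr--;
	}
	return polls;
}
```

If `sketch_submit_nowait()` blocked instead of returning -EAGAIN, the loop in `sketch_submit_all()` would never reach `sketch_poll()`, which is the situation the reply describes.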
* [PATCH 04/16] iomap: wire up the iopoll method 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (2 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 03/16] block: add bio_set_polled() helper Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 05/16] Add io_uring IO interface Jens Axboe ` (12 subsequent siblings) 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe From: Christoph Hellwig <hch@lst.de> Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Modified to use bio_set_polled(). Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/gfs2/file.c | 2 ++ fs/iomap.c | 43 ++++++++++++++++++++++++++++--------------- fs/xfs/xfs_file.c | 1 + include/linux/iomap.h | 1 + 4 files changed, 32 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index a2dea5bc0427..58a768e59712 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, @@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, diff --git a/fs/iomap.c b/fs/iomap.c index a3088fae567b..4ee50b76b4a1 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1454,6 +1454,28 @@ struct iomap_dio { }; }; +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin) +{ + 
struct request_queue *q = READ_ONCE(kiocb->private); + + if (!q) + return 0; + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin); +} +EXPORT_SYMBOL_GPL(iomap_dio_iopoll); + +static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap, + struct bio *bio) +{ + atomic_inc(&dio->ref); + + if (dio->iocb->ki_flags & IOCB_HIPRI) + bio_set_polled(bio, dio->iocb); + + dio->submit.last_queue = bdev_get_queue(iomap->bdev); + dio->submit.cookie = submit_bio(bio); +} + static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; @@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio) } } -static blk_qc_t +static void iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, unsigned len) { @@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - if (dio->iocb->ki_flags & IOCB_HIPRI) - flags |= REQ_HIPRI; - get_page(page); __bio_add_page(bio, page, len, 0); bio_set_op_attrs(bio, REQ_OP_WRITE, flags); - - atomic_inc(&dio->ref); - return submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } static loff_t @@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio_set_pages_dirty(bio); } - if (dio->iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; - iov_iter_advance(dio->submit.iter, n); dio->size += n; @@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, copied += n; nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES); - - atomic_inc(&dio->ref); - - dio->submit.last_queue = bdev_get_queue(iomap->bdev); - dio->submit.cookie = submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } while (nr_pages); /* @@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (dio->flags & IOMAP_DIO_WRITE_FUA) dio->flags &= ~IOMAP_DIO_NEED_SYNC; + WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); + 
WRITE_ONCE(iocb->private, dio->submit.last_queue); + if (!atomic_dec_and_test(&dio->ref)) { if (!dio->wait_for_completion) return -EIOCBQUEUED; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e47425071e65..60c2da41f0fc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = { .write_iter = xfs_file_write_iter, .splice_read = generic_file_splice_read, .splice_write = iter_file_splice_write, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = xfs_file_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = xfs_file_compat_ioctl, diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 9a4258154b25..0fefb5455bda 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret, unsigned flags); ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops, iomap_dio_end_io_t end_io); +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin); #ifdef CONFIG_SWAP struct file; -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH 05/16] Add io_uring IO interface
  2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe
                   ` (3 preceding siblings ...)
  2019-01-08 16:56 ` [PATCH 04/16] iomap: wire up the iopoll method Jens Axboe
@ 2019-01-08 16:56 ` Jens Axboe
  2019-01-09 12:10   ` Christoph Hellwig
  2019-01-08 16:56 ` [PATCH 06/16] io_uring: support for IO polling Jens Axboe
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw)
  To: linux-fsdevel, linux-aio, linux-block, linux-arch
  Cc: hch, jmoyer, avi, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_iocb data structure, and completions
are generated in the form of io_uring_event data structures. The SQ
ring is an array of indexes into the io_uring_iocb array, which makes
it possible to submit a batch of IOs without them being contiguous in
the ring. The CQ ring is always contiguous, as completion events are
inherently unordered and can point to any io_uring_iocb.

Two new system calls are added for this:

io_uring_setup(entries, iovecs, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_iocbs.

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well.
The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 2 +- fs/io_uring.c | 849 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 101 +++ kernel/sys_ni.c | 2 + 6 files changed, 960 insertions(+), 1 deletion(-) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..9ef9987b4192 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o -obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_AIO) += aio.o io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..ae2b886282bb --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,849 @@ +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. 
+ * + * Copyright (C) 2019 Jens Axboe + */ +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/syscalls.h> +#include <linux/refcount.h> +#include <linux/uio.h> + +#include <linux/sched/signal.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/mmu_context.h> +#include <linux/percpu.h> +#include <linux/slab.h> +#include <linux/workqueue.h> +#include <linux/blkdev.h> +#include <linux/anon_inodes.h> + +#include <linux/uaccess.h> +#include <linux/nospec.h> + +#include <uapi/linux/io_uring.h> + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[0]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_event events[0]; +}; + +struct io_iocb_ring { + struct io_sq_ring *ring; + unsigned entries; + unsigned ring_mask; + struct io_uring_iocb *iocbs; +}; + +struct io_event_ring { + struct io_cq_ring *ring; + unsigned entries; + unsigned ring_mask; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + unsigned int max_reqs; + + struct io_iocb_ring sq_ring; + struct io_event_ring cq_ring; + + struct work_struct work; + + struct { + struct mutex uring_lock; + } ____cacheline_aligned_in_smp; + + struct { + struct mutex ring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct fsync_iocb { + struct work_struct work; + struct file *file; + bool datasync; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct fsync_iocb fsync; + }; + + struct io_ring_ctx *ki_ctx; + unsigned long ki_index; + struct list_head ki_list; + unsigned long ki_flags; +}; + +#define IO_PLUG_THRESHOLD 2 + 
+static struct kmem_cache *kiocb_cachep, *ioctx_cachep; + +static const struct file_operations io_scqring_fops; + +static void io_ring_ctx_free(struct work_struct *work); +static void io_ring_ctx_ref_free(struct percpu_ref *ref); + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kmem_cache_zalloc(ioctx_cachep, GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kmem_cache_free(ioctx_cachep, ctx); + return NULL; + } + + ctx->flags = p->flags; + ctx->max_reqs = p->sq_entries; + + INIT_WORK(&ctx->work, io_ring_ctx_free); + + spin_lock_init(&ctx->completion_lock); + mutex_init(&ctx->ring_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring.ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring.ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->events[tail & ctx->cq_ring.ring_mask]; +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + if (!req) + return NULL; + + percpu_ref_get(&ctx->refs); + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static inline void iocb_put(struct io_kiocb *iocb) +{ + percpu_ref_put(&iocb->ki_ctx->refs); + kmem_cache_free(kiocb_cachep, iocb); +} + +static void io_complete_iocb(struct io_ring_ctx *ctx, struct io_kiocb *iocb) +{ + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); + iocb_put(iocb); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = 
file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_fill_event(struct io_uring_event *ev, struct io_kiocb *kiocb, + long res, unsigned flags) +{ + ev->index = kiocb->ki_index; + ev->res = res; + ev->flags = flags; +} + +static void io_cqring_fill_event(struct io_kiocb *iocb, long res, + unsigned ev_flags) +{ + struct io_ring_ctx *ctx = iocb->ki_ctx; + struct io_uring_event *ev; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + spin_lock_irqsave(&ctx->completion_lock, flags); + ev = io_peek_cqring(ctx); + if (ev) { + io_fill_event(ev, iocb, res, ev_flags); + io_inc_cqring(ctx); + } else + ctx->cq_ring.ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_complete_scqring(struct io_kiocb *iocb, long res, unsigned flags) +{ + io_cqring_fill_event(iocb, res, flags); + io_complete_iocb(iocb->ki_ctx, iocb); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_complete_scqring(iocb, res, 0); +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +{ + struct kiocb *req = &kiocb->rw; + int ret; + + req->ki_filp = fget(iocb->fd); + if (unlikely(!req->ki_filp)) + return -EBADF; + req->ki_pos = iocb->off; + req->ki_flags = iocb_flags(req->ki_filp); + req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp)); + if (iocb->ioprio) { + ret = ioprio_check_cap(iocb->ioprio); + if (ret) + goto out_fput; + + req->ki_ioprio = iocb->ioprio; + } else + req->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(req, 
iocb->rw_flags); + if (unlikely(ret)) + goto out_fput; + + /* no one is going to poll for this I/O */ + req->ki_flags &= ~IOCB_HIPRI; + req->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(req->ki_filp); + return ret; +} + +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, + struct iovec **iovec, struct iov_iter *iter) +{ + void __user *buf = (void __user *)(uintptr_t)iocb->addr; + size_t ret; + + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); + *iovec = NULL; + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. + */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, iocb); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = io_setup_rw(READ, iocb, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) + io_rw_done(req, call_read_iter(file, req, &iter)); + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *kiocb, + const struct io_uring_iocb *iocb) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = 
io_prep_rw(kiocb, iocb); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = io_setup_rw(WRITE, iocb, &iovec, &iter); + if (ret) + goto out_fput; + ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. + */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); + } + req->ki_flags |= IOCB_WRITE; + io_rw_done(req, call_write_iter(file, req, &iter)); + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static void io_fsync_work(struct work_struct *work) +{ + struct fsync_iocb *req = container_of(work, struct fsync_iocb, work); + struct io_kiocb *iocb = container_of(req, struct io_kiocb, fsync); + int ret; + + ret = vfs_fsync(req->file, req->datasync); + fput(req->file); + + io_complete_scqring(iocb, ret, 0); +} + +static int io_fsync(struct fsync_iocb *req, const struct io_uring_iocb *iocb, + bool datasync) +{ + if (unlikely(iocb->addr || iocb->off || iocb->len || iocb->__resv)) + return -EINVAL; + + req->file = fget(iocb->fd); + if (unlikely(!req->file)) + return -EBADF; + if (unlikely(!req->file->f_op->fsync)) { + fput(req->file); + return -EINVAL; + } + + req->datasync = datasync; + INIT_WORK(&req->work, io_fsync_work); + schedule_work(&req->work); + return 0; +} + +static int __io_submit_one(struct io_ring_ctx *ctx, + const struct io_uring_iocb *iocb, + unsigned long ki_index) +{ + struct io_kiocb *req; + ssize_t ret; + + /* enforce forwards compatibility 
on users */ + if (unlikely(iocb->flags)) + return -EINVAL; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = -EINVAL; + if (ki_index >= ctx->max_reqs) + goto out_put_req; + req->ki_index = ki_index; + + ret = -EINVAL; + switch (iocb->opcode) { + case IORING_OP_READ: + ret = io_read(req, iocb); + break; + case IORING_OP_WRITE: + ret = io_write(req, iocb); + break; + case IORING_OP_FSYNC: + ret = io_fsync(&req->fsync, iocb, false); + break; + case IORING_OP_FDSYNC: + ret = io_fsync(&req->fsync, iocb, true); + break; + default: + ret = -EINVAL; + break; + } + + /* + * If ret is 0, ->ki_complete() has either been called, or will get + * called later on. Anything else, we need to free the req. + */ + if (ret) + goto out_put_req; + return 0; +out_put_req: + iocb_put(req); + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring.ring; + + ring->r.head++; + smp_wmb(); +} + +static const struct io_uring_iocb *io_peek_sqring(struct io_ring_ctx *ctx, + unsigned *iocb_index) +{ + struct io_sq_ring *ring = ctx->sq_ring.ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return NULL; + + head = ring->array[head & ctx->sq_ring.ring_mask]; + if (head < ctx->sq_ring.entries) { + *iocb_index = head; + return &ctx->sq_ring.iocbs[head]; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return NULL; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + const struct io_uring_iocb *iocb; + unsigned iocb_index; + + iocb = io_peek_sqring(ctx, &iocb_index); + if (!iocb) + break; + + ret = __io_submit_one(ctx, iocb, iocb_index); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } + + if (to_submit > IO_PLUG_THRESHOLD) + 
blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring.ring; + DEFINE_WAIT(wait); + int ret; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring.ring) { + page_frag_free(ctx->sq_ring.ring); + ctx->sq_ring.ring = NULL; + } + if (ctx->sq_ring.iocbs) { + page_frag_free(ctx->sq_ring.iocbs); + ctx->sq_ring.iocbs = NULL; + } + if (ctx->cq_ring.ring) { + page_frag_free(ctx->cq_ring.ring); + ctx->cq_ring.ring = NULL; + } +} + +static void io_ring_ctx_free(struct work_struct *work) +{ + struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, work); + + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kmem_cache_free(ioctx_cachep, ctx); +} + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + schedule_work(&ctx->work); +} + 
+static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + percpu_ref_kill(&ctx->refs); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring.ring; + break; + case IORING_OFF_IOCB: + ptr = ctx->sq_ring.iocbs; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring.ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (f.file) { + struct io_ring_ctx *ctx; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto err; + + ctx = f.file->private_data; + ret = -EBUSY; + if (!mutex_trylock(&ctx->uring_lock)) + goto err; + + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); +err: + fdput(f); + } + + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + + sq_ring = io_mem_alloc(struct_size(sq_ring, 
array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring.ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_ring.ring_mask = sq_ring->ring_mask; + ctx->sq_ring.entries = sq_ring->ring_entries; + + ctx->sq_ring.iocbs = io_mem_alloc(sizeof(struct io_uring_iocb) * + p->sq_entries); + if (!ctx->sq_ring.iocbs) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, events, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring.ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_ring.ring_mask = cq_ring->ring_mask; + ctx->cq_ring.entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return -ENOMEM; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.events = offsetof(struct io_cq_ring, events); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. 
It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the iocbs are only used at submission time. This allows for + * some flexibility in overcommitting a bit. + */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + percpu_ref_kill(&ctx->refs); + return ret; +} + +/* + * sys_io_uring_setup: + * Sets up an aio uring context, and returns the fd. Applications asks + * for a ring size, we return the actual sq/cq ring sizes (among other + * things) in the params structure passed in. + */ +SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + if (iovecs) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + kiocb_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + ioctx_cachep = KMEM_CACHE(io_ring_ctx, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..6d40939f65cd 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include <linux/types.h> #include <linux/aio_abi.h> @@ -309,6 +310,10 @@ asmlinkage 
long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, struct iovec __user *iov, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..c31ac84d9f53 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,101 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include <linux/fs.h> +#include <linux/types.h> + +/* + * IO submission data structure + */ +struct io_uring_iocb { + __u8 opcode; + __u8 flags; + __u16 ioprio; + __s32 fd; + __u64 off; + union { + void *addr; + __u64 __pad; + }; + __u32 len; + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; +}; + +#define IORING_OP_READ 1 +#define IORING_OP_WRITE 2 +#define IORING_OP_FSYNC 3 +#define IORING_OP_FDSYNC 4 + +/* + * IO completion data structure + */ +struct io_uring_event { + __u64 index; /* what iocb this event came from */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * io_uring_event->flags + */ +#define IOEV_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_IOCB 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + 
__u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 events; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
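The ring layout described in the patch above can be modelled in plain userspace C. The following is only a sketch under stated assumptions, not the kernel interface: it runs in a single process, makes no system calls, and every name in it (model_ring, app_submit, kernel_consume, app_reap) is hypothetical. C11 release/acquire atomics stand in for the smp_wmb()/smp_rmb() pairs in the patch, and the CQ full check uses tail - head == entries, which is a choice made for this model rather than the test in io_peek_cqring(). What it does show is the SQ indirection (the SQ ring holds indexes into a separate iocb array), the power-of-two masking, and the dropped/overflow counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SQ_ENTRIES 4u                 /* power of two, as after roundup_pow_of_two() */
#define CQ_ENTRIES (2u * SQ_ENTRIES)  /* the patch sizes the CQ ring at twice the SQ ring */

/* Hypothetical stand-ins for io_uring_iocb and io_uring_event. */
struct model_iocb  { uint8_t opcode; int32_t fd; uint64_t off; };
struct model_event { uint64_t index; int32_t res; uint32_t flags; };

struct model_ring {
	_Atomic uint32_t sq_head, sq_tail;
	_Atomic uint32_t cq_head, cq_tail;
	uint32_t dropped;                      /* invalid SQ indexes seen */
	uint32_t overflow;                     /* completions lost to a full CQ */
	uint32_t sq_array[SQ_ENTRIES];         /* SQ ring: indexes into iocbs[] */
	struct model_iocb iocbs[SQ_ENTRIES];
	struct model_event events[CQ_ENTRIES]; /* CQ ring: events stored inline */
};

static void ring_init(struct model_ring *r)
{
	atomic_init(&r->sq_head, 0);
	atomic_init(&r->sq_tail, 0);
	atomic_init(&r->cq_head, 0);
	atomic_init(&r->cq_tail, 0);
	r->dropped = 0;
	r->overflow = 0;
}

/* Application side: queue iocbs[iocb_index]. Returns 0, or -1 if the SQ is full. */
static int app_submit(struct model_ring *r, uint32_t iocb_index)
{
	uint32_t tail = atomic_load_explicit(&r->sq_tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->sq_head, memory_order_acquire);

	if (tail - head == SQ_ENTRIES)
		return -1;
	r->sq_array[tail & (SQ_ENTRIES - 1)] = iocb_index;
	/* Publish the slot before the consumer can observe the new tail. */
	atomic_store_explicit(&r->sq_tail, tail + 1, memory_order_release);
	return 0;
}

/*
 * "Kernel" side: consume one SQ entry and post its completion.
 * Returns 1 if an entry was consumed, 0 if the SQ is empty,
 * -1 if the entry held an out-of-range index and was dropped.
 */
static int kernel_consume(struct model_ring *r)
{
	uint32_t head = atomic_load_explicit(&r->sq_head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->sq_tail, memory_order_acquire);
	uint32_t idx, ctail, chead;

	if (head == tail)
		return 0;
	idx = r->sq_array[head & (SQ_ENTRIES - 1)];
	atomic_store_explicit(&r->sq_head, head + 1, memory_order_release);
	if (idx >= SQ_ENTRIES) {
		r->dropped++;
		return -1;
	}

	ctail = atomic_load_explicit(&r->cq_tail, memory_order_relaxed);
	chead = atomic_load_explicit(&r->cq_head, memory_order_acquire);
	if (ctail - chead == CQ_ENTRIES) {
		r->overflow++;
		return 1;
	}
	/* Fill the event first, then publish it with a release store of the tail. */
	r->events[ctail & (CQ_ENTRIES - 1)] =
		(struct model_event){ .index = idx, .res = 0, .flags = 0 };
	atomic_store_explicit(&r->cq_tail, ctail + 1, memory_order_release);
	return 1;
}

/* Application side: reap one completion. Returns 1 if *out was filled, else 0. */
static int app_reap(struct model_ring *r, struct model_event *out)
{
	uint32_t head = atomic_load_explicit(&r->cq_head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&r->cq_tail, memory_order_acquire);

	if (head == tail)
		return 0;
	*out = r->events[head & (CQ_ENTRIES - 1)];
	atomic_store_explicit(&r->cq_head, head + 1, memory_order_release);
	return 1;
}
```

Because the SQ ring carries indexes rather than the iocbs themselves, a caller of this model can fill iocbs[2] and iocbs[0] and submit them in that order; each completion then reports the originating slot in its index field.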
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-08 16:56 ` [PATCH 05/16] Add io_uring IO interface Jens Axboe @ 2019-01-09 12:10 ` Christoph Hellwig 2019-01-09 15:53 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 12:10 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi > index 293733f61594..9ef9987b4192 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o > obj-$(CONFIG_TIMERFD) += timerfd.o > obj-$(CONFIG_EVENTFD) += eventfd.o > obj-$(CONFIG_USERFAULTFD) += userfaultfd.o > -obj-$(CONFIG_AIO) += aio.o > +obj-$(CONFIG_AIO) += aio.o io_uring.o It is probably worth adding a new config symbol for the uring as no code is shared with aio. > diff --git a/fs/io_uring.c b/fs/io_uring.c > new file mode 100644 > index 000000000000..ae2b886282bb > --- /dev/null > +++ b/fs/io_uring.c > @@ -0,0 +1,849 @@ > +/* > + * Shared application/kernel submission and completion ring pairs, for > + * supporting fast/efficient IO. > + * > + * Copyright (C) 2019 Jens Axboe > + */ Add an SPDX header to all new files, please. > +struct io_sq_ring { > + struct io_uring r; > + u32 ring_mask; > + u32 ring_entries; > + u32 dropped; > + u32 flags; > + u32 array[0]; > +}; field[0] is a legacy gcc extension, the proper C99+ way is field[]. > + > +struct io_iocb_ring { > + struct io_sq_ring *ring; > + unsigned entries; > + unsigned ring_mask; > + struct io_uring_iocb *iocbs; > +}; > + > +struct io_event_ring { > + struct io_cq_ring *ring; > + unsigned entries; > + unsigned ring_mask; > +}; Btw, do we really need these structures? 
It would seem simpler to just embed them into the containing structure as: struct io_sq_ring *sq_ring; unsigned sq_ring_entries; unsigned sq_ring_mask; struct io_uring_iocb *sq_ring_iocbs; struct io_cq_ring *cq_ring; unsigned cq_ring_entries; unsigned cq_ring_mask; > +struct io_ring_ctx { > + struct percpu_ref refs; > + > + unsigned int flags; > + unsigned int max_reqs; max_reqs can probably go away in favour of the sq ring nr_entries field. > + struct io_iocb_ring sq_ring; > + struct io_event_ring cq_ring; > + > + struct work_struct work; > + > + struct { > + struct mutex uring_lock; > + } ____cacheline_aligned_in_smp; > + > + struct { > + struct mutex ring_lock; > + wait_queue_head_t wait; > + } ____cacheline_aligned_in_smp; > + > + struct { > + spinlock_t completion_lock; > + } ____cacheline_aligned_in_smp; > +}; Can you take a deep look at whether we need to keep all of ring_lock, completion_lock and the later added poll locking? From a quick look it isn't entirely clear what the locking strategy on the completion side is. It needs to be documented and can hopefully be simplified. > +struct fsync_iocb { > + struct work_struct work; > + struct file *file; > + bool datasync; > +}; Do we actually need this? Can't we just reuse the later thread offload for fsync? Maybe just add fsync support once everything else is done to make that simpler. > +static const struct file_operations io_scqring_fops; > + > +static void io_ring_ctx_free(struct work_struct *work); > +static void io_ring_ctx_ref_free(struct percpu_ref *ref); Can you try to avoid needing the forward declarations? (except for the fops, where we probably need it). > > + > +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) > +{ > + struct io_ring_ctx *ctx; > + > + ctx = kmem_cache_zalloc(ioctx_cachep, GFP_KERNEL); > + if (!ctx) > + return NULL; Do we really need an explicit slab for the contexts? 
> +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) Maybe replace the req name with something matching the structure name? (and more on the structure name later). > +{ > + struct io_kiocb *req; > + > + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); > + if (!req) > + return NULL; > + > + percpu_ref_get(&ctx->refs); > + req->ki_ctx = ctx; > + INIT_LIST_HEAD(&req->ki_list); We never do a list_empty check on ki_list, so there should be no need to initialize it. > +static void io_fill_event(struct io_uring_event *ev, struct io_kiocb *kiocb, > + long res, unsigned flags) > +{ > + ev->index = kiocb->ki_index; > + ev->res = res; > + ev->flags = flags; > +} Probably no need for this helper. > +static void io_complete_scqring(struct io_kiocb *iocb, long res, unsigned flags) > +{ > + io_cqring_fill_event(iocb, res, flags); > + io_complete_iocb(iocb->ki_ctx, iocb); > +} Probably no need for this helper either. > + ret = kiocb_set_rw_flags(req, iocb->rw_flags); > + if (unlikely(ret)) > + goto out_fput; > + > + /* no one is going to poll for this I/O */ > + req->ki_flags &= ~IOCB_HIPRI; Now that we don't have the aio legacy to deal with, should we just reject IOCB_HIPRI on a non-polled context? > +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, > + struct iovec **iovec, struct iov_iter *iter) > +{ > + void __user *buf = (void __user *)(uintptr_t)iocb->addr; > + size_t ret; > + > + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); > + *iovec = NULL; > + return ret; > +} Is there any point in supporting non-vectored operations here? > + if (S_ISREG(file_inode(file)->i_mode)) { > + __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true); > + __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); > + } Overly long lines. > +static int __io_submit_one(struct io_ring_ctx *ctx, > + const struct io_uring_iocb *iocb, > + unsigned long ki_index) Maybe call this io_ring_submit_one? 
Or generally find a nice prefix for all the functions in this file? > + f = fdget(fd); > + if (f.file) { > + struct io_ring_ctx *ctx; Please just return early on failure instead of forcing another level of indentation. > + > + ctx->sq_ring.iocbs = io_mem_alloc(sizeof(struct io_uring_iocb) * > + p->sq_entries); Use array_size(). > +/* > + * sys_io_uring_setup: > + * Sets up an aio uring context, and returns the fd. Applications asks > + * for a ring size, we return the actual sq/cq ring sizes (among other > + * things) in the params structure passed in. > + */ Can we drop this odd aio-style comment format? In fact the syscall documentation probably just belongs in the man page anyway. Same for the uring_enter syscall. > +struct io_uring_iocb { Should we just call this io_uring_sqe? > +/* > + * IO completion data structure > + */ > +struct io_uring_event { > + __u64 index; /* what iocb this event came from */ > + __s32 res; /* result code for this event */ > + __u32 flags; > +}; io_uring_cqe? ^ permalink raw reply [flat|nested] 62+ messages in thread
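Two of the review points above, the C99 field[] form and the request to use array_size(), can be illustrated outside the kernel. What follows is a minimal userspace sketch, assuming nothing beyond standard C: sq_ring_model, checked_mul and sq_ring_alloc are hypothetical names, and checked_mul is only a stand-in for the idea of an overflow-checked size computation. It does not reproduce the kernel's array_size() or struct_size() helpers, which saturate to SIZE_MAX on overflow rather than returning an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of io_sq_ring with a C99 flexible array member. */
struct sq_ring_model {
	uint32_t ring_mask;
	uint32_t ring_entries;
	uint32_t array[];	/* C99 form; "array[0]" is the older GNU extension */
};

/* Store n * size in *out. Returns 0, or -1 if the product would overflow size_t. */
static int checked_mul(size_t n, size_t size, size_t *out)
{
	if (size != 0 && n > SIZE_MAX / size)
		return -1;
	*out = n * size;
	return 0;
}

/* Allocate a zeroed ring with 'entries' slots; entries must be a power of two. */
static struct sq_ring_model *sq_ring_alloc(uint32_t entries)
{
	struct sq_ring_model *ring;
	size_t bytes;

	if (entries == 0 || (entries & (entries - 1)) != 0)
		return NULL;
	if (checked_mul(entries, sizeof(ring->array[0]), &bytes))
		return NULL;
	if (bytes > SIZE_MAX - sizeof(*ring))
		return NULL;

	ring = calloc(1, sizeof(*ring) + bytes);
	if (!ring)
		return NULL;
	ring->ring_entries = entries;
	ring->ring_mask = entries - 1;
	return ring;
}
```

The flexible array member contributes nothing to sizeof(struct sq_ring_model), so the allocation size is the header plus the checked element count times the element size, which is the same shape of computation the patch performs with struct_size().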
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-09 12:10 ` Christoph Hellwig @ 2019-01-09 15:53 ` Jens Axboe 2019-01-09 18:30 ` Christoph Hellwig 0 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-09 15:53 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 5:10 AM, Christoph Hellwig wrote: >> index 293733f61594..9ef9987b4192 100644 >> --- a/fs/Makefile >> +++ b/fs/Makefile >> @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o >> obj-$(CONFIG_TIMERFD) += timerfd.o >> obj-$(CONFIG_EVENTFD) += eventfd.o >> obj-$(CONFIG_USERFAULTFD) += userfaultfd.o >> -obj-$(CONFIG_AIO) += aio.o >> +obj-$(CONFIG_AIO) += aio.o io_uring.o > > It is probably worth adding a new config symbol for the uring as no > code is shared with aio. Agreed, done. >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> new file mode 100644 >> index 000000000000..ae2b886282bb >> --- /dev/null >> +++ b/fs/io_uring.c >> @@ -0,0 +1,849 @@ >> +/* >> + * Shared application/kernel submission and completion ring pairs, for >> + * supporting fast/efficient IO. >> + * >> + * Copyright (C) 2019 Jens Axboe >> + */ > > Add an SPDX header to all new files, please. Done >> +struct io_sq_ring { >> + struct io_uring r; >> + u32 ring_mask; >> + u32 ring_entries; >> + u32 dropped; >> + u32 flags; >> + u32 array[0]; >> +}; > > field[0] is a legacy gcc extension, the proper C99+ way is field[]. Fixed >> +struct io_iocb_ring { >> + struct io_sq_ring *ring; >> + unsigned entries; >> + unsigned ring_mask; >> + struct io_uring_iocb *iocbs; >> +}; >> + >> +struct io_event_ring { >> + struct io_cq_ring *ring; >> + unsigned entries; >> + unsigned ring_mask; >> +}; > > Btw, do we really need these structures? 
It would seem simpler > to just embed them into the containing structure as: > > struct io_sq_ring *sq_ring; > unsigned sq_ring_entries; > unsigned sq_ring_mask; > struct io_uring_iocb *sq_ring_iocbs; > > struct io_cq_ring *cq_ring; > unsigned cq_ring_entries; > unsigned cq_ring_mask; Yeah, I guess we use it directly in so few places that we may as well just get rid of the structs for these. > > >> +struct io_ring_ctx { >> + struct percpu_ref refs; >> + >> + unsigned int flags; >> + unsigned int max_reqs; > > max_reqs can probably go away in favour of the sq ring nr_entries > field. Indeed, killed. >> + struct io_iocb_ring sq_ring; >> + struct io_event_ring cq_ring; >> + >> + struct work_struct work; >> + >> + struct { >> + struct mutex uring_lock; >> + } ____cacheline_aligned_in_smp; >> + >> + struct { >> + struct mutex ring_lock; >> + wait_queue_head_t wait; >> + } ____cacheline_aligned_in_smp; >> + >> + struct { >> + spinlock_t completion_lock; >> + } ____cacheline_aligned_in_smp; >> +}; > > Can you take a deep look at whether we need to keep all of ring_lock, > completion_lock and the later added poll locking? From a quick look > it isn't entirely clear what the locking strategy on the completion > side is. It needs to be documented and can hopefully be simplified. I think we just need to kill ring_lock, it's actually not even used. I'll take a closer look at the locking as well. > >> +struct fsync_iocb { >> + struct work_struct work; >> + struct file *file; >> + bool datasync; >> +}; > > Do we actually need this? Can't we just reuse the later thread > offload for fsync? Maybe just add fsync support once everything else > is done to make that simpler. We can just use the sq thread, but we don't always have that backing. I guess we could create it lazily if an fsync comes in. I'll take a look at adding that as a separate thing. 
>> +static const struct file_operations io_scqring_fops; >> + >> +static void io_ring_ctx_free(struct work_struct *work); >> +static void io_ring_ctx_ref_free(struct percpu_ref *ref); > > Can you try to avoid needing the forward declarations? (except for the > fops, where we probably need it). I got rid of one of them in my current tree already, I'll see if I can dump the other one. >> +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) >> +{ >> + struct io_ring_ctx *ctx; >> + >> + ctx = kmem_cache_zalloc(ioctx_cachep, GFP_KERNEL); >> + if (!ctx) >> + return NULL; > > Do we really need an explicit slab for the contexts? Not sure, guess it depends on the frequency of them. But I suspect that it won't matter one bit, I'll kill this slab. > >> +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) > > Maybe replace the req name with something matching the structure > name? (and more on the structure name later). Makes sense. >> +{ >> + struct io_kiocb *req; >> + >> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); >> + if (!req) >> + return NULL; >> + >> + percpu_ref_get(&ctx->refs); >> + req->ki_ctx = ctx; >> + INIT_LIST_HEAD(&req->ki_list); > > We never do a list_empty check on ki_list, so there should be no need > to initialize it. Killed >> +static void io_fill_event(struct io_uring_event *ev, struct io_kiocb *kiocb, >> + long res, unsigned flags) >> +{ >> + ev->index = kiocb->ki_index; >> + ev->res = res; >> + ev->flags = flags; >> +} > > Probably no need for this helper. Killed. Also realized that we're missing a store ordering barrier after filling in 'ev', but before incrementing the ring. >> +static void io_complete_scqring(struct io_kiocb *iocb, long res, unsigned flags) >> +{ >> + io_cqring_fill_event(iocb, res, flags); >> + io_complete_iocb(iocb->ki_ctx, iocb); >> +} > > Probably no need for this helper either.
Killed > >> + ret = kiocb_set_rw_flags(req, iocb->rw_flags); >> + if (unlikely(ret)) >> + goto out_fput; >> + >> + /* no one is going to poll for this I/O */ >> + req->ki_flags &= ~IOCB_HIPRI; > > Now that we don't have the aio legacy to deal with should we just > reject IOCB_HIPRI on a non-polled context? Yes I think so, we don't have any legacy behavior to adhere to. > >> +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, >> + struct iovec **iovec, struct iov_iter *iter) >> +{ >> + void __user *buf = (void __user *)(uintptr_t)iocb->addr; >> + size_t ret; >> + >> + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); >> + *iovec = NULL; >> + return ret; >> +} > > Is there any point in supporting non-vectored operations here? Not sure I follow? >> + if (S_ISREG(file_inode(file)->i_mode)) { >> + __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true); >> + __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); >> + } > > Overly long lines. Fixed >> +static int __io_submit_one(struct io_ring_ctx *ctx, >> + const struct io_uring_iocb *iocb, >> + unsigned long ki_index) > > Maybe call this io_ring_submit_one? Or generally find a nice prefix > for all the functions in this file? Agree, some of this is leftover cruft from the aio side. I'll clean it up. >> + f = fdget(fd); >> + if (f.file) { >> + struct io_ring_ctx *ctx; > > Please just return early on failure instead of forcing another level > of indentation. Sure, done. > >> + >> + ctx->sq_ring.iocbs = io_mem_alloc(sizeof(struct io_uring_iocb) * >> + p->sq_entries); > > Use array_size(). Done >> +/* >> + * sys_io_uring_setup: >> + * Sets up an aio uring context, and returns the fd. Applications asks >> + * for a ring size, we return the actual sq/cq ring sizes (among other >> + * things) in the params structure passed in. >> + */ > > Can we drop this odd aio-style comment format? In fact the syscall > documentation probably just belongs in the man page only anyway.
> > Same for the uring_enter syscall. Sure, not a big deal to me, dropped. >> +struct io_uring_iocb { > > Should we just call this io_uring_sqe? > >> +/* >> + * IO completion data structure >> + */ >> +struct io_uring_event { >> + __u64 index; /* what iocb this event came from */ >> + __s32 res; /* result code for this event */ >> + __u32 flags; >> +}; > > io_uring_cqe? I'm fine with that, I like the symmetry of the names. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
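A note on the store ordering barrier mentioned in the reply above: the completion entry has to be fully written before the tail update that publishes it can become visible to the consumer. What follows is only a minimal userspace sketch using C11 atomics, not the kernel code from this series. `struct cq_ring`, `cq_post()` and `cq_reap()` are hypothetical names, the ring size is an arbitrary assumption, and the kernel would use its own barrier primitives rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CQ_ENTRIES 8u	/* hypothetical ring size, must be a power of two */

static_assert((CQ_ENTRIES & (CQ_ENTRIES - 1)) == 0, "ring size must be 2^n");

struct cqe {		/* same fields as io_uring_event in the posting */
	uint64_t index;
	int32_t res;
	uint32_t flags;
};

struct cq_ring {
	_Atomic uint32_t head;	/* advanced by the consumer */
	_Atomic uint32_t tail;	/* advanced by the producer */
	struct cqe cqes[CQ_ENTRIES];
};

/* Producer: fill the slot first, then publish it with a release store. */
int cq_post(struct cq_ring *ring, uint64_t index, int32_t res)
{
	uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&ring->head, memory_order_acquire);

	if (tail - head == CQ_ENTRIES)
		return -1;	/* ring is full */

	struct cqe *ev = &ring->cqes[tail & (CQ_ENTRIES - 1)];
	ev->index = index;
	ev->res = res;
	ev->flags = 0;

	/* The release store orders the writes to *ev before the new tail. */
	atomic_store_explicit(&ring->tail, tail + 1, memory_order_release);
	return 0;
}

/* Consumer: the acquire load of tail pairs with the release store above. */
int cq_reap(struct cq_ring *ring, struct cqe *out)
{
	uint32_t head = atomic_load_explicit(&ring->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_acquire);

	if (head == tail)
		return 0;	/* nothing to reap */

	*out = ring->cqes[head & (CQ_ENTRIES - 1)];
	atomic_store_explicit(&ring->head, head + 1, memory_order_release);
	return 1;
}
```

Without the release ordering on the tail store, a consumer on another CPU could observe the new tail and read a slot whose fields have not been written yet.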
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-09 15:53 ` Jens Axboe @ 2019-01-09 18:30 ` Christoph Hellwig 2019-01-09 20:07 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 18:30 UTC (permalink / raw) To: Jens Axboe Cc: Christoph Hellwig, linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On Wed, Jan 09, 2019 at 08:53:31AM -0700, Jens Axboe wrote: > >> +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, > >> + struct iovec **iovec, struct iov_iter *iter) > >> +{ > >> + void __user *buf = (void __user *)(uintptr_t)iocb->addr; > >> + size_t ret; > >> + > >> + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); > >> + *iovec = NULL; > >> + return ret; > >> +} > > > > Is there any point in supporting non-vectored operations here? > > Not sure I follow? This version only supports non-vectored read and write, that is the equivalent of pread/pwrite. Many AIO users really need vectored operations, that is preadv/pwritev semantics indirecting through a struct iovec array. The non-vectored version can be trivially emulated using a vector of 1, which is what we do in the kernel I/O stack everywhere. So I think we should just support the vectored version here, and not the non-vectored one. See my io_uring branch for the sketched implementation.
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-09 20:07 ` Jens Axboe @ 2019-01-09 20:07 ` Jens Axboe 0 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-09 20:07 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 11:30 AM, Christoph Hellwig wrote: > On Wed, Jan 09, 2019 at 08:53:31AM -0700, Jens Axboe wrote: >>>> +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, >>>> + struct iovec **iovec, struct iov_iter *iter) >>>> +{ >>>> + void __user *buf = (void __user *)(uintptr_t)iocb->addr; >>>> + size_t ret; >>>> + >>>> + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); >>>> + *iovec = NULL; >>>> + return ret; >>>> +} >>> >>> Is there any point in supporting non-vectored operations here? >> >> Not sure I follow? > > This version only supports non-vectored read and write, that is > the equivalent of pread/pwrite. Many AIO users really need vectored > operations, that is preadv/pwritev semantics indirecting through > a struct iovec array. The non-vectored version can be trivially > emulated using a vector of 1, which is what we do in the kernel > I/O stack everywhere. So I think we should just support the vectored > version here, and not the non-vectored one. See my io_uring branch > for the sketched implementation. OK, I see what you mean, so only support the vectored version. Probably makes more sense, I'll make the change. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
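The point made in this sub-thread, that a non-vectored read or write is just the vectored form with a vector of 1, can be seen from userspace with preadv(2) and pwritev(2). This is a small sketch only; the helper names `pread_via_vec()`, `pwrite_via_vec()` and `vec_roundtrip_demo()` are made up for illustration and are not part of the patchset or of liburing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* pread(2)-style call expressed as preadv(2) with a single iovec. */
ssize_t pread_via_vec(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	return preadv(fd, &iov, 1, off);
}

/* pwrite(2)-style call expressed as pwritev(2) with a single iovec. */
ssize_t pwrite_via_vec(int fd, const void *buf, size_t len, off_t off)
{
	/* struct iovec has no const variant, so the cast is unavoidable. */
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	return pwritev(fd, &iov, 1, off);
}

/* Round-trip through a scratch file; returns 0 if the data survives. */
int vec_roundtrip_demo(void)
{
	char path[] = "/tmp/vec_demo_XXXXXX";
	const char msg[] = "io_uring";
	char buf[sizeof(msg)] = { 0 };
	int fd = mkstemp(path);
	int ok;

	if (fd < 0)
		return -1;
	unlink(path);	/* the file stays usable until fd is closed */

	ok = pwrite_via_vec(fd, msg, sizeof(msg), 0) == (ssize_t)sizeof(msg) &&
	     pread_via_vec(fd, buf, sizeof(buf), 0) == (ssize_t)sizeof(buf) &&
	     memcmp(msg, buf, sizeof(msg)) == 0;
	close(fd);
	return ok ? 0 : -1;
}
```

An interface that only accepts an iovec array therefore loses nothing; callers with a single buffer pay for one small on-stack structure.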
* [PATCH 06/16] io_uring: support for IO polling 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (4 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 05/16] Add io_uring IO interface Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-09 12:11 ` Christoph Hellwig 2019-01-08 16:56 ` [PATCH 07/16] io_uring: add submission side request cache Jens Axboe ` (10 subsequent siblings) 16 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe Add polled variants of the read and write commands. These act like their non-polled counterparts, except we expect to poll for completion of them. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 227 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 10 +- 2 files changed, 227 insertions(+), 10 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index ae2b886282bb..02eab2f42c63 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -76,7 +76,14 @@ struct io_ring_ctx { struct work_struct work; + /* iopoll submission state */ struct { + spinlock_t poll_lock; + struct list_head poll_submitted; + } ____cacheline_aligned_in_smp; + + struct { + struct list_head poll_completing; struct mutex uring_lock; } ____cacheline_aligned_in_smp; @@ -106,10 +113,14 @@ struct io_kiocb { unsigned long ki_index; struct list_head ki_list; unsigned long ki_flags; +#define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ +#define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 + static struct kmem_cache *kiocb_cachep, *ioctx_cachep; static const struct file_operations io_scqring_fops; @@ -138,6 +149,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct 
io_uring_params *p) spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); + spin_lock_init(&ctx->poll_lock); + INIT_LIST_HEAD(&ctx->poll_submitted); + INIT_LIST_HEAD(&ctx->poll_completing); mutex_init(&ctx->uring_lock); return ctx; @@ -185,6 +199,15 @@ static inline void iocb_put(struct io_kiocb *iocb) kmem_cache_free(kiocb_cachep, iocb); } +static void iocb_put_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) +{ + if (*nr) { + percpu_ref_put_many(&ctx->refs, *nr); + kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs); + *nr = 0; + } +} + static void io_complete_iocb(struct io_ring_ctx *ctx, struct io_kiocb *iocb) { if (waitqueue_active(&ctx->wait)) @@ -192,6 +215,134 @@ static void io_complete_iocb(struct io_ring_ctx *ctx, struct io_kiocb *iocb) iocb_put(iocb); } +/* + * Find and free completed poll iocbs + */ +static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) +{ + void *iocbs[IO_IOPOLL_BATCH]; + struct io_kiocb *iocb, *n; + int to_free = 0; + + list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { + if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + continue; + if (to_free == ARRAY_SIZE(iocbs)) + iocb_put_many(ctx, iocbs, &to_free); + + list_del(&iocb->ki_list); + iocbs[to_free++] = iocb; + + fput(iocb->rw.ki_filp); + (*nr_events)++; + } + + if (to_free) + iocb_put_many(ctx, iocbs, &to_free); +} + +/* + * Poll for a mininum of 'min' events, and a maximum of 'max'. Note that if + * min == 0 we consider that a non-spinning poll check - we'll still enter + * the driver poll loop, but only as a non-spinning completion check. 
+ */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + struct io_kiocb *iocb; + int found, polled, ret; + + /* + * Check if we already have done events that satisfy what we need + */ + if (!list_empty(&ctx->poll_completing)) { + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + } + + /* + * Take in a new working set from the submitted list, if possible. + */ + if (!list_empty_careful(&ctx->poll_submitted)) { + spin_lock(&ctx->poll_lock); + list_splice_init(&ctx->poll_submitted, &ctx->poll_completing); + spin_unlock(&ctx->poll_lock); + } + + if (list_empty(&ctx->poll_completing)) + return 0; + + /* + * Check again now that we have a new batch. + */ + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + + polled = found = 0; + list_for_each_entry(iocb, &ctx->poll_completing, ki_list) { + /* + * Poll for needed events with spin == true, anything after + * that we just check if we have more, up to max. + */ + bool spin = !polled || *nr_events < min; + struct kiocb *kiocb = &iocb->rw; + + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + break; + + found++; + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + return ret; + + polled += ret; + } + + io_iopoll_reap(ctx, nr_events); + if (*nr_events >= min) + return 0; + return found; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them. 
+ */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + while (!list_empty_careful(&ctx->poll_submitted) || + !list_empty(&ctx->poll_completing)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + while (!*nr_events || !need_resched()) { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -253,8 +404,23 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) io_complete_scqring(iocb, res, 0); } +static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + if (unlikely(res == -EAGAIN)) { + set_bit(KIOCB_F_IOPOLL_EAGAIN, &iocb->ki_flags); + } else { + io_cqring_fill_event(iocb, res, 0); + set_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags); + } +} + static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) { + struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; @@ -277,9 +443,19 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) if (unlikely(ret)) goto out_fput; - /* no one is going to poll for this I/O */ - req->ki_flags &= ~IOCB_HIPRI; - req->ki_complete = io_complete_scqring_rw; + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(req->ki_flags & IOCB_DIRECT) || + !req->ki_filp->f_op->iopoll) + goto out_fput; + + req->ki_flags |= IOCB_HIPRI; + req->ki_complete = io_complete_scqring_iopoll; + } else { + /* no one is going to poll for this I/O */ + req->ki_flags &= ~IOCB_HIPRI; + req->ki_complete = 
io_complete_scqring_rw; + } return 0; out_fput: fput(req->ki_filp); @@ -317,6 +493,30 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } } +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. + */ +static void io_iopoll_iocb_issued(struct io_kiocb *kiocb) +{ + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. We can't add to the poll_done + * list as that's unlocked from the completion side. + */ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + + spin_lock(&ctx->poll_lock); + if (front) + list_add(&kiocb->ki_list, &ctx->poll_submitted); + else + list_add_tail(&kiocb->ki_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -459,9 +659,13 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = io_write(req, iocb); break; case IORING_OP_FSYNC: + if (ctx->flags & IORING_SETUP_IOPOLL) + break; ret = io_fsync(&req->fsync, iocb, false); break; case IORING_OP_FDSYNC: + if (ctx->flags & IORING_SETUP_IOPOLL) + break; ret = io_fsync(&req->fsync, iocb, true); break; default: @@ -475,6 +679,13 @@ static int __io_submit_one(struct io_ring_ctx *ctx, */ if (ret) goto out_put_req; + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (test_bit(KIOCB_F_IOPOLL_EAGAIN, &req->ki_flags)) { + ret = -EAGAIN; + goto out_put_req; + } + io_iopoll_iocb_issued(req); + } return 0; out_put_req: iocb_put(req); @@ -589,12 +800,17 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; int get_ret; if (!ret 
&& to_submit) min_complete = 0; - get_ret = io_cqring_wait(ctx, min_complete); + if (ctx->flags & IORING_SETUP_IOPOLL) + get_ret = io_iopoll_check(ctx, &nr_events, + min_complete); + else + get_ret = io_cqring_wait(ctx, min_complete); if (get_ret < 0 && !ret) ret = get_ret; } @@ -622,6 +838,7 @@ static void io_ring_ctx_free(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, work); + io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); percpu_ref_exit(&ctx->refs); kmem_cache_free(ioctx_cachep, ctx); @@ -825,7 +1042,7 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; if (iovecs) return -EINVAL; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index c31ac84d9f53..f7ba30747816 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -31,6 +31,11 @@ struct io_uring_iocb { }; }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ + #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 #define IORING_OP_FSYNC 3 @@ -45,11 +50,6 @@ struct io_uring_event { __u32 flags; }; -/* - * io_uring_event->flags - */ -#define IOEV_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ - /* * Magic offsets for the application to mmap the data it needs */ -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 06/16] io_uring: support for IO polling 2019-01-08 16:56 ` [PATCH 06/16] io_uring: support for IO polling Jens Axboe @ 2019-01-09 12:11 ` Christoph Hellwig 2019-01-09 15:53 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 12:11 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On Tue, Jan 08, 2019 at 09:56:35AM -0700, Jens Axboe wrote: > Add polled variants of the read and write commands. These act like their > non-polled counterparts, except we expect to poll for completion of > them. These aren't really new command variants, but a different type of context. > case IORING_OP_FSYNC: > + if (ctx->flags & IORING_SETUP_IOPOLL) > + break; > ret = io_fsync(&req->fsync, iocb, false); > break; > case IORING_OP_FDSYNC: > + if (ctx->flags & IORING_SETUP_IOPOLL) > + break; > ret = io_fsync(&req->fsync, iocb, true); I'd move this check into io_fsync to avoid a little duplication. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 06/16] io_uring: support for IO polling 2019-01-09 12:11 ` Christoph Hellwig @ 2019-01-09 15:53 ` Jens Axboe 0 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-09 15:53 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 5:11 AM, Christoph Hellwig wrote: > On Tue, Jan 08, 2019 at 09:56:35AM -0700, Jens Axboe wrote: >> Add polled variants of the read and write commands. These act like their >> non-polled counterparts, except we expect to poll for completion of >> them. > > These aren't really new command variants, but a different type of context. That is phrased poorly, I'll fix that. >> case IORING_OP_FSYNC: >> + if (ctx->flags & IORING_SETUP_IOPOLL) >> + break; >> ret = io_fsync(&req->fsync, iocb, false); >> break; >> case IORING_OP_FDSYNC: >> + if (ctx->flags & IORING_SETUP_IOPOLL) >> + break; >> ret = io_fsync(&req->fsync, iocb, true); > > I'd move this check into io_fsync to avoid a little duplication. Done -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
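For readers following the polling patch, the reaping step in io_iopoll_reap() walks the list of in-flight requests, unlinks the ones flagged as completed, and releases them in fixed-size batches rather than one at a time. The sketch below mimics only that shape in plain userspace C. It uses a simple singly linked list instead of the kernel's list_head, counters instead of real frees, and every name in it (`struct req`, `struct reaper`, `reap_completed()`, `REAP_BATCH`) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REAP_BATCH 8	/* plays the role of IO_IOPOLL_BATCH */

struct req {
	struct req *next;
	bool completed;	/* stands in for KIOCB_F_IOPOLL_COMPLETED */
};

struct reaper {
	unsigned int freed;	/* requests released so far */
	unsigned int flushes;	/* number of bulk releases performed */
};

/* Release a whole batch at once, in the spirit of iocb_put_many(). */
static void put_many(struct reaper *r, struct req **batch, int *nr)
{
	assert(*nr <= REAP_BATCH);
	for (int i = 0; i < *nr; i++)
		batch[i]->next = NULL;	/* a real version would free here */
	if (*nr) {
		r->freed += (unsigned int)*nr;
		r->flushes++;
	}
	*nr = 0;
}

/* Unlink completed requests from *list; returns how many were reaped. */
unsigned int reap_completed(struct reaper *r, struct req **list)
{
	struct req *batch[REAP_BATCH];
	struct req **link = list;
	unsigned int events = 0;
	int to_free = 0;

	while (*link) {
		struct req *cur = *link;

		if (!cur->completed) {
			link = &cur->next;	/* still in flight, skip it */
			continue;
		}
		if (to_free == REAP_BATCH)
			put_many(r, batch, &to_free);

		*link = cur->next;	/* unlink from the list */
		batch[to_free++] = cur;
		events++;
	}

	put_many(r, batch, &to_free);
	return events;
}
```

With ten completed requests on the list this performs two bulk releases (eight, then two), which is the saving the batch array is there to provide.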
* [PATCH 07/16] io_uring: add submission side request cache 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (5 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 06/16] io_uring: support for IO polling Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 08/16] fs: add fget_many() and fput_many() Jens Axboe ` (9 subsequent siblings) 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe We have to add each submitted polled request to the io_context poll_submitted list, which means we have to grab the poll_lock. We already use the block plug to batch submissions if we're doing a batch of IO submissions, extend that to cover the poll requests internally as well. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 122 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 106 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 02eab2f42c63..9f36eb728208 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -121,6 +121,21 @@ struct io_kiocb { #define IO_IOPOLL_BATCH 8 +struct io_submit_state { + struct io_ring_ctx *ctx; + + struct blk_plug plug; +#ifdef CONFIG_BLOCK + struct blk_plug_cb plug_cb; +#endif + + /* + * Polled iocbs that have been submitted, but not added to the ctx yet + */ + struct list_head req_list; + unsigned int req_count; +}; + static struct kmem_cache *kiocb_cachep, *ioctx_cachep; static const struct file_operations io_scqring_fops; @@ -494,21 +509,29 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } /* - * After the iocb has been issued, it's safe to be found on the poll list. - * Adding the kiocb to the list AFTER submission ensures that we don't - * find it from a io_getevents() thread before the issuer is done accessing - * the kiocb cookie. 
+ * Called either at the end of IO submission, or through a plug callback + * because we're going to schedule. Moves out local batch of requests to + * the ctx poll list, so they can be found for polling + reaping. */ -static void io_iopoll_iocb_issued(struct io_kiocb *kiocb) +static void io_flush_state_reqs(struct io_ring_ctx *ctx, + struct io_submit_state *state) { + spin_lock(&ctx->poll_lock); + list_splice_tail_init(&state->req_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + state->req_count = 0; +} + +static void io_iopoll_iocb_add_list(struct io_kiocb *kiocb) +{ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + /* * For fast devices, IO may have already completed. If it has, add * it to the front so we find it first. We can't add to the poll_done * list as that's unlocked from the completion side. */ - const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); - struct io_ring_ctx *ctx = kiocb->ki_ctx; - spin_lock(&ctx->poll_lock); if (front) list_add(&kiocb->ki_list, &ctx->poll_submitted); @@ -517,6 +540,33 @@ static void io_iopoll_iocb_issued(struct io_kiocb *kiocb) spin_unlock(&ctx->poll_lock); } +static void io_iopoll_iocb_add_state(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags)) + list_add(&kiocb->ki_list, &state->req_list); + else + list_add_tail(&kiocb->ki_list, &state->req_list); + + if (++state->req_count >= IO_IOPOLL_BATCH) + io_flush_state_reqs(state->ctx, state); +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. 
+ */ +static void io_iopoll_iocb_issued(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (!state || !IS_ENABLED(CONFIG_BLOCK)) + io_iopoll_iocb_add_list(kiocb); + else + io_iopoll_iocb_add_state(state, kiocb); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -632,7 +682,8 @@ static int io_fsync(struct fsync_iocb *req, const struct io_uring_iocb *iocb, static int __io_submit_one(struct io_ring_ctx *ctx, const struct io_uring_iocb *iocb, - unsigned long ki_index) + unsigned long ki_index, + struct io_submit_state *state) { struct io_kiocb *req; ssize_t ret; @@ -684,7 +735,7 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EAGAIN; goto out_put_req; } - io_iopoll_iocb_issued(req); + io_iopoll_iocb_issued(state, req); } return 0; out_put_req: @@ -692,6 +743,43 @@ static int __io_submit_one(struct io_ring_ctx *ctx, return ret; } +#ifdef CONFIG_BLOCK +static void io_state_unplug(struct blk_plug_cb *cb, bool from_schedule) +{ + struct io_submit_state *state; + + state = container_of(cb, struct io_submit_state, plug_cb); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} +#endif + +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} + +/* + * Start submission side cache. 
+ */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx) +{ + state->ctx = ctx; + INIT_LIST_HEAD(&state->req_list); + state->req_count = 0; +#ifdef CONFIG_BLOCK + state->plug_cb.callback = io_state_unplug; + blk_start_plug(&state->plug); + list_add(&state->plug_cb.list, &state->plug.cb_list); +#endif +} + static void io_inc_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring.ring; @@ -726,11 +814,13 @@ static const struct io_uring_iocb *io_peek_sqring(struct io_ring_ctx *ctx, static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx); + statep = &state; + } for (i = 0; i < to_submit; i++) { const struct io_uring_iocb *iocb; @@ -740,7 +830,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!iocb) break; - ret = __io_submit_one(ctx, iocb, iocb_index); + ret = __io_submit_one(ctx, iocb, iocb_index, statep); if (ret) break; @@ -748,8 +838,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) io_inc_sqring(ctx); } - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? submit : ret; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
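The idea behind the submission side cache in this patch is to collect polled requests on a list private to the submitter and splice them onto the shared list once per batch, so the lock is taken once per batch instead of once per request. The following is a rough userspace sketch of that pattern, with an atomic_flag spinlock standing in for ctx->poll_lock. All names here (`struct poll_ctx`, `struct submit_state`, `state_start()`, `state_add()`, `state_flush()`, `SUBMIT_BATCH`) are hypothetical, and the block plug callback that flushes when the task is about to schedule is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SUBMIT_BATCH 8	/* plays the role of IO_IOPOLL_BATCH */

struct req {
	struct req *next;
};

struct poll_ctx {
	atomic_flag lock;		/* stands in for ctx->poll_lock */
	struct req *head, *tail;	/* shared "poll_submitted" list */
	unsigned int lock_taken;	/* how often the lock was acquired */
};

struct submit_state {
	struct poll_ctx *ctx;
	struct req *head, *tail;	/* private to the submitter */
	unsigned int count;
};

void state_start(struct submit_state *st, struct poll_ctx *ctx)
{
	st->ctx = ctx;
	st->head = st->tail = NULL;
	st->count = 0;
}

/* Splice the private list onto the shared one under a single lock hold. */
void state_flush(struct submit_state *st)
{
	struct poll_ctx *ctx = st->ctx;

	assert(ctx != NULL);
	if (!st->head)
		return;

	while (atomic_flag_test_and_set_explicit(&ctx->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	ctx->lock_taken++;
	if (ctx->tail)
		ctx->tail->next = st->head;
	else
		ctx->head = st->head;
	ctx->tail = st->tail;
	atomic_flag_clear_explicit(&ctx->lock, memory_order_release);

	st->head = st->tail = NULL;
	st->count = 0;
}

/* Queue one request locally, flushing once a full batch has built up. */
void state_add(struct submit_state *st, struct req *rq)
{
	rq->next = NULL;
	if (st->tail)
		st->tail->next = rq;
	else
		st->head = rq;
	st->tail = rq;

	if (++st->count >= SUBMIT_BATCH)
		state_flush(st);
}
```

Submitting 20 requests this way takes the lock three times (after the 8th and 16th request, and once in the final flush) instead of 20 times.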
* [PATCH 08/16] fs: add fget_many() and fput_many() 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (6 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 07/16] io_uring: add submission side request cache Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 09/16] io_uring: use fget/fput_many() for file references Jens Axboe ` (8 subsequent siblings) 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe Some use cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at a time. As each of these entails an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..e0d7ce70e860 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned
int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h 
b/include/linux/fs.h index ccb0b7a63aa5..acaad78b6781 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count) -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
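For readers following along outside the kernel tree, the idea behind fget_many()/fput_many() can be sketched in userspace with C11 atomics. This is only an illustration under assumptions, not the kernel code: struct obj, obj_get_many() and obj_put_many() are hypothetical names, and the compare-exchange loop stands in for the add-unless-zero semantics of atomic_long_add_unless(&f_count, cnt, 0).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct file; only the refcount matters here. */
struct obj {
	atomic_long count;
};

/*
 * Take 'refs' references with a single successful atomic update, but only
 * if the object is still live (count != 0). This mirrors the add-unless-zero
 * idea used by get_file_rcu_many() in the patch.
 */
static bool obj_get_many(struct obj *o, long refs)
{
	long old = atomic_load(&o->count);

	do {
		if (old == 0)
			return false;	/* object is already being torn down */
	} while (!atomic_compare_exchange_weak(&o->count, &old, old + refs));

	return true;
}

/*
 * Drop 'refs' references with one atomic subtraction. Returns true when
 * this dropped the last reference, i.e. the caller must free the object.
 */
static bool obj_put_many(struct obj *o, long refs)
{
	long old = atomic_fetch_sub(&o->count, refs);

	assert(old >= refs);	/* dropping more than is held is a bug */
	return old == refs;
}
```

Taking N references this way dirties the shared counter once instead of N times, which is the cost the commit message is trying to avoid.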
* [PATCH 09/16] io_uring: use fget/fput_many() for file references 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (7 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 08/16] fs: add fget_many() and fput_many() Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 10/16] io_uring: split kiocb init from allocation Jens Axboe ` (7 subsequent siblings) 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe On the submission side, add file reference batching to the io_submit_state. We get as many references as the number of iocbs we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefully they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). 
Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 105 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 92 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 9f36eb728208..afbaebb63012 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -134,6 +134,15 @@ struct io_submit_state { */ struct list_head req_list; unsigned int req_count; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; }; static struct kmem_cache *kiocb_cachep, *ioctx_cachep; @@ -237,7 +246,8 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) { void *iocbs[IO_IOPOLL_BATCH]; struct io_kiocb *iocb, *n; - int to_free = 0; + int file_count, to_free = 0; + struct file *file = NULL; list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) @@ -248,10 +258,27 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) list_del(&iocb->ki_list); iocbs[to_free++] = iocb; - fput(iocb->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable. 
+ */ + if (!file) { + file = iocb->rw.ki_filp; + file_count = 1; + } else if (file == iocb->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = iocb->rw.ki_filp; + file_count = 1; + } + (*nr_events)++; } + if (file) + fput_many(file, file_count); + if (to_free) iocb_put_many(ctx, iocbs, &to_free); } @@ -433,13 +460,60 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) } } -static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. + */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (!state->file) { +get_file: + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; + } + + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + + io_file_put(state, NULL); + goto get_file; +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, + struct io_submit_state *state) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; - req->ki_filp = fget(iocb->fd); + req->ki_filp = io_file_get(state, iocb->fd); if (unlikely(!req->ki_filp)) return -EBADF; req->ki_pos = iocb->off; @@ -473,7 +547,7 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) } return 0; out_fput: - fput(req->ki_filp); + io_file_put(state, 
req->ki_filp); return ret; } @@ -567,7 +641,8 @@ static void io_iopoll_iocb_issued(struct io_submit_state *state, io_iopoll_iocb_add_state(state, kiocb); } -static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -575,7 +650,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb); + ret = io_prep_rw(kiocb, iocb, state); if (ret) return ret; file = req->ki_filp; @@ -602,7 +677,8 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) } static ssize_t io_write(struct io_kiocb *kiocb, - const struct io_uring_iocb *iocb) + const struct io_uring_iocb *iocb, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -610,7 +686,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb); + ret = io_prep_rw(kiocb, iocb, state); if (ret) return ret; file = req->ki_filp; @@ -704,10 +780,10 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EINVAL; switch (iocb->opcode) { case IORING_OP_READ: - ret = io_read(req, iocb); + ret = io_read(req, iocb, state); break; case IORING_OP_WRITE: - ret = io_write(req, iocb); + ret = io_write(req, iocb, state); break; case IORING_OP_FSYNC: if (ctx->flags & IORING_SETUP_IOPOLL) @@ -762,17 +838,20 @@ static void io_submit_state_end(struct io_submit_state *state) blk_finish_plug(&state->plug); if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); + io_file_put(state, NULL); } /* * Start submission side cache. 
*/ static void io_submit_state_start(struct io_submit_state *state, - struct io_ring_ctx *ctx) + struct io_ring_ctx *ctx, unsigned max_ios) { state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->file = NULL; + state->ios_left = max_ios; #ifdef CONFIG_BLOCK state->plug_cb.callback = io_state_unplug; blk_start_plug(&state->plug); @@ -818,7 +897,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) int i, ret = 0, submit = 0; if (to_submit > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx); + io_submit_state_start(&state, ctx, to_submit); statep = &state; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
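The submission-side file reference cache in the patch above can be modelled in plain userspace C. The sketch below is a simplified illustration under assumptions, not the kernel implementation: fake_file, fd_table, fake_fget_many(), fake_fput_many(), state_file_get() and state_file_put() are hypothetical names, the refcount is a plain long rather than an atomic, and the goto-based flow of io_file_get() is restructured into straight-line code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDS 4

/* Hypothetical stand-in for struct file: just a plain reference count. */
struct fake_file {
	long count;
};

static struct fake_file fd_table[MAX_FDS];

/* Grab 'refs' references on the file behind 'fd' in one step. */
static struct fake_file *fake_fget_many(int fd, unsigned int refs)
{
	if (fd < 0 || fd >= MAX_FDS)
		return NULL;
	fd_table[fd].count += refs;
	return &fd_table[fd];
}

/* Drop 'refs' references in one step. */
static void fake_fput_many(struct fake_file *f, unsigned int refs)
{
	assert(f->count >= (long)refs);
	f->count -= refs;
}

/* The file-reference part of the submission state. */
struct submit_state {
	struct fake_file *file;	/* currently cached file, if any */
	int fd;			/* fd the cached file was looked up from */
	unsigned int has_refs;	/* references taken up front */
	unsigned int used_refs;	/* references handed out to requests */
	unsigned int ios_left;	/* requests remaining in this submission */
};

/* Return the references that were taken speculatively but never used. */
static void state_file_put(struct submit_state *s)
{
	if (s->file) {
		unsigned int unused = s->has_refs - s->used_refs;

		if (unused)
			fake_fput_many(s->file, unused);
		s->file = NULL;
	}
}

/*
 * Hand out one reference for 'fd'. On a cache miss, take one reference for
 * every request still to be submitted, betting they all target this file.
 */
static struct fake_file *state_file_get(struct submit_state *s, int fd)
{
	if (s->file && s->fd != fd)
		state_file_put(s);	/* switching files: drop the leftovers */

	if (!s->file) {
		s->file = fake_fget_many(fd, s->ios_left);
		if (!s->file)
			return NULL;
		s->fd = fd;
		s->has_refs = s->ios_left;
		s->used_refs = 0;
	}

	s->used_refs++;
	s->ios_left--;
	return s->file;
}
```

With four requests queued, two against fd 0 followed by one against fd 1, the model ends with exactly one reference per issued request once state_file_put() runs at the end of the submission, which is the invariant the has_refs/used_refs bookkeeping exists to maintain.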
* [PATCH 10/16] io_uring: split kiocb init from allocation 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (8 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 09/16] io_uring: use fget/fput_many() for file references Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-09 12:12 ` Christoph Hellwig 2019-01-08 16:56 ` [PATCH 11/16] io_uring: batch io_kiocb allocation Jens Axboe ` (6 subsequent siblings) 16 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe In preparation from having pre-allocated requests, that we then just need to initialize before use. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index afbaebb63012..11d045f0f799 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -202,6 +202,14 @@ static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) return &ring->events[tail & ctx->cq_ring.ring_mask]; } +static void io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + percpu_ref_get(&ctx->refs); + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; +} + static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) { struct io_kiocb *req; @@ -210,10 +218,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) if (!req) return NULL; - percpu_ref_get(&ctx->refs); - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; + io_req_init(ctx, req); return req; } -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 10/16] io_uring: split kiocb init from allocation 2019-01-08 16:56 ` [PATCH 10/16] io_uring: split kiocb init from allocation Jens Axboe @ 2019-01-09 12:12 ` Christoph Hellwig 2019-01-09 16:56 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 12:12 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On Tue, Jan 08, 2019 at 09:56:39AM -0700, Jens Axboe wrote: > In preparation from having pre-allocated requests, that we then just > need to initialize before use. > > Signed-off-by: Jens Axboe <axboe@kernel.dk> > --- > fs/io_uring.c | 13 +++++++++---- > 1 file changed, 9 insertions(+), 4 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index afbaebb63012..11d045f0f799 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -202,6 +202,14 @@ static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) > return &ring->events[tail & ctx->cq_ring.ring_mask]; > } > > +static void io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) > +{ > + percpu_ref_get(&ctx->refs); > + req->ki_ctx = ctx; > + INIT_LIST_HEAD(&req->ki_list); > + req->ki_flags = 0; We still only have a single caller of this in the final tree, and I don't think this function helps. So I'd just drop the patch. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 10/16] io_uring: split kiocb init from allocation 2019-01-09 12:12 ` Christoph Hellwig @ 2019-01-09 16:56 ` Jens Axboe 0 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-09 16:56 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 5:12 AM, Christoph Hellwig wrote: > On Tue, Jan 08, 2019 at 09:56:39AM -0700, Jens Axboe wrote: >> In preparation from having pre-allocated requests, that we then just >> need to initialize before use. >> >> Signed-off-by: Jens Axboe <axboe@kernel.dk> >> --- >> fs/io_uring.c | 13 +++++++++---- >> 1 file changed, 9 insertions(+), 4 deletions(-) >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index afbaebb63012..11d045f0f799 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -202,6 +202,14 @@ static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) >> return &ring->events[tail & ctx->cq_ring.ring_mask]; >> } >> >> +static void io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) >> +{ >> + percpu_ref_get(&ctx->refs); >> + req->ki_ctx = ctx; >> + INIT_LIST_HEAD(&req->ki_list); >> + req->ki_flags = 0; > > We still only have a single caller of this in the final tree, and > I don't think this function helps. So I'd just drop the patch. Dropped -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 11/16] io_uring: batch io_kiocb allocation 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (9 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 10/16] io_uring: split kiocb init from allocation Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-09 12:13 ` Christoph Hellwig 2019-01-08 16:56 ` [PATCH 12/16] block: implement bio helper to add iter bvec pages to bio Jens Axboe ` (5 subsequent siblings) 16 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 41 +++++++++++++++++++++++++++++++++++------ 1 file changed, 35 insertions(+), 6 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 11d045f0f799..62778d7ffb8d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -135,6 +135,13 @@ struct io_submit_state { struct list_head req_list; unsigned int req_count; + /* + * io_kiocb alloc cache + */ + void *iocbs[IO_IOPOLL_BATCH]; + unsigned int free_iocbs; + unsigned int cur_iocb; + /* * File reference cache */ @@ -210,15 +217,33 @@ static void io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) req->ki_flags = 0; } -static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) { struct io_kiocb *req; - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); - if (!req) - return NULL; + if (!state) + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + else if (!state->free_iocbs) { + size_t size; + int ret; + + size = min_t(size_t, state->ios_left, ARRAY_SIZE(state->iocbs)); + ret = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, size, + state->iocbs); + if (ret <= 0) + return ERR_PTR(-ENOMEM); + 
state->free_iocbs = ret - 1; + state->cur_iocb = 1; + req = state->iocbs[0]; + } else { + req = state->iocbs[state->cur_iocb]; + state->free_iocbs--; + state->cur_iocb++; + } - io_req_init(ctx, req); + if (req) + io_req_init(ctx, req); return req; } @@ -773,7 +798,7 @@ static int __io_submit_one(struct io_ring_ctx *ctx, if (unlikely(iocb->flags)) return -EINVAL; - req = io_get_req(ctx); + req = io_get_req(ctx, state); if (unlikely(!req)) return -EAGAIN; @@ -844,6 +869,9 @@ static void io_submit_state_end(struct io_submit_state *state) if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); io_file_put(state, NULL); + if (state->free_iocbs) + kmem_cache_free_bulk(kiocb_cachep, state->free_iocbs, + &state->iocbs[state->cur_iocb]); } /* @@ -855,6 +883,7 @@ static void io_submit_state_start(struct io_submit_state *state, state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->free_iocbs = 0; state->file = NULL; state->ios_left = max_ios; #ifdef CONFIG_BLOCK -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
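The allocation cache added by this patch follows the same pattern as the file reference cache: allocate for all remaining requests at once, hand objects out one by one, and free whatever is left at the end of the submission. A userspace sketch of that pattern follows. It is an illustration under assumptions only: bulk_alloc() is a hypothetical malloc()-based stand-in for kmem_cache_alloc_bulk(), the names alloc_cache, cache_get() and cache_end() are made up, and ios_left is decremented inside cache_get() for self-containment, whereas in the patch that counter is maintained by io_file_get().

```c
#include <assert.h>
#include <stdlib.h>

#define BATCH 8

/* Hypothetical request object. */
struct req {
	int id;
};

struct alloc_cache {
	void *objs[BATCH];	/* pre-allocated objects */
	unsigned int free_objs;	/* how many of them are still unused */
	unsigned int cur;	/* index of the next unused object */
	unsigned int ios_left;	/* requests remaining in this submission */
};

/* Hypothetical all-or-nothing bulk allocator; returns the number allocated. */
static int bulk_alloc(size_t nr, void **out)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		out[i] = malloc(sizeof(struct req));
		if (!out[i]) {
			while (i--)
				free(out[i]);
			return 0;
		}
	}
	return (int)nr;
}

/* Hand out one object, refilling the cache in bulk when it runs dry. */
static struct req *cache_get(struct alloc_cache *c)
{
	struct req *r;

	if (!c->free_objs) {
		size_t want = c->ios_left < BATCH ? c->ios_left : BATCH;
		int got = bulk_alloc(want, c->objs);

		if (got <= 0)
			return NULL;
		c->free_objs = got;
		c->cur = 0;
	}

	r = c->objs[c->cur++];
	c->free_objs--;
	c->ios_left--;
	return r;
}

/* End of submission: free the objects never handed out; returns the count. */
static unsigned int cache_end(struct alloc_cache *c)
{
	unsigned int unused = c->free_objs;

	while (c->free_objs) {
		free(c->objs[c->cur++]);
		c->free_objs--;
	}
	return unused;
}
```

Sizing the refill by min(ios_left, BATCH) keeps the over-allocation bounded: at most BATCH - 1 objects can be left over and need freeing when a submission ends early.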
* Re: [PATCH 11/16] io_uring: batch io_kiocb allocation 2019-01-08 16:56 ` [PATCH 11/16] io_uring: batch io_kiocb allocation Jens Axboe @ 2019-01-09 12:13 ` Christoph Hellwig 2019-01-09 16:57 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 12:13 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi > + if (!state) > + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); Just return an error here if kmem_cache_alloc fails. > + if (req) > + io_req_init(ctx, req); Because all the other ones can't reached this with a NULL req. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 11/16] io_uring: batch io_kiocb allocation 2019-01-09 12:13 ` Christoph Hellwig @ 2019-01-09 16:57 ` Jens Axboe 2019-01-09 19:03 ` Christoph Hellwig 0 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-09 16:57 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 5:13 AM, Christoph Hellwig wrote: >> + if (!state) >> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); > > Just return an error here if kmem_cache_alloc fails. > >> + if (req) >> + io_req_init(ctx, req); > > Because all the other ones can't reached this with a NULL req. This is different in the current tree, since I properly fixed the ctx ref issue. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 11/16] io_uring: batch io_kiocb allocation 2019-01-09 16:57 ` Jens Axboe @ 2019-01-09 19:03 ` Christoph Hellwig 2019-01-09 20:08 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 19:03 UTC (permalink / raw) To: Jens Axboe Cc: Christoph Hellwig, linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On Wed, Jan 09, 2019 at 09:57:59AM -0700, Jens Axboe wrote: > On 1/9/19 5:13 AM, Christoph Hellwig wrote: > >> + if (!state) > >> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); > > > > Just return an error here if kmem_cache_alloc fails. > > > >> + if (req) > >> + io_req_init(ctx, req); > > > > Because all the other ones can't reached this with a NULL req. > > This is different in the current tree, since I properly fixed the > ctx ref issue. Your tree does a percpu_ref_tryget very first, and then leaks that if kmem_cache_alloc_bulk fails, and also is inconsistent for NULL vs ERR_PTR returns. I think you want something like this on top: diff --git a/fs/io_uring.c b/fs/io_uring.c index 35d055dcbc22..6c95749e9601 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -250,14 +250,6 @@ static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) return &ring->events[tail & ctx->cq_ring.ring_mask]; } -static bool io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) -{ - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; - return true; -} - static void io_ring_drop_ctx_ref(struct io_ring_ctx *ctx, unsigned refs) { percpu_ref_put_many(&ctx->refs, refs); @@ -274,9 +266,11 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, if (!percpu_ref_tryget(&ctx->refs)) return NULL; - if (!state) + if (!state) { req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); - else if (!state->free_iocbs) { + if (!req) + goto out_drop_ref; + } else if (!state->free_iocbs) { size_t size; int ret; @@ -284,7 +278,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, ret = 
kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, size, state->iocbs); if (ret <= 0) - return ERR_PTR(-ENOMEM); + goto out_drop_ref; state->free_iocbs = ret - 1; state->cur_iocb = 1; req = state->iocbs[0]; @@ -294,11 +288,11 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, state->cur_iocb++; } - if (req) { - io_req_init(ctx, req); - return req; - } + req->ki_ctx = ctx; + req->ki_flags = 0; + return req; +out_drop_ref: io_ring_drop_ctx_ref(ctx, 1); return NULL; } ^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: [PATCH 11/16] io_uring: batch io_kiocb allocation 2019-01-09 19:03 ` Christoph Hellwig @ 2019-01-09 20:08 ` Jens Axboe 2019-01-09 20:08 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-09 20:08 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 12:03 PM, Christoph Hellwig wrote: > On Wed, Jan 09, 2019 at 09:57:59AM -0700, Jens Axboe wrote: >> On 1/9/19 5:13 AM, Christoph Hellwig wrote: >>>> + if (!state) >>>> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); >>> >>> Just return an error here if kmem_cache_alloc fails. >>> >>>> + if (req) >>>> + io_req_init(ctx, req); >>> >>> Because all the other ones can't reached this with a NULL req. >> >> This is different in the current tree, since I properly fixed the >> ctx ref issue. > > Your tree does a percpu_ref_tryget very first, and then leaks that if > kmem_cache_alloc_bulk fails, and also is inconsistent for NULL vs > ERR_PTR returns. I think you want something like this on top: I fixed it up while doing the rebase already. Just haven't pushed anything out until I've had a chance to run it through testing. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 12/16] block: implement bio helper to add iter bvec pages to bio 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (10 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 11/16] io_uring: batch io_kiocb allocation Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers Jens Axboe ` (4 subsequent siblings) 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. This requires that the caller doesn't release the pages on IO completion; we add a BIO_HOLD_PAGES flag for that. The current two callers of bio_iov_iter_get_pages() are updated to check if they need to release pages on completion. This makes them work with bvecs that contain kernel mapped pages already. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- block/bio.c | 59 ++++++++++++++++++++++++++++++++------- fs/block_dev.c | 5 ++-- fs/iomap.c | 5 ++-- include/linux/blk_types.h | 1 + 4 files changed, 56 insertions(+), 14 deletions(-) diff --git a/block/bio.c b/block/bio.c index 4db1008309ed..7af4f45d2ed6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + const struct bio_vec *bv = iter->bvec; + unsigned int len; + size_t size; + + len = min_t(size_t, bv->bv_len, iter->count); + size = bio_add_page(bio, bv->bv_page, len, + bv->bv_offset + iter->iov_offset); + if (size == len) { + iov_iter_advance(iter, size); + return 0; + } + + return -EINVAL; +} + #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) /** @@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } 
/** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * bio_iov_iter_get_pages - add user or kernel pages to a bio * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped + * @iter: iov iterator describing the region to be added + * + * This takes either an iterator pointing to user memory, or one pointing to + * kernel pages (BVEC iterator). If we're adding user pages, we pin them and + * map them into the kernel. On IO completion, the caller should put those + * pages. If we're adding kernel pages, we just have to add the pages to the + * bio directly. We don't grab an extra reference to those pages (the user + * should already have that), and we don't put the page on IO completion. + * The caller needs to check if the bio is flagged BIO_HOLD_PAGES on IO + * completion. If it isn't, then pages should be released. * - * Pins pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. * The function tries, but does not guarantee, to pin as many pages as - * fit into the bio, or are requested in *iter, whatever is smaller. - * If MM encounters an error pinning the requested pages, it stops. - * Error is returned only if 0 pages could be pinned. + * fit into the bio, or are requested in *iter, whatever is smaller. If + * MM encounters an error pinning the requested pages, it stops. Error + * is returned only if 0 pages could be pinned. */ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { + const bool is_bvec = iov_iter_is_bvec(iter); unsigned short orig_vcnt = bio->bi_vcnt; + /* + * If this is a BVEC iter, then the pages are kernel pages. Don't + * release them on IO completion. 
+ */ + if (is_bvec) + bio_set_flag(bio, BIO_HOLD_PAGES); + do { - int ret = __bio_iov_iter_get_pages(bio, iter); + int ret; + + if (is_bvec) + ret = __bio_iov_bvec_add_pages(bio, iter); + else + ret = __bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) return bio->bi_vcnt > orig_vcnt ? 0 : ret; @@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/fs/block_dev.c b/fs/block_dev.c index 2ebd2a0d7789..b7742014c9de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -324,8 +324,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/fs/iomap.c b/fs/iomap.c index 4ee50b76b4a1..0a64c9c51203 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5c7e7f859a24..97e206855cd3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,6 +215,7 @@ struct bio { /* * bio flags */ +#define BIO_HOLD_PAGES 0 /* don't put O_DIRECT pages */ #define BIO_SEG_VALID 1 /* bi_phys_segments valid */ #define BIO_CLONED 2 /* doesn't own data */ #define BIO_BOUNCED 3 /* bio is a bounce bio */ -- 2.17.1 ^ 
permalink raw reply related [flat|nested] 62+ messages in thread
* [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (11 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 12/16] block: implement bio helper to add iter bvec pages to bio Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-09 12:16 ` Christoph Hellwig 2019-01-08 16:56 ` [PATCH 14/16] io_uring: support kernel side submission Jens Axboe ` (3 subsequent siblings) 16 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe If we have fixed user buffers, we can map them into the kernel when we set up the io_context. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must set IORING_SETUP_FIXEDBUFS and pass in an array of iovecs that contain the desired buffer addresses and lengths. These buffers can then be mapped into the kernel for the lifetime of the io_uring, as opposed to just the duration of each single IO. The application can then use the IORING_OP_{READ,WRITE}_FIXED to perform IO to these fixed locations. It's perfectly valid to set up a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. A limit of 4M is imposed as the largest buffer we currently support. There's nothing preventing us from going larger, but we need some cap, and 4M seemed like it would definitely be big enough. RLIMIT_MEMLOCK is used to cap the total amount of memory pinned. 
Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 212 +++++++++++++++++++++++++++++++--- include/uapi/linux/io_uring.h | 3 + 2 files changed, 201 insertions(+), 14 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 62778d7ffb8d..92129f867824 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -22,6 +22,8 @@ #include <linux/workqueue.h> #include <linux/blkdev.h> #include <linux/anon_inodes.h> +#include <linux/sizes.h> +#include <linux/nospec.h> #include <linux/uaccess.h> #include <linux/nospec.h> @@ -65,6 +67,13 @@ struct io_event_ring { unsigned ring_mask; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -74,6 +83,9 @@ struct io_ring_ctx { struct io_iocb_ring sq_ring; struct io_event_ring cq_ring; + /* if used, fixed mapped user buffers */ + struct io_mapped_ubuf *user_bufs; + struct work_struct work; /* iopoll submission state */ @@ -581,13 +593,45 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, return ret; } -static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, - struct iovec **iovec, struct iov_iter *iter) +static int io_setup_rw(int rw, struct io_kiocb *kiocb, + const struct io_uring_iocb *iocb, struct iovec **iovec, + struct iov_iter *iter, bool kaddr) { void __user *buf = (void __user *)(uintptr_t)iocb->addr; size_t ret; - ret = import_single_range(rw, buf, iocb->len, *iovec, iter); + if (!kaddr) { + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); + } else { + struct io_ring_ctx *ctx = kiocb->ki_ctx; + struct io_mapped_ubuf *imu; + size_t len = iocb->len; + size_t offset; + int index; + + /* __io_submit_one() already validated the index */ + index = array_index_nospec(kiocb->ki_index, + ctx->max_reqs); + imu = &ctx->user_bufs[index]; + if ((unsigned long) iocb->addr < imu->ubuf || + (unsigned long) iocb->addr + len > imu->ubuf + imu->len) { + ret = -EFAULT; + goto err; + 
} + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. + */ + offset = (unsigned long) iocb->addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, + offset + len); + if (offset) + iov_iter_advance(iter, offset); + ret = 0; + + } +err: *iovec = NULL; return ret; } @@ -672,7 +716,7 @@ static void io_iopoll_iocb_issued(struct io_submit_state *state, } static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state) + struct io_submit_state *state, bool kaddr) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -692,7 +736,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, if (unlikely(!file->f_op->read_iter)) goto out_fput; - ret = io_setup_rw(READ, iocb, &iovec, &iter); + ret = io_setup_rw(READ, kiocb, iocb, &iovec, &iter, kaddr); if (ret) goto out_fput; @@ -708,7 +752,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state) + struct io_submit_state *state, bool kaddr) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -728,7 +772,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, if (unlikely(!file->f_op->write_iter)) goto out_fput; - ret = io_setup_rw(WRITE, iocb, &iovec, &iter); + ret = io_setup_rw(WRITE, kiocb, iocb, &iovec, &iter, kaddr); if (ret) goto out_fput; ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); @@ -810,10 +854,16 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EINVAL; switch (iocb->opcode) { case IORING_OP_READ: - ret = io_read(req, iocb, state); + ret = io_read(req, iocb, state, false); + break; + case IORING_OP_READ_FIXED: + ret = io_read(req, iocb, state, true); break; case IORING_OP_WRITE: - ret = io_write(req, iocb, state); + 
ret = io_write(req, iocb, state, false); + break; + case IORING_OP_WRITE_FIXED: + ret = io_write(req, iocb, state, true); break; case IORING_OP_FSYNC: if (ctx->flags & IORING_SETUP_IOPOLL) @@ -1021,6 +1071,127 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } +static void io_iocb_buffer_unmap(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return; + + for (i = 0; i < ctx->max_reqs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; +} + +static int io_iocb_buffer_map(struct io_ring_ctx *ctx, + struct iovec __user *iovecs) +{ + unsigned long total_pages, page_limit; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + ctx->user_bufs = kcalloc(ctx->max_reqs, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + /* Don't allow more pages than we can safely lock */ + total_pages = 0; + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + for (i = 0; i < ctx->max_reqs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = -EFAULT; + if (copy_from_user(&iov, &iovecs[i], sizeof(iov))) + goto err; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. 
+ */ + ret = -EFAULT; + if (!iov.iov_base) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_4M) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = -ENOMEM; + if (total_pages + nr_pages > page_limit) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(pages); + pages = kmalloc(nr_pages * sizeof(struct page *), + GFP_KERNEL); + if (!pages) + goto err; + got_pages = nr_pages; + } + + imu->bvec = kmalloc(nr_pages * sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) + goto err; + + down_write(&current->mm->mmap_sem); + pret = get_user_pages(ubuf, nr_pages, 1, pages, NULL); + up_write(&current->mm->mmap_sem); + + if (pret < nr_pages) { + if (pret < 0) + ret = pret; + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + total_pages += nr_pages; + } + kfree(pages); + return 0; +err: + kfree(pages); + io_iocb_buffer_unmap(ctx); + return ret; +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring.ring) { @@ -1043,6 +1214,7 @@ static void io_ring_ctx_free(struct work_struct *work) io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_iocb_buffer_unmap(ctx); percpu_ref_exit(&ctx->refs); kmem_cache_free(ioctx_cachep, ctx); } @@ -1191,11 +1363,19 @@ static void io_fill_offsets(struct io_uring_params *p) p->cq_off.events = offsetof(struct io_cq_ring, events); } -static int io_uring_create(unsigned entries, struct io_uring_params *p) +static int io_uring_create(unsigned entries, struct io_uring_params *p, + struct
iovec __user *iovecs) { struct io_ring_ctx *ctx; int ret; + /* + * We don't use the iovecs without fixed buffers being asked for. + * Error out if they don't match. + */ + if (!(p->flags & IORING_SETUP_FIXEDBUFS) && iovecs) + return -EINVAL; + /* * Use twice as many entries for the CQ ring. It's possible for the * application to drive a higher depth than the size of the SQ ring, @@ -1213,6 +1393,12 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret) goto err; + if (p->flags & IORING_SETUP_FIXEDBUFS) { + ret = io_iocb_buffer_map(ctx, iovecs); + if (ret) + goto err; + } + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); if (ret < 0) @@ -1245,12 +1431,10 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags & ~IORING_SETUP_IOPOLL) - return -EINVAL; - if (iovecs) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS)) return -EINVAL; - ret = io_uring_create(entries, &p); + ret = io_uring_create(entries, &p, iovecs); if (ret < 0) return ret; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f7ba30747816..925fd6ca3f38 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -35,11 +35,14 @@ struct io_uring_iocb { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ +#define IORING_SETUP_FIXEDBUFS (1 << 1) /* IO buffers are fixed */ #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 #define IORING_OP_FSYNC 3 #define IORING_OP_FDSYNC 4 +#define IORING_OP_READ_FIXED 5 +#define IORING_OP_WRITE_FIXED 6 /* * IO completion data structure -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
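For readers following the buffer registration in the patch above, here is a minimal user-space sketch of the page arithmetic that io_iocb_buffer_map() performs: how many pages a user buffer spans, and how it is cut into per-page segments. This is not kernel code. The sketch_ names, the 4 KiB page size, and struct sketch_seg (a stand-in for struct bio_vec without the page pointer) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. The kernel uses PAGE_SHIFT/PAGE_SIZE. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uintptr_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for struct bio_vec, minus the page pointer. */
struct sketch_seg {
	size_t offset;	/* offset of the data within its page */
	size_t len;	/* number of bytes in this page */
};

/* Number of pages touched by the byte range [ubuf, ubuf + len). */
static size_t sketch_nr_pages(uintptr_t ubuf, size_t len)
{
	uintptr_t end = (ubuf + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	uintptr_t start = ubuf >> SKETCH_PAGE_SHIFT;

	return (size_t)(end - start);
}

/*
 * Fill segs[] with one entry per page, mirroring the bvec setup loop:
 * only the first segment can have a non-zero offset, and only the first
 * and last segments can be shorter than a page.
 */
static size_t sketch_split(uintptr_t ubuf, size_t len,
			   struct sketch_seg *segs, size_t max_segs)
{
	size_t nr = sketch_nr_pages(ubuf, len);
	size_t off = ubuf & (SKETCH_PAGE_SIZE - 1);
	size_t j;

	assert(nr <= max_segs);
	for (j = 0; j < nr; j++) {
		size_t room = SKETCH_PAGE_SIZE - off;
		size_t vec_len = len < room ? len : room;

		segs[j].offset = off;
		segs[j].len = vec_len;
		off = 0;
		len -= vec_len;
	}
	return nr;
}
```

With these assumptions, a 4096-byte buffer that starts 100 bytes into a page spans two pages: the first segment carries 3996 bytes at offset 100 and the second carries the remaining 100 bytes at offset 0.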
* Re: [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers 2019-01-08 16:56 ` [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers Jens Axboe @ 2019-01-09 12:16 ` Christoph Hellwig 2019-01-09 17:06 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 12:16 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi > +static int io_setup_rw(int rw, struct io_kiocb *kiocb, > + const struct io_uring_iocb *iocb, struct iovec **iovec, > + struct iov_iter *iter, bool kaddr) > { > void __user *buf = (void __user *)(uintptr_t)iocb->addr; > size_t ret; > > - ret = import_single_range(rw, buf, iocb->len, *iovec, iter); > + if (!kaddr) { > + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); > + } else { > + struct io_ring_ctx *ctx = kiocb->ki_ctx; > + struct io_mapped_ubuf *imu; > + size_t len = iocb->len; > + size_t offset; > + int index; > + > + /* __io_submit_one() already validated the index */ > + index = array_index_nospec(kiocb->ki_index, > + ctx->max_reqs); > + imu = &ctx->user_bufs[index]; > + if ((unsigned long) iocb->addr < imu->ubuf || > + (unsigned long) iocb->addr + len > imu->ubuf + imu->len) { > + ret = -EFAULT; > + goto err; > + } > + > + /* > + * May not be a start of buffer, set size appropriately > + * and advance us to the beginning. > + */ > + offset = (unsigned long) iocb->addr - imu->ubuf; > + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, > + offset + len); > + if (offset) > + iov_iter_advance(iter, offset); > + ret = 0; > + Please split this code in a separate helper. > if (unlikely(!file->f_op->read_iter)) > goto out_fput; > > - ret = io_setup_rw(READ, iocb, &iovec, &iter); > + ret = io_setup_rw(READ, kiocb, iocb, &iovec, &iter, kaddr); And I'd personally just call that helper here based on the opcode and avoid magic bool arguments. 
> + down_write(&current->mm->mmap_sem); > + pret = get_user_pages(ubuf, nr_pages, 1, pages, NULL); > + up_write(&current->mm->mmap_sem); This needs to be get_user_pages_longterm. > + * We don't use the iovecs without fixed buffers being asked for. > + * Error out if they don't match. > + */ > + if (!(p->flags & IORING_SETUP_FIXEDBUFS) && iovecs) > + return -EINVAL; I don't think we need the IORING_SETUP_FIXEDBUFS flag at all, as a non-zero iovecs pointer is enough of an indication. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers 2019-01-09 12:16 ` Christoph Hellwig @ 2019-01-09 17:06 ` Jens Axboe 0 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-09 17:06 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 5:16 AM, Christoph Hellwig wrote: >> +static int io_setup_rw(int rw, struct io_kiocb *kiocb, >> + const struct io_uring_iocb *iocb, struct iovec **iovec, >> + struct iov_iter *iter, bool kaddr) >> { >> void __user *buf = (void __user *)(uintptr_t)iocb->addr; >> size_t ret; >> >> - ret = import_single_range(rw, buf, iocb->len, *iovec, iter); >> + if (!kaddr) { >> + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); >> + } else { >> + struct io_ring_ctx *ctx = kiocb->ki_ctx; >> + struct io_mapped_ubuf *imu; >> + size_t len = iocb->len; >> + size_t offset; >> + int index; >> + >> + /* __io_submit_one() already validated the index */ >> + index = array_index_nospec(kiocb->ki_index, >> + ctx->max_reqs); >> + imu = &ctx->user_bufs[index]; >> + if ((unsigned long) iocb->addr < imu->ubuf || >> + (unsigned long) iocb->addr + len > imu->ubuf + imu->len) { >> + ret = -EFAULT; >> + goto err; >> + } >> + >> + /* >> + * May not be a start of buffer, set size appropriately >> + * and advance us to the beginning. >> + */ >> + offset = (unsigned long) iocb->addr - imu->ubuf; >> + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, >> + offset + len); >> + if (offset) >> + iov_iter_advance(iter, offset); >> + ret = 0; >> + > > Please split this code in a separate helper. Done >> if (unlikely(!file->f_op->read_iter)) >> goto out_fput; >> >> - ret = io_setup_rw(READ, iocb, &iovec, &iter); >> + ret = io_setup_rw(READ, kiocb, iocb, &iovec, &iter, kaddr); > > And I'd personally just call that helper here based on the opcode and > avoid magic bool arguments. Then we can also fold the switch cases, cleans it up. 
>> + down_write(&current->mm->mmap_sem); >> + pret = get_user_pages(ubuf, nr_pages, 1, pages, NULL); >> + up_write(&current->mm->mmap_sem); > > This needs to be get_user_pages_longterm. Done >> + * We don't use the iovecs without fixed buffers being asked for. >> + * Error out if they don't match. >> + */ >> + if (!(p->flags & IORING_SETUP_FIXEDBUFS) && iovecs) >> + return -EINVAL; > > I don't think we need the IORING_SETUP_FIXEDBUFS flag at all, as a > non-zero iovecs pointer is enough of an indication. Good point, no point in that redundancy. Fixed. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
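The range check discussed in this exchange can be sketched in user space as a small standalone helper. The updated helper was not posted in this part of the thread, so the following is only an illustration of what the quoted check computes, not the code that ended up in the tree. struct sketch_mapped_buf and sketch_fixed_range() are hypothetical names. The sketch mirrors the posted comparison as is and does not guard against addr + len wrapping around.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two io_mapped_ubuf fields the check uses. */
struct sketch_mapped_buf {
	uintptr_t ubuf;	/* start address given at registration time */
	size_t len;	/* registered length in bytes */
};

/*
 * Validate that [addr, addr + len) lies inside the registered buffer.
 * On success, store how far into the buffer the IO starts; that is the
 * distance the iterator has to be advanced before the transfer begins.
 * Returns 0 on success and -EFAULT when the range falls outside.
 */
static int sketch_fixed_range(const struct sketch_mapped_buf *imu,
			      uintptr_t addr, size_t len, size_t *offset)
{
	assert(imu != NULL && offset != NULL);

	if (addr < imu->ubuf || addr + len > imu->ubuf + imu->len)
		return -EFAULT;

	*offset = (size_t)(addr - imu->ubuf);
	return 0;
}
```

In the quoted patch the iterator is set up to cover offset + len bytes of the registered buffer and then advanced by offset, so an IO may start anywhere inside the registered range.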
* [PATCH 14/16] io_uring: support kernel side submission 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (12 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-09 19:06 ` Christoph Hellwig 2019-01-08 16:56 ` [PATCH 15/16] io_uring: add submission polling Jens Axboe ` (2 subsequent siblings) 16 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe Add support for backing the io_uring fd with either a thread, or a workqueue and letting those handle the submission for us. This can be used to reduce overhead for submission, or to always make submission async. The latter is particularly useful for buffered aio, which is now fully async with this feature. For polled IO, we could have the kernel side thread hammer on the SQ ring and submit when it finds IO. This would mean that an application would NEVER have to enter the kernel to do IO! Didn't add this yet, but it would be trivial to add. If an application sets IORING_SETUP_SQTHREAD, the io_uring gets a single thread backing. If used with buffered IO, this will limit the device queue depth to 1, but it will be async; IOs will simply be serialized. Or an application can set IORING_SETUP_SQWQ, in which case the urings get a work queue backing. The concurrency level is the minimum of twice the available CPUs, or the queue depth specified for the context. For this mode, we attempt to do buffered reads inline, in case they are cached. So we should only punt to a workqueue, if we would have to block to get our data. Tested with polling, no polling, fixedbufs, no fixedbufs, buffered, O_DIRECT.
See this sample application for how to use it: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 405 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 +- 2 files changed, 387 insertions(+), 23 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 92129f867824..e6a808a89b78 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -14,6 +14,7 @@ #include <linux/sched/signal.h> #include <linux/fs.h> #include <linux/file.h> +#include <linux/fdtable.h> #include <linux/mm.h> #include <linux/mman.h> #include <linux/mmu_context.h> @@ -24,6 +25,8 @@ #include <linux/anon_inodes.h> #include <linux/sizes.h> #include <linux/nospec.h> +#include <linux/kthread.h> +#include <linux/sched/mm.h> #include <linux/uaccess.h> #include <linux/nospec.h> @@ -58,6 +61,7 @@ struct io_iocb_ring { struct io_sq_ring *ring; unsigned entries; unsigned ring_mask; + unsigned sq_thread_cpu; struct io_uring_iocb *iocbs; }; @@ -74,6 +78,14 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; }; +struct io_sq_offload { + struct task_struct *thread; /* if using a thread */ + struct workqueue_struct *wq; /* wq offload */ + struct mm_struct *mm; + struct files_struct *files; + wait_queue_head_t wait; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -88,6 +100,9 @@ struct io_ring_ctx { struct work_struct work; + /* sq ring submitter thread, if used */ + struct io_sq_offload sq_offload; + /* iopoll submission state */ struct { spinlock_t poll_lock; @@ -127,6 +142,7 @@ struct io_kiocb { unsigned long ki_flags; #define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ #define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ +#define KIOCB_F_FORCE_NONBLOCK 2 /* inline submission attempt */ }; #define IO_PLUG_THRESHOLD 2 @@ -164,6 +180,18 @@ struct io_submit_state { unsigned int ios_left; }; +struct iocb_submit { + const struct io_uring_iocb *iocb; + unsigned int index; +}; + +struct io_work { + struct work_struct 
work; + struct io_ring_ctx *ctx; + struct io_uring_iocb iocb; + unsigned iocb_index; +}; + static struct kmem_cache *kiocb_cachep, *ioctx_cachep; static const struct file_operations io_scqring_fops; @@ -442,18 +470,17 @@ static void kiocb_end_write(struct kiocb *kiocb) } } -static void io_fill_event(struct io_uring_event *ev, struct io_kiocb *kiocb, +static void io_fill_event(struct io_uring_event *ev, unsigned ki_index, long res, unsigned flags) { - ev->index = kiocb->ki_index; + ev->index = ki_index; ev->res = res; ev->flags = flags; } -static void io_cqring_fill_event(struct io_kiocb *iocb, long res, - unsigned ev_flags) +static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) { - struct io_ring_ctx *ctx = iocb->ki_ctx; struct io_uring_event *ev; unsigned long flags; @@ -465,7 +492,7 @@ static void io_cqring_fill_event(struct io_kiocb *iocb, long res, spin_lock_irqsave(&ctx->completion_lock, flags); ev = io_peek_cqring(ctx); if (ev) { - io_fill_event(ev, iocb, res, ev_flags); + io_fill_event(ev, ki_index, res, ev_flags); io_inc_cqring(ctx); } else ctx->cq_ring.ring->overflow++; @@ -474,10 +501,24 @@ static void io_cqring_fill_event(struct io_kiocb *iocb, long res, static void io_complete_scqring(struct io_kiocb *iocb, long res, unsigned flags) { - io_cqring_fill_event(iocb, res, flags); + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, flags); io_complete_iocb(iocb->ki_ctx, iocb); } +static void io_fill_cq_error(struct io_ring_ctx *ctx, unsigned ki_index, + long error) +{ + io_cqring_fill_event(ctx, ki_index, error, 0); + + /* + * for thread offload, app could already be sleeping in io_ring_enter() + * before we get to flag the error. wake them up, if needed. 
+ */ + if (ctx->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); @@ -485,6 +526,7 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) kiocb_end_write(kiocb); fput(kiocb->ki_filp); + io_complete_scqring(iocb, res, 0); } @@ -497,7 +539,7 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) if (unlikely(res == -EAGAIN)) { set_bit(KIOCB_F_IOPOLL_EAGAIN, &iocb->ki_flags); } else { - io_cqring_fill_event(iocb, res, 0); + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); set_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags); } } @@ -549,7 +591,7 @@ static struct file *io_file_get(struct io_submit_state *state, int fd) } static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; @@ -573,6 +615,10 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, ret = kiocb_set_rw_flags(req, iocb->rw_flags); if (unlikely(ret)) goto out_fput; + if (force_nonblock) { + req->ki_flags |= IOCB_NOWAIT; + set_bit(KIOCB_F_FORCE_NONBLOCK, &kiocb->ki_flags); + } if (ctx->flags & IORING_SETUP_IOPOLL) { ret = -EOPNOTSUPP; @@ -716,7 +762,7 @@ static void io_iopoll_iocb_issued(struct io_submit_state *state, } static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state, bool kaddr) + struct io_submit_state *state, bool kaddr, bool nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -724,7 +770,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, struct file *file; ssize_t ret; - 
ret = io_prep_rw(kiocb, iocb, state); + ret = io_prep_rw(kiocb, iocb, state, nonblock); if (ret) return ret; file = req->ki_filp; @@ -741,8 +787,18 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, goto out_fput; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); - if (!ret) - io_rw_done(req, call_read_iter(file, req, &iter)); + if (!ret) { + ssize_t ret2; + + /* + * Catch -EAGAIN return for forced non-blocking submission + */ + ret2 = call_read_iter(file, req, &iter); + if (!nonblock || ret2 != -EAGAIN) + io_rw_done(req, ret2); + else + ret = -EAGAIN; + } kfree(iovec); out_fput: if (unlikely(ret)) @@ -760,7 +816,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb, state); + ret = io_prep_rw(kiocb, iocb, state, false); if (ret) return ret; file = req->ki_filp; @@ -833,7 +889,7 @@ static int io_fsync(struct fsync_iocb *req, const struct io_uring_iocb *iocb, static int __io_submit_one(struct io_ring_ctx *ctx, const struct io_uring_iocb *iocb, unsigned long ki_index, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { struct io_kiocb *req; ssize_t ret; @@ -854,10 +910,10 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EINVAL; switch (iocb->opcode) { case IORING_OP_READ: - ret = io_read(req, iocb, state, false); + ret = io_read(req, iocb, state, false, force_nonblock); break; case IORING_OP_READ_FIXED: - ret = io_read(req, iocb, state, true); + ret = io_read(req, iocb, state, true, force_nonblock); break; case IORING_OP_WRITE: ret = io_write(req, iocb, state, false); @@ -993,7 +1049,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!iocb) break; - ret = __io_submit_one(ctx, iocb, iocb_index, statep); + ret = __io_submit_one(ctx, iocb, iocb_index, statep, false); if (ret) break; @@ -1042,15 +1098,239 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) 
return ring->r.head == ring->r.tail ? ret : 0; } +static int io_submit_iocbs(struct io_ring_ctx *ctx, struct iocb_submit *iocbs, + unsigned int nr, struct mm_struct *cur_mm, + bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) + ret = -EFAULT; + else + ret = __io_submit_one(ctx, iocbs[i].iocb, + iocbs[i].index, statep, false); + if (!ret) { + submitted++; + continue; + } + + io_fill_cq_error(ctx, iocbs[i].index, ret); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +/* + * sq thread only supports O_DIRECT or FIXEDBUFS IO + */ +static int io_sq_thread(void *data) +{ + struct iocb_submit iocbs[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct io_sq_offload *sqo = &ctx->sq_offload; + struct mm_struct *cur_mm = NULL; + struct files_struct *old_files; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + + old_files = current->files; + current->files = sqo->files; + + old_fs = get_fs(); + set_fs(USER_DS); + + while (!kthread_should_stop()) { + const struct io_uring_iocb *iocb; + bool mm_fault = false; + unsigned iocb_index; + int i; + + iocb = io_peek_sqring(ctx, &iocb_index); + if (!iocb) { + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. 
+ */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + iocb = io_peek_sqring(ctx, &iocb_index); + if (!iocb) { + if (kthread_should_park()) + kthread_parkme(); + if (kthread_should_stop()) { + finish_wait(&sqo->wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + } + finish_wait(&sqo->wait, &wait); + if (!iocb) + continue; + } + + /* If ->mm is set, we're not doing FIXEDBUFS */ + if (sqo->mm && !cur_mm) { + mm_fault = !mmget_not_zero(sqo->mm); + if (!mm_fault) { + use_mm(sqo->mm); + cur_mm = sqo->mm; + } + } + + i = 0; + do { + if (i == ARRAY_SIZE(iocbs)) + break; + iocbs[i].iocb = iocb; + iocbs[i].index = iocb_index; + ++i; + io_inc_sqring(ctx); + } while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL); + + io_submit_iocbs(ctx, iocbs, i, cur_mm, mm_fault); + } + current->files = old_files; + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_work *iw = container_of(work, struct io_work, work); + struct io_ring_ctx *ctx = iw->ctx; + struct io_sq_offload *sqo = &ctx->sq_offload; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + old_files = current->files; + current->files = sqo->files; + + if (sqo->mm) { + if (!mmget_not_zero(sqo->mm)) { + ret = -EFAULT; + goto err; + } + use_mm(sqo->mm); + } + + set_fs(USER_DS); + + ret = __io_submit_one(ctx, &iw->iocb, iw->iocb_index, NULL, false); + + set_fs(old_fs); + if (sqo->mm) { + unuse_mm(sqo->mm); + mmput(sqo->mm); + } + +err: + if (ret) + io_fill_cq_error(ctx, iw->iocb_index, ret); + current->files = old_files; + kfree(iw); +} + +/* + * If this is a read, try a cached inline read first. If the IO is in the + * page cache, we can satisfy it without blocking and without having to + * punt to a threaded execution. 
This is much faster, particularly for + * lower queue depth IO, and it's always a lot more efficient. + */ +static bool io_sq_try_inline(struct io_ring_ctx *ctx, + const struct io_uring_iocb *iocb, unsigned index) +{ + int ret; + + if (iocb->opcode != IORING_OP_READ && + iocb->opcode != IORING_OP_READ_FIXED) + return false; + + ret = __io_submit_one(ctx, iocb, index, NULL, true); + + /* + * If we get -EAGAIN, return false to submit out-of-line. Any other + * result and we're done, call will fill in CQ ring event. + */ + return ret != -EAGAIN; +} + +static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + const struct io_uring_iocb *iocb; + struct io_work *work; + unsigned iocb_index; + int ret, queued; + + ret = queued = 0; + while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL) { + ret = io_sq_try_inline(ctx, iocb, iocb_index); + if (!ret) { + work = kmalloc(sizeof(*work), GFP_KERNEL); + if (!work) { + ret = -ENOMEM; + break; + } + memcpy(&work->iocb, iocb, sizeof(*iocb)); + io_inc_sqring(ctx); + work->iocb_index = iocb_index; + INIT_WORK(&work->work, io_sq_wq_submit_work); + work->ctx = ctx; + queue_work(ctx->sq_offload.wq, &work->work); + } + queued++; + if (queued == to_submit) + break; + } + + return queued ? 
queued : ret; +} + static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, unsigned min_complete, unsigned flags) { int ret = 0; if (to_submit) { - ret = io_ring_submit(ctx, to_submit); - if (ret < 0) - return ret; + /* + * Three options here: + * 1) We have an sq thread, just wake it up to do submissions + * 2) We have an sq wq, queue a work item for each iocb + * 3) Submit directly + */ + if (ctx->flags & IORING_SETUP_SQTHREAD) { + wake_up(&ctx->sq_offload.wait); + ret = to_submit; + } else if (ctx->flags & IORING_SETUP_SQWQ) { + ret = io_sq_wq_submit(ctx, to_submit); + } else { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -1192,6 +1472,78 @@ static int io_iocb_buffer_map(struct io_ring_ctx *ctx, return ret; } +static int io_sq_thread(void *); + +static int io_sq_thread_start(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + struct io_iocb_ring *ring = &ctx->sq_ring; + int ret; + + memset(sqo, 0, sizeof(*sqo)); + init_waitqueue_head(&sqo->wait); + + if (!(ctx->flags & IORING_SETUP_FIXEDBUFS)) + sqo->mm = current->mm; + + ret = -EBADF; + sqo->files = get_files_struct(current); + if (!sqo->files) + goto err; + + if (ctx->flags & IORING_SETUP_SQTHREAD) { + sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, + ring->sq_thread_cpu, + "io_uring-sq"); + if (IS_ERR(sqo->thread)) { + ret = PTR_ERR(sqo->thread); + sqo->thread = NULL; + goto err; + } + wake_up_process(sqo->thread); + } else if (ctx->flags & IORING_SETUP_SQWQ) { + int concurrency; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + concurrency = min(ring->entries - 1, 2 * num_online_cpus()); + sqo->wq = alloc_workqueue("io_ring-wq", + WQ_UNBOUND | WQ_FREEZABLE, + concurrency); + if (!sqo->wq) { + ret = -ENOMEM; + goto err; + } + } + + return 0; +err: + if (sqo->files) { + put_files_struct(sqo->files); + sqo->files = NULL; + } + if (sqo->mm) + sqo->mm = NULL; + return 
ret; +} + +static void io_sq_thread_stop(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + + if (sqo->thread) { + kthread_park(sqo->thread); + kthread_stop(sqo->thread); + sqo->thread = NULL; + } else if (sqo->wq) { + destroy_workqueue(sqo->wq); + sqo->wq = NULL; + } + if (sqo->files) { + put_files_struct(sqo->files); + sqo->files = NULL; + } +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring.ring) { @@ -1212,6 +1564,7 @@ static void io_ring_ctx_free(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, work); + io_sq_thread_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); io_iocb_buffer_unmap(ctx); @@ -1398,6 +1751,13 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; } + if (p->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) { + ctx->sq_ring.sq_thread_cpu = p->sq_thread_cpu; + + ret = io_sq_thread_start(ctx); + if (ret) + goto err; + } ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); @@ -1431,7 +1791,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS)) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS | + IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 925fd6ca3f38..4f0a8ce49f9a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -36,6 +36,8 @@ struct io_uring_iocb { */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ #define IORING_SETUP_FIXEDBUFS (1 << 1) /* IO buffers are fixed */ +#define IORING_SETUP_SQTHREAD (1 << 2) /* Use SQ thread */ +#define IORING_SETUP_SQWQ (1 << 3) /* Use SQ workqueue */ #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 @@ -96,7 +98,8 @@ struct 
io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u16 resv[10]; + __u16 sq_thread_cpu; + __u16 resv[9]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; }; -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
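The "try inline first, punt only on -EAGAIN" decision that io_sq_try_inline() makes in the patch above can be sketched outside the kernel as follows. This is an illustration under assumptions, not the kernel implementation: the submit callback type, the two stub submitters, and all sketch_ names are hypothetical. Only the opcode values are taken from the uapi header in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Opcode values as defined in include/uapi/linux/io_uring.h in this series. */
#define SKETCH_OP_READ		1
#define SKETCH_OP_WRITE		2
#define SKETCH_OP_READ_FIXED	5

/* Hypothetical submit hook: returns 0 on success or a negative errno. */
typedef int (*sketch_submit_fn)(int opcode, bool force_nonblock);

/*
 * Returns true when the request was fully handled inline (success or a
 * hard error), false when the caller must hand it to a worker. Only
 * reads are attempted inline; everything else is punted right away.
 */
static bool sketch_try_inline(int opcode, sketch_submit_fn submit)
{
	assert(submit != NULL);

	if (opcode != SKETCH_OP_READ && opcode != SKETCH_OP_READ_FIXED)
		return false;

	/* -EAGAIN means the data was not cached and we would have blocked. */
	return submit(opcode, true) != -EAGAIN;
}

/* Stub: pretends the data is in the page cache, so nothing blocks. */
static int sketch_submit_cached(int opcode, bool force_nonblock)
{
	(void)opcode;
	(void)force_nonblock;
	return 0;
}

/* Stub: pretends the data is not cached, so a non-blocking attempt fails. */
static int sketch_submit_uncached(int opcode, bool force_nonblock)
{
	(void)opcode;
	return force_nonblock ? -EAGAIN : 0;
}
```

The point of the split is that the submitting task only pays for a worker handoff when the read would actually block; a cached read completes in the submitter's own context.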
* Re: [PATCH 14/16] io_uring: support kernel side submission 2019-01-08 16:56 ` [PATCH 14/16] io_uring: support kernel side submission Jens Axboe @ 2019-01-09 19:06 ` Christoph Hellwig 2019-01-09 20:49 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Christoph Hellwig @ 2019-01-09 19:06 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi > +struct iocb_submit { > + const struct io_uring_iocb *iocb; > + unsigned int index; > +}; > + > +struct io_work { > + struct work_struct work; > + struct io_ring_ctx *ctx; > + struct io_uring_iocb iocb; > + unsigned iocb_index; > +}; I think we should use struct iocb_submit everywhere where we pass an iocb + index, including in the io_work structure. Here is how I did that to my old tree: http://git.infradead.org/users/hch/misc.git/commitdiff/65929ed6f143a1c996305a10a15fd7f64d32595f Btw, I think we can also remove the separate io_work allocation, just do the io_get_req in the submission context, and overlay the io_work onto the iocb in a union of some sort. This avoids a whole memory allocation. ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 14/16] io_uring: support kernel side submission 2019-01-09 19:06 ` Christoph Hellwig @ 2019-01-09 20:49 ` Jens Axboe 2019-01-09 20:49 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-09 20:49 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, jmoyer, avi On 1/9/19 12:06 PM, Christoph Hellwig wrote: >> +struct iocb_submit { >> + const struct io_uring_iocb *iocb; >> + unsigned int index; >> +}; >> + >> +struct io_work { >> + struct work_struct work; >> + struct io_ring_ctx *ctx; >> + struct io_uring_iocb iocb; >> + unsigned iocb_index; >> +}; > > I think we should use struct iocb_submit everywhere where we pass > an iocb + index, including in the io_work structure. Here is how > I did that to my old tree: > > http://git.infradead.org/users/hch/misc.git/commitdiff/65929ed6f143a1c996305a10a15fd7f64d32595f I like that unification. > Btw, I think we can also remove the separate io_work allocation, > just do the io_get_req in the submission context, and overlay > the io_work onto the iocb in a union of some sort. This avoids > a whole memory allocation. I'll take a look at that. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 15/16] io_uring: add submission polling 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (13 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 14/16] io_uring: support kernel side submission Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-08 16:56 ` [PATCH 16/16] io_uring: add io_uring_event cache hit information Jens Axboe 2019-01-09 16:00 ` [PATCHSET v1] io_uring IO interface Matthew Wilcox 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe This enables an application to do IO, without ever entering the kernel. By using the SQ ring to fill in new events and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. For O_DIRECT, we can do this with just SQTHREAD being enabled. For buffered aio, we need the workqueue as well. If we can satisfy the buffered inline from the SQTHREAD, we do that. If not, we punt to the workqueue. This is just like buffered aio off the io_uring_enter(2) system call. Proof of concept. If the thread has been idle for 1 second, it will set sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard its io_uring_enter(2) call with: barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, to_submit, 0, 0); instead of calling it unconditionally. Improvements: 1) Maybe have smarter backoff. Busy loop for X time, then go to monitor/mwait, finally the schedule we have now after an idle second. Might not be worth the complexity. 
2) Probably want the application to pass in the appropriate grace period, not hard code it at 1 second. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 135 ++++++++++++++++++++++++++++------ include/uapi/linux/io_uring.h | 3 + 2 files changed, 115 insertions(+), 23 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e6a808a89b78..6c10841e4342 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -80,7 +80,8 @@ struct io_mapped_ubuf { struct io_sq_offload { struct task_struct *thread; /* if using a thread */ - struct workqueue_struct *wq; /* wq offload */ + bool thread_poll; + struct workqueue_struct *wq; /* wq offload */ struct mm_struct *mm; struct files_struct *files; wait_queue_head_t wait; @@ -198,6 +199,7 @@ static const struct file_operations io_scqring_fops; static void io_ring_ctx_free(struct work_struct *work); static void io_ring_ctx_ref_free(struct percpu_ref *ref); +static void io_sq_wq_submit_work(struct work_struct *work); static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { @@ -1098,27 +1100,59 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) return ring->r.head == ring->r.tail ? 
ret : 0; } +static int io_queue_async_work(struct io_ring_ctx *ctx, struct iocb_submit *is) +{ + struct io_work *work; + + work = kmalloc(sizeof(*work), GFP_KERNEL); + if (work) { + memcpy(&work->iocb, is->iocb, sizeof(*is->iocb)); + work->iocb_index = is->index; + INIT_WORK(&work->work, io_sq_wq_submit_work); + work->ctx = ctx; + queue_work(ctx->sq_offload.wq, &work->work); + return 0; + } + + return -ENOMEM; +} + static int io_submit_iocbs(struct io_ring_ctx *ctx, struct iocb_submit *iocbs, unsigned int nr, struct mm_struct *cur_mm, bool mm_fault) { struct io_submit_state state, *statep = NULL; int ret, i, submitted = 0; + bool force_nonblock; if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr); statep = &state; } + /* + * Having both a thread and a workqueue only makes sense for buffered + * IO, where we can't submit in an async fashion. Use the NOWAIT + * trick from the SQ thread, and punt to the workqueue if we can't + * satisfy this iocb without blocking. This is only necessary + * for buffered IO with sqthread polled submission. + */ + force_nonblock = (ctx->flags & IORING_SETUP_SQWQ) != 0; + for (i = 0; i < nr; i++) { - if (unlikely(mm_fault)) + if (unlikely(mm_fault)) { ret = -EFAULT; - else + } else { ret = __io_submit_one(ctx, iocbs[i].iocb, - iocbs[i].index, statep, false); - if (!ret) { - submitted++; - continue; + iocbs[i].index, statep, + force_nonblock); + /* nogo, submit to workqueue */ + if (force_nonblock && ret == -EAGAIN) + ret = io_queue_async_work(ctx, &iocbs[i]); + if (!ret) { + submitted++; + continue; + } } io_fill_cq_error(ctx, iocbs[i].index, ret); @@ -1131,7 +1165,10 @@ static int io_submit_iocbs(struct io_ring_ctx *ctx, struct iocb_submit *iocbs, } /* - * sq thread only supports O_DIRECT or FIXEDBUFS IO + * SQ thread is woken if the app asked for offloaded submission. 
This can + * be either O_DIRECT, in which case we do submissions directly, or it can + * be buffered IO, in which case we do them inline if we can do so without + * blocking. If we can't, then we punt to a workqueue. */ static int io_sq_thread(void *data) { @@ -1142,6 +1179,8 @@ static int io_sq_thread(void *data) struct files_struct *old_files; mm_segment_t old_fs; DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; old_files = current->files; current->files = sqo->files; @@ -1149,14 +1188,43 @@ static int io_sq_thread(void *data) old_fs = get_fs(); set_fs(USER_DS); + timeout = inflight = 0; while (!kthread_should_stop()) { const struct io_uring_iocb *iocb; bool mm_fault = false; unsigned iocb_index; int i; + if (sqo->thread_poll && inflight) { + unsigned int nr_events = 0; + + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. + */ + if (ctx->flags & IORING_SETUP_IOPOLL) + io_iopoll_check(ctx, &nr_events, 0); + else + nr_events = inflight; + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + HZ; + } + iocb = io_peek_sqring(ctx, &iocb_index); if (!iocb) { + /* + * If we're polling, let us spin for a second without + * work before going to sleep. + */ + if (sqo->thread_poll) { + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + } + /* * Drop cur_mm before scheduling, we can't hold it for * long periods (or over schedule()). 
Do this before @@ -1170,6 +1238,16 @@ static int io_sq_thread(void *data) } prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + if (sqo->thread_poll) { + struct io_sq_ring *ring; + + ring = ctx->sq_ring.ring; + ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + iocb = io_peek_sqring(ctx, &iocb_index); if (!iocb) { if (kthread_should_park()) @@ -1181,6 +1259,13 @@ static int io_sq_thread(void *data) if (signal_pending(current)) flush_signals(current); schedule(); + + if (sqo->thread_poll) { + struct io_sq_ring *ring; + + ring = ctx->sq_ring.ring; + ring->flags &= ~IORING_SQ_NEED_WAKEUP; + } } finish_wait(&sqo->wait, &wait); if (!iocb) @@ -1206,7 +1291,7 @@ static int io_sq_thread(void *data) io_inc_sqring(ctx); } while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL); - io_submit_iocbs(ctx, iocbs, i, cur_mm, mm_fault); + inflight += io_submit_iocbs(ctx, iocbs, i, cur_mm, mm_fault); } current->files = old_files; set_fs(old_fs); @@ -1281,7 +1366,6 @@ static bool io_sq_try_inline(struct io_ring_ctx *ctx, static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { const struct io_uring_iocb *iocb; - struct io_work *work; unsigned iocb_index; int ret, queued; @@ -1289,18 +1373,17 @@ static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL) { ret = io_sq_try_inline(ctx, iocb, iocb_index); if (!ret) { - work = kmalloc(sizeof(*work), GFP_KERNEL); - if (!work) { - ret = -ENOMEM; + struct iocb_submit is = { + .iocb = iocb, + .index = iocb_index + }; + + ret = io_queue_async_work(ctx, &is); + if (ret) break; - } - memcpy(&work->iocb, iocb, sizeof(*iocb)); - io_inc_sqring(ctx); - work->iocb_index = iocb_index; - INIT_WORK(&work->work, io_sq_wq_submit_work); - work->ctx = ctx; - queue_work(ctx->sq_offload.wq, &work->work); } + + io_inc_sqring(ctx); queued++; if (queued == to_submit) break; @@ -1491,6 +1574,9 @@ static int 
io_sq_thread_start(struct io_ring_ctx *ctx) if (!sqo->files) goto err; + if (ctx->flags & IORING_SETUP_SQPOLL) + sqo->thread_poll = true; + if (ctx->flags & IORING_SETUP_SQTHREAD) { sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, ring->sq_thread_cpu, @@ -1501,7 +1587,8 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) goto err; } wake_up_process(sqo->thread); - } else if (ctx->flags & IORING_SETUP_SQWQ) { + } + if (ctx->flags & IORING_SETUP_SQWQ) { int concurrency; /* Do QD, or 2 * CPUS, whatever is smallest */ @@ -1534,7 +1621,8 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx) kthread_park(sqo->thread); kthread_stop(sqo->thread); sqo->thread = NULL; - } else if (sqo->wq) { + } + if (sqo->wq) { destroy_workqueue(sqo->wq); sqo->wq = NULL; } @@ -1792,7 +1880,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, } if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS | - IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) + IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ | + IORING_SETUP_SQPOLL)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4f0a8ce49f9a..bd665d38dd97 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -38,6 +38,7 @@ struct io_uring_iocb { #define IORING_SETUP_FIXEDBUFS (1 << 1) /* IO buffers are fixed */ #define IORING_SETUP_SQTHREAD (1 << 2) /* Use SQ thread */ #define IORING_SETUP_SQWQ (1 << 3) /* Use SQ workqueue */ +#define IORING_SETUP_SQPOLL (1 << 4) /* SQ thread polls */ #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 @@ -76,6 +77,8 @@ struct io_sqring_offsets { __u32 resv[3]; }; +#define IORING_SQ_NEED_WAKEUP (1 << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
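The two halves of the wakeup protocol in this patch can be modelled in a few lines of user-space C. This is a sketch under assumptions: sq_needs_enter() and sq_thread_keeps_spinning() are hypothetical helper names, the C11 fence stands in for the barrier() mentioned in the commit message, and the second helper only mirrors the "inflight || !time_after(jiffies, timeout)" test from io_sq_thread().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IORING_SQ_NEED_WAKEUP (1U << 0)  /* value taken from the patch */

/*
 * Application side: after filling the SQ ring, check whether the kernel
 * thread went to sleep and therefore needs an io_uring_enter() call.
 */
static bool sq_needs_enter(const volatile unsigned *sq_flags)
{
	atomic_thread_fence(memory_order_seq_cst);  /* assumption: stand-in for barrier() */
	return (*sq_flags & IORING_SQ_NEED_WAKEUP) != 0;
}

/*
 * Kernel-thread side: keep spinning while IO is in flight or while the
 * idle grace period has not yet expired (wrap-safe, like time_after()).
 */
static bool sq_thread_keeps_spinning(unsigned inflight, unsigned long now,
				     unsigned long timeout)
{
	bool expired = (long)(timeout - now) < 0;

	return inflight || !expired;
}
```

An application would call io_uring_enter(fd, to_submit, 0, 0) only when sq_needs_enter() reports true, which is the guard described in the commit message.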
* [PATCH 16/16] io_uring: add io_uring_event cache hit information 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (14 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 15/16] io_uring: add submission polling Jens Axboe @ 2019-01-08 16:56 ` Jens Axboe 2019-01-09 16:00 ` [PATCHSET v1] io_uring IO interface Matthew Wilcox 16 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-08 16:56 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe Add hint on whether a read was served out of the page cache, or if it hit media. This is useful for buffered async IO, O_DIRECT reads would never have this set (for obvious reasons). Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/io_uring.c | 6 +++++- include/uapi/linux/io_uring.h | 5 +++++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c10841e4342..50b9cfa8c861 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -524,12 +524,16 @@ static void io_fill_cq_error(struct io_ring_ctx *ctx, unsigned ki_index, static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + unsigned ev_flags = 0; kiocb_end_write(kiocb); fput(kiocb->ki_filp); - io_complete_scqring(iocb, res, 0); + if (res > 0 && test_bit(KIOCB_F_FORCE_NONBLOCK, &iocb->ki_flags)) + ev_flags = IOEV_FLAG_CACHEHIT; + + io_complete_scqring(iocb, res, ev_flags); } static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index bd665d38dd97..7dd21126f142 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -56,6 +56,11 @@ struct io_uring_event { __u32 flags; }; +/* + * io_uring_event->flags + */ +#define IOEV_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + /* * Magic offsets for the application to mmap the data it needs */ -- 
2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
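On the application side, the hint added by this patch could be consumed roughly as follows. This is a minimal sketch: the struct layout and flag value are copied from the v1 posting, while count_cache_hits() is a hypothetical helper, not part of the patch or of liburing.

```c
#include <assert.h>
#include <stdint.h>

#define IOEV_FLAG_CACHEHIT (1U << 0)  /* IO did not hit media */

/* Layout as shown in the v1 cover letter. */
struct io_uring_event {
	uint64_t index;
	int32_t  res;
	uint32_t flags;
};

/* Count successful completions that were served from the page cache. */
static unsigned count_cache_hits(const struct io_uring_event *ev, unsigned nr)
{
	unsigned hits = 0;

	for (unsigned i = 0; i < nr; i++)
		if (ev[i].res > 0 && (ev[i].flags & IOEV_FLAG_CACHEHIT))
			hits++;
	return hits;
}
```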
* Re: [PATCHSET v1] io_uring IO interface 2019-01-08 16:56 [PATCHSET v1] io_uring IO interface Jens Axboe ` (15 preceding siblings ...) 2019-01-08 16:56 ` [PATCH 16/16] io_uring: add io_uring_event cache hit information Jens Axboe @ 2019-01-09 16:00 ` Matthew Wilcox 2019-01-09 16:27 ` Chris Mason 16 siblings, 1 reply; 62+ messages in thread From: Matthew Wilcox @ 2019-01-09 16:00 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On Tue, Jan 08, 2019 at 09:56:29AM -0700, Jens Axboe wrote: > After some arm twisting from Christoph, I finally caved and divorced the > aio-poll patches from aio/libaio itself. The io_uring interface itself > is useful and efficient, and after rebasing all the new goodies on top > of that, there was little reason to retain the aio connection. > > Hence io_uring was born. This is what I previously called scqring for > aio, but now as a standalone entity. Patch #5 adds the core of this > interface, but in short, it has two main data structures: Please can we call it something that looks a little less like io_urine? aio_ring? io_ring? ring_io? ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCHSET v1] io_uring IO interface 2019-01-09 16:00 ` [PATCHSET v1] io_uring IO interface Matthew Wilcox @ 2019-01-09 16:27 ` Chris Mason 0 siblings, 0 replies; 62+ messages in thread From: Chris Mason @ 2019-01-09 16:27 UTC (permalink / raw) To: Matthew Wilcox Cc: Jens Axboe, linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On 9 Jan 2019, at 11:00, Matthew Wilcox wrote: > On Tue, Jan 08, 2019 at 09:56:29AM -0700, Jens Axboe wrote: >> After some arm twisting from Christoph, I finally caved and divorced >> the >> aio-poll patches from aio/libaio itself. The io_uring interface >> itself >> is useful and efficient, and after rebasing all the new goodies on >> top >> of that, there was little reason to retain the aio connection. >> >> Hence io_uring was born. This is what I previously called scqring for >> aio, but now as a standalone entity. Patch #5 adds the core of this >> interface, but in short, it has two main data structures: > > Please can we call it something that looks a little less like > io_urine? > aio_ring? io_ring? ring_io? We must be using the same fonts, I've been harassing Jens with jokes about this for days. But, I totally missed the link between io_uring and https://twitter.com/axboe/status/927571366085246976 until just now. It's clear that Jens has been planning this for over a year, and he was just waiting for us to catch on. Really though, io_uring is growing on me. The kernel is full of rings and it does help set it apart. -chris ^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCHSET v3] io_uring IO interface @ 2019-01-12 21:29 Jens Axboe 2019-01-12 21:30 ` [PATCH 05/16] Add " Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-12 21:29 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch; +Cc: hch, jmoyer, avi Here's v3 of the io_uring interface. Since data structures etc have changed since the v1 posting, here's a refresher of what io_uring is and how it works. io_uring is a submission queue (SQ) and completion queue (CQ) pair that an application can use to communicate with the kernel for doing IO. This isn't aio/libaio, but it provides a similar set of features, as well as some new ones: - io_uring is a lot more efficient than aio. A lot, and in many ways. - io_uring supports buffered aio. Not just that, but efficiently as well. Cached data isn't punted to an async context. - io_uring supports polled IO, it takes advantage of the blk-mq polling work that went into 5.0-rc. - io_uring supports kernel side submissions for polled IO. This enables IO without ever having to do a system call. - io_uring supports fixed buffers for O_DIRECT. Buffers can be registered after an io_uring context has been set up, which eliminates the need to do get_user_pages() / put_pages() for each and every IO. To use io_uring, you must first set up an io_uring context. This is done through the first of three new system calls: io_uring_setup(entries, params) Sets up a context for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqe's. Once the rings are set up, the application then mmap's these rings to communicate with the kernel. See a sample application I wrote that natively does this: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c IO is done by filling out an io_uring_sqe, and updating the SQ ring. 
The format of the sqe is as follows: struct io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ __u64 off; /* offset into file */ union { void *addr; /* buffer or iovecs */ __u64 __pad; }; __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; __u32 fsync_flags; }; __u16 buf_index; /* index into fixed buffers, if used */ __u16 __pad2; __u32 __pad3; __u64 user_data; /* data to be passed back at completion time */ }; Most of this is self explanatory. The ->user_data field is passed back through a completion event, so the application can track IOs individually. Completions are posted on the CQ ring when an sqe completes. They are of type struct io_uring_cqe, and the format is as follows: struct io_uring_cqe { __u64 user_data; /* sqe->data submission passed back */ __s32 res; /* result code for this event */ __u32 flags; }; To either submit IO or reap completions, there's a 2nd new system call: io_uring_enter(fd, to_submit, min_complete, flags) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. The sample application mentioned above uses the rings directly, but for most use cases, I intend to have the necessary support in a liburing library that abstracts it enough for applications to use in a performant way, without having to deal with the intricacies of the ring. There's already some basic support there and a few test applications, but that side definitely needs some work. Find that repo here: git://git.kernel.dk/liburing io_uring is designed to be fast and scalable. 
I've demonstrated 1.6M 4k IOPS from a single core on my aging test box, and on the latency front, we're also doing extremely well. It's designed to both be async and batching, if you wish, the application gets to control how to use that side. If you want to play with io_uring, see the sample app above, the liburing repo, or the fio io_uring engine as well. Patches are against 5.0-rc1 (ish), and can also be found in my 'io_uring' git branch: git://git.kernel.dk/linux-block io_uring Since v2 - Separate fixed buffers from sqe entries. register/unregister them through the new io_uring_register(2) system call - sqe->index is now sqe->buf_index to make it clearer - fixed buffers require sqe->flags to have IOSQE_FIXED_BUFFER set - Add sqe field that is passed back at completion through the cqe, instead of passing back the original sqe index. This is more useful as it allows per-life of IO data, ->index did not. - Cleanup async IO punting - Don't punt O_DIRECT writes to async handling - Make sq thread just for polling (submissions and completions) - Always enable sq workqueue for async offload - Use GFP_ATOMIC for req allocation - Fix bio_vec being an unknown type on some kconfigs - New IORING_OP_FSYNC implementation - Add fixed fileset support through io_uring_register(2) - Integrate workqueue support into main patchset - Fix io_sq_thread() logic for when to grab current->mm - Fix io_sq_thread() off-by-one - Improve polling performance for multiple files in an io_uring context - Have CONFIG_IO_URING select ANON_INODES - Don't make io_kiocb->ki_flags atomic - Be fully consistent in naming, for some reason we used the same mess that aio.c is, where io_kiocb,kiocb,iocb are used interchangeably. 'req' is now always io_kiocb, 'kiocb' is always kiocb. - Rename KIOCB_F_* flags as they are req flags, REQ_F_*. 
Documentation/filesystems/vfs.txt | 3 + arch/x86/entry/syscalls/syscall_64.tbl | 3 + block/bio.c | 59 +- fs/Makefile | 1 + fs/block_dev.c | 19 +- fs/file.c | 15 +- fs/file_table.c | 9 +- fs/gfs2/file.c | 2 + fs/io_uring.c | 2023 ++++++++++++++++++++++++ fs/iomap.c | 48 +- fs/xfs/xfs_file.c | 1 + include/linux/bio.h | 14 + include/linux/blk_types.h | 1 + include/linux/file.h | 2 + include/linux/fs.h | 6 +- include/linux/iomap.h | 1 + include/linux/syscalls.h | 7 + include/uapi/linux/io_uring.h | 147 ++ init/Kconfig | 9 + kernel/sys_ni.c | 3 + 20 files changed, 2334 insertions(+), 39 deletions(-) -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
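The SQ/CQ mechanics described in the cover letter above can be illustrated with a toy, single-threaded model. Everything here is an assumption made for illustration: the trimmed sqe/cqe structs, toy_ring, and the toy_* functions are hypothetical, the "kernel" is just a function call, the full/empty convention is the toy's own, and the memory barriers the real shared rings need are only noted in comments.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1)

/* Trimmed-down stand-ins for io_uring_sqe / io_uring_cqe. */
struct sqe { uint8_t opcode; int32_t fd; uint64_t off; uint64_t user_data; };
struct cqe { uint64_t user_data; int32_t res; uint32_t flags; };

struct toy_ring {
	uint32_t sq_head, sq_tail;
	uint32_t sq_array[RING_ENTRIES];  /* SQ ring: indexes into sqes[] */
	struct sqe sqes[RING_ENTRIES];
	uint32_t cq_head, cq_tail;
	struct cqe cqes[RING_ENTRIES];    /* CQ ring: the events themselves */
};

/* Application: fill sqes[idx] and publish idx in the SQ ring. */
static int toy_submit(struct toy_ring *r, uint32_t idx, const struct sqe *s)
{
	if (r->sq_tail - r->sq_head == RING_ENTRIES)
		return -1;                 /* SQ ring full */
	r->sqes[idx & RING_MASK] = *s;
	r->sq_array[r->sq_tail & RING_MASK] = idx & RING_MASK;
	r->sq_tail++;                      /* real code: write barrier before this */
	return 0;
}

/* "Kernel": consume queued sqes and post one cqe for each. */
static unsigned toy_kernel_run(struct toy_ring *r)
{
	unsigned done = 0;

	while (r->sq_head != r->sq_tail &&
	       r->cq_tail - r->cq_head != RING_ENTRIES) {
		const struct sqe *s = &r->sqes[r->sq_array[r->sq_head & RING_MASK]];
		struct cqe *c = &r->cqes[r->cq_tail & RING_MASK];

		c->user_data = s->user_data;  /* passed back untouched */
		c->res = 0;
		c->flags = 0;
		r->cq_tail++;
		r->sq_head++;
		done++;
	}
	return done;
}

/* Application: reap one completion; returns 1 if one was available. */
static int toy_reap(struct toy_ring *r, struct cqe *out)
{
	if (r->cq_head == r->cq_tail)
		return 0;
	*out = r->cqes[r->cq_head & RING_MASK];
	r->cq_head++;
	return 1;
}
```

The indirection through sq_array is what lets an application submit sqes that are not contiguous in the sqe array, while user_data lets it match each completion to the request that produced it.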
* [PATCH 05/16] Add io_uring IO interface 2019-01-12 21:29 [PATCHSET v3] " Jens Axboe @ 2019-01-12 21:30 ` Jens Axboe 2019-01-12 21:30 ` Jens Axboe 0 siblings, 1 reply; 62+ messages in thread From: Jens Axboe @ 2019-01-12 21:30 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered and can point to any io_uring_sqe. Two new system calls are added for this: io_uring_setup(entries, iovecs, params) Sets up a context for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. 
We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 925 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 96 +++ init/Kconfig | 9 + kernel/sys_ni.c | 2 + 7 files changed, 1040 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..df8fe19cdb74 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,925 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. 
+ * + * Copyright (C) 2019 Jens Axboe + */ +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/syscalls.h> +#include <linux/refcount.h> +#include <linux/uio.h> + +#include <linux/sched/signal.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/mmu_context.h> +#include <linux/percpu.h> +#include <linux/slab.h> +#include <linux/workqueue.h> +#include <linux/blkdev.h> +#include <linux/anon_inodes.h> +#include <linux/sched/mm.h> + +#include <linux/uaccess.h> +#include <linux/nospec.h> + +#include <uapi/linux/io_uring.h> + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + unsigned sq_thread_cpu; + struct io_uring_sqe *sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + /* IO offload */ + struct workqueue_struct *sqo_wq; + struct mm_struct *sqo_mm; + struct files_struct *sqo_files; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +struct io_work { + struct work_struct work; + struct sqe_submit submit; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct io_work work; + }; + + struct io_ring_ctx *ki_ctx; + struct list_head ki_list; + unsigned 
long ki_flags; +#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ + u64 ki_user_data; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *req_cachep; + +static const struct file_operations io_scqring_fops; + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + + init_completion(&ctx->ctx_done); + + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->cqes[tail & ctx->cq_mask]; +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); + if (!req) + return NULL; + + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_req(struct io_kiocb *req) +{ + io_ring_drop_ctx_refs(req->ki_ctx, 1); + kmem_cache_free(req_cachep, req); +} + +static void kiocb_end_write(struct kiocb 
*kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + spin_lock_irqsave(&ctx->completion_lock, flags); + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->user_data = ki_user_data; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(req->ki_ctx, req->ki_user_data, res, 0); + io_free_req(req); +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct kiocb *kiocb = &req->rw; + int ret; + + kiocb->ki_filp = fget(sqe->fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + kiocb->ki_pos = sqe->off; + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + kiocb->ki_ioprio = sqe->ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + + ret = 
kiocb_set_rw_flags(kiocb, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (force_nonblock) { + kiocb->ki_flags |= IOCB_NOWAIT; + req->ki_flags |= REQ_F_FORCE_NONBLOCK; + } + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + kiocb->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(kiocb->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. + */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + if (!ret) { + ssize_t ret2; + + /* Catch -EAGAIN return for forced non-blocking submission */ + ret2 = call_read_iter(file, kiocb, &iter); + if (!force_nonblock || ret2 != -EAGAIN) + io_rw_done(kiocb, ret2); + else + ret = -EAGAIN; + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec 
inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EAGAIN; + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + goto out_fput; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, + iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_scqring_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. 
+ */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + kiocb->ki_flags |= IOCB_WRITE; + io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s, bool force_nonblock) +{ + const struct io_uring_sqe *sqe = s->sqe; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags || sqe->__pad2)) + return -EINVAL; + + if (unlikely(s->index >= ctx->sq_entries)) + return -EINVAL; + req->ki_user_data = sqe->user_data; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_READV: + ret = io_read(req, sqe, force_nonblock); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe, force_nonblock); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work.work); + struct io_ring_ctx *ctx = req->ki_ctx; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + /* + * Ensure we clear previously set flags. Even if NOWAIT was originally + * set, it's pointless now that we're in an async context. 
+ */ + req->rw.ki_flags &= ~IOCB_NOWAIT; + req->ki_flags &= ~REQ_F_FORCE_NONBLOCK; + + old_files = current->files; + current->files = ctx->sqo_files; + + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + + use_mm(ctx->sqo_mm); + set_fs(USER_DS); + + ret = __io_submit_sqe(ctx, req, &req->work.submit, false); + + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); +err: + if (ret) { + io_fill_cq_error(ctx, &req->work.submit, ret); + io_free_req(req); + } + current->files = old_files; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_kiocb *req; + ssize_t ret; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = __io_submit_sqe(ctx, req, s, true); + if (ret == -EAGAIN) { + memcpy(&req->work.submit, s, sizeof(*s)); + INIT_WORK(&req->work.work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work.work); + ret = 0; + } + if (ret) + io_free_req(req); + + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); 
+ } + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static int io_sq_offload_start(struct io_ring_ctx *ctx) +{ + int ret; + + ctx->sqo_mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. 
+ */ + ret = -EBADF; + ctx->sqo_files = current->files; + if (!ctx->sqo_files) + goto err; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + min(ctx->sq_entries - 1, 2 * num_online_cpus())); + if (!ctx->sqo_wq) { + ret = -ENOMEM; + goto err; + } + + return 0; +err: + if (ctx->sqo_files) + ctx->sqo_files = NULL; + ctx->sqo_mm = NULL; + return ret; +} + +static void io_sq_offload_stop(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) { + destroy_workqueue(ctx->sqo_wq); + ctx->sqo_wq = NULL; + } +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_sq_offload_stop(ctx); + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << 
compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + 
cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. 
+ */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications ask for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. + */ +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..542757a4c898 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include <linux/types.h> #include <linux/aio_abi.h> @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + 
u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..dbbfc02bc0a8 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,96 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include <linux/fs.h> +#include <linux/types.h> + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + union { + void *addr; /* buffer or iovecs */ + __u64 __pad; + }; + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 __pad2; + __u64 user_data; /* data to be passed back at completion time */ +}; + +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + 
__u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..ce7bd7af9312 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and complete IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org. For more info on Linux AIO, see: http://www.kvack.org/aio/ Don't email: aart@kvack.org ^ permalink raw reply related [flat|nested] 62+ messages in thread
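For readers following the sizing logic in io_uring_create() above, here is a minimal userspace sketch of the same arithmetic: the requested entry count is rounded up to a power of two for the SQ ring, the CQ ring gets twice that, and each mask is the entry count minus one, so a free-running head or tail counter maps to a slot with a single AND. The names ring_geometry, round_up_pow2(), ring_geometry_for() and ring_slot() are hypothetical and exist only for this illustration; they are not part of the patch or of liburing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model of the sizes computed in io_uring_create(). */
struct ring_geometry {
	uint32_t sq_entries;
	uint32_t cq_entries;
	uint32_t sq_mask;
	uint32_t cq_mask;
};

/* Stand-in for the kernel's roundup_pow_of_two(); valid for 1 <= n <= 2^30. */
static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	assert(n > 0 && n <= (1u << 30));
	while (p < n)
		p <<= 1;
	return p;
}

/* SQ ring: next power of two. CQ ring: twice the SQ ring. Masks: entries - 1. */
static struct ring_geometry ring_geometry_for(uint32_t entries)
{
	struct ring_geometry g;

	g.sq_entries = round_up_pow2(entries);
	g.cq_entries = 2 * g.sq_entries;
	g.sq_mask = g.sq_entries - 1;
	g.cq_mask = g.cq_entries - 1;
	return g;
}

/* Map a free-running counter to a ring slot, as in `tail & ctx->cq_mask`. */
static uint32_t ring_slot(uint32_t counter, uint32_t mask)
{
	return counter & mask;
}
```

Under these assumptions, a request for 100 entries would yield a 128-entry SQ ring and a 256-entry CQ ring, which is why the masks can replace a modulo when indexing the shared arrays.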
* [PATCH 05/16] Add io_uring IO interface 2019-01-12 21:30 ` [PATCH 05/16] Add " Jens Axboe @ 2019-01-12 21:30 ` Jens Axboe 0 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-12 21:30 UTC (permalink / raw) To: linux-fsdevel, linux-aio, linux-block, linux-arch Cc: hch, jmoyer, avi, Jens Axboe The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an array of indexes into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered and can correspond to any io_uring_sqe. Two new system calls are added for this: io_uring_setup(entries, params) Sets up a context for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. 
We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 925 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 96 +++ init/Kconfig | 9 + kernel/sys_ni.c | 2 + 7 files changed, 1040 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..df8fe19cdb74 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,925 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. 
+ * + * Copyright (C) 2019 Jens Axboe + */ +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/syscalls.h> +#include <linux/refcount.h> +#include <linux/uio.h> + +#include <linux/sched/signal.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/mmu_context.h> +#include <linux/percpu.h> +#include <linux/slab.h> +#include <linux/workqueue.h> +#include <linux/blkdev.h> +#include <linux/anon_inodes.h> +#include <linux/sched/mm.h> + +#include <linux/uaccess.h> +#include <linux/nospec.h> + +#include <uapi/linux/io_uring.h> + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + unsigned sq_thread_cpu; + struct io_uring_sqe *sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + /* IO offload */ + struct workqueue_struct *sqo_wq; + struct mm_struct *sqo_mm; + struct files_struct *sqo_files; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +struct io_work { + struct work_struct work; + struct sqe_submit submit; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct io_work work; + }; + + struct io_ring_ctx *ki_ctx; + struct list_head ki_list; + unsigned 
long ki_flags; +#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ + u64 ki_user_data; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *req_cachep; + +static const struct file_operations io_scqring_fops; + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + + init_completion(&ctx->ctx_done); + + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->cqes[tail & ctx->cq_mask]; +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); + if (!req) + return NULL; + + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_req(struct io_kiocb *req) +{ + kmem_cache_free(req_cachep, req); + io_ring_drop_ctx_refs(req->ki_ctx, 1); +} + +static void kiocb_end_write(struct kiocb 
*kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + spin_lock_irqsave(&ctx->completion_lock, flags); + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->user_data = ki_user_data; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(req->ki_ctx, req->ki_user_data, res, 0); + io_free_req(req); +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct kiocb *kiocb = &req->rw; + int ret; + + kiocb->ki_filp = fget(sqe->fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + kiocb->ki_pos = sqe->off; + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + kiocb->ki_ioprio = sqe->ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + + ret = 
kiocb_set_rw_flags(kiocb, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (force_nonblock) { + kiocb->ki_flags |= IOCB_NOWAIT; + req->ki_flags |= REQ_F_FORCE_NONBLOCK; + } + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + kiocb->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(kiocb->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. + */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + if (!ret) { + ssize_t ret2; + + /* Catch -EAGAIN return for forced non-blocking submission */ + ret2 = call_read_iter(file, kiocb, &iter); + if (!force_nonblock || ret2 != -EAGAIN) + io_rw_done(kiocb, ret2); + else + ret = -EAGAIN; + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec 
inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EAGAIN; + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + goto out_fput; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, + iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_scqring_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. 
+ */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + kiocb->ki_flags |= IOCB_WRITE; + io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s, bool force_nonblock) +{ + const struct io_uring_sqe *sqe = s->sqe; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags || sqe->__pad2)) + return -EINVAL; + + if (unlikely(s->index >= ctx->sq_entries)) + return -EINVAL; + req->ki_user_data = sqe->user_data; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_READV: + ret = io_read(req, sqe, force_nonblock); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe, force_nonblock); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work.work); + struct io_ring_ctx *ctx = req->ki_ctx; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + /* + * Ensure we clear previously set flags. Even if NOWAIT was originally + * set, it's pointless now that we're in an async context. 
+ */ + req->rw.ki_flags &= ~IOCB_NOWAIT; + req->ki_flags &= ~REQ_F_FORCE_NONBLOCK; + + old_files = current->files; + current->files = ctx->sqo_files; + + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + + use_mm(ctx->sqo_mm); + set_fs(USER_DS); + + ret = __io_submit_sqe(ctx, req, &req->work.submit, false); + + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); +err: + if (ret) { + io_fill_cq_error(ctx, &req->work.submit, ret); + io_free_req(req); + } + current->files = old_files; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_kiocb *req; + ssize_t ret; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = __io_submit_sqe(ctx, req, s, true); + if (ret == -EAGAIN) { + memcpy(&req->work.submit, s, sizeof(*s)); + INIT_WORK(&req->work.work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work.work); + ret = 0; + } + if (ret) + io_free_req(req); + + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); 
+ } + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static int io_sq_offload_start(struct io_ring_ctx *ctx) +{ + int ret; + + ctx->sqo_mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. 
+ */ + ret = -EBADF; + ctx->sqo_files = current->files; + if (!ctx->sqo_files) + goto err; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + min(ctx->sq_entries - 1, 2 * num_online_cpus())); + if (!ctx->sqo_wq) { + ret = -ENOMEM; + goto err; + } + + return 0; +err: + if (ctx->sqo_files) + ctx->sqo_files = NULL; + ctx->sqo_mm = NULL; + return ret; +} + +static void io_sq_offload_stop(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) { + destroy_workqueue(ctx->sqo_wq); + ctx->sqo_wq = NULL; + } +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_sq_offload_stop(ctx); + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << 
compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + 
cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. 
+ */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an io_uring context, and returns the fd. The application asks for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. + */ +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..542757a4c898 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include <linux/types.h> #include <linux/aio_abi.h> @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + 
u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..dbbfc02bc0a8 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,96 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include <linux/fs.h> +#include <linux/types.h> + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + union { + void *addr; /* buffer or iovecs */ + __u64 __pad; + }; + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 __pad2; + __u64 user_data; /* data to be passed back at completion time */ +}; + +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + 
__u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..ce7bd7af9312 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and complete IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
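A rough sketch of how an application might turn the io_sqring_offsets values from the header above into usable pointers once the SQ ring has been mapped at IORING_OFF_SQ_RING. This is illustrative only and rests on assumptions: struct demo_sq_ring is a hypothetical stand-in for the kernel-private ring layout, none of the demo_* names are part of the patch, and a real application would get the base pointer from mmap(2) on the io_uring fd rather than from a local variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the named fields of struct io_sqring_offsets (resv omitted). */
struct demo_sq_offsets {
	uint32_t head, tail, ring_mask, ring_entries, flags, dropped, array;
};

/* Hypothetical stand-in for the kernel-private SQ ring layout. */
struct demo_sq_ring {
	uint32_t head;
	uint32_t tail;
	uint32_t ring_mask;
	uint32_t ring_entries;
	uint32_t dropped;
	uint32_t flags;
	uint32_t array[4];
};

/* Pointers into the mapped ring, resolved from the offsets. */
struct demo_sq_view {
	uint32_t *head, *tail, *ring_mask, *ring_entries;
	uint32_t *flags, *dropped, *array;
};

/* Kernel side: the same idea as io_fill_offsets(), for the stand-in layout. */
static void demo_fill_offsets(struct demo_sq_offsets *o)
{
	o->head = offsetof(struct demo_sq_ring, head);
	o->tail = offsetof(struct demo_sq_ring, tail);
	o->ring_mask = offsetof(struct demo_sq_ring, ring_mask);
	o->ring_entries = offsetof(struct demo_sq_ring, ring_entries);
	o->flags = offsetof(struct demo_sq_ring, flags);
	o->dropped = offsetof(struct demo_sq_ring, dropped);
	o->array = offsetof(struct demo_sq_ring, array);
}

/* Application side: add each offset to the base of the mapping. */
static void demo_resolve(void *base, const struct demo_sq_offsets *o,
			 struct demo_sq_view *v)
{
	unsigned char *p = base;

	v->head = (uint32_t *)(p + o->head);
	v->tail = (uint32_t *)(p + o->tail);
	v->ring_mask = (uint32_t *)(p + o->ring_mask);
	v->ring_entries = (uint32_t *)(p + o->ring_entries);
	v->flags = (uint32_t *)(p + o->flags);
	v->dropped = (uint32_t *)(p + o->dropped);
	v->array = (uint32_t *)(p + o->array);
}

/* Returns 0 if the resolved pointers land on the expected fields. */
static int demo_offsets_selfcheck(void)
{
	struct demo_sq_ring ring = { .ring_mask = 3, .ring_entries = 4 };
	struct demo_sq_offsets off;
	struct demo_sq_view v;

	demo_fill_offsets(&off);
	demo_resolve(&ring, &off, &v);

	assert(v.array == ring.array);
	if (*v.ring_mask != 3 || *v.ring_entries != 4)
		return -1;

	*v.tail = 1;	/* the application publishes one entry */
	return ring.tail == 1 ? 0 : -1;
}
```

Because the application only ever goes through the offsets, the kernel remains free to reorder or pad its private ring structure without breaking userspace, which appears to be the reason for exporting offsets instead of the structure itself.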
* (unknown), @ 2019-01-15 2:55 Jens Axboe 2019-01-15 2:55 ` [PATCH 05/16] Add io_uring IO interface Jens Axboe 0 siblings, 1 reply; 62+ messages in thread
From: Jens Axboe @ 2019-01-15 2:55 UTC (permalink / raw)
To: linux-fsdevel, linux-aio, linux-block, linux-arch; +Cc: hch, jmoyer, avi

Here's v4 of the io_uring interface. No user visible changes this time,
outside of bumping the io_uring_sqe submission entry to a full 64 bytes.
This aligns better with caches, and leaves us some room to grow for
future features.

See the v3 posting for full details on the API:

https://lore.kernel.org/linux-block/20190112213011.1439-1-axboe@kernel.dk/

What I neglected to mention in the v3 posting is that the fixed buffer
and fixed file interfaces are available through the io_uring_register()
system call. This means they can be registered (and unregistered)
independently of the io_uring context setup.

Patches are against 5.0-rc2 and can also be found in my 'io_uring' git
branch:

git://git.kernel.dk/linux-block io_uring

Changes since v3:
- Clean up fixed buffer index validation
- Add IORING_OP_NOP for ring perf testing
- Drop struct io_kiocb ki_* variable prefix, it clashes with struct kiocb
  for no reason except to cause confusion
- Bump io_uring_sqe to 64 bytes. Cacheline sized and aligned (on x86-64),
  and more future proof
- Use kmalloc_array()
- Make the page mlock rlimit incremental and not for root / CAP_IPC_LOCK
- Ensure io_uring_register() can't race with fops->release()
- Simplify and improve iopoll implementation
- Use FOLL_WRITE instead of open-coding it
- Fix 32-bit vs 64-bit sizing for the io_uring_register() structs
- Added x86 32-bit system calls
- Added 32-bit compat mode
- Rebased on 5.0-rc2

 Documentation/filesystems/vfs.txt      |    3 +
 arch/x86/entry/syscalls/syscall_32.tbl |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   59 +-
 fs/Makefile                            |    1 +
 fs/block_dev.c                         |   19 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/gfs2/file.c                         |    2 +
 fs/io_uring.c                          | 2072 ++++++++++++++++++++++++
 fs/iomap.c                             |   48 +-
 fs/xfs/xfs_file.c                      |    1 +
 include/linux/bio.h                    |   14 +
 include/linux/blk_types.h              |    1 +
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |    6 +-
 include/linux/iomap.h                  |    1 +
 include/linux/sched/user.h             |    2 +-
 include/linux/syscalls.h               |    7 +
 include/uapi/linux/io_uring.h          |  155 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 22 files changed, 2395 insertions(+), 40 deletions(-)

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 62+ messages in thread
* [PATCH 05/16] Add io_uring IO interface 2019-01-15 2:55 (unknown), Jens Axboe @ 2019-01-15 2:55 ` Jens Axboe 2019-01-15 2:55 ` Jens Axboe ` (2 more replies) 0 siblings, 3 replies; 62+ messages in thread
From: Jens Axboe @ 2019-01-15 2:55 UTC (permalink / raw)
To: linux-fsdevel, linux-aio, linux-block, linux-arch
Cc: hch, jmoyer, avi, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an array of indexes into the io_uring_sqe array, which makes
it possible to submit a batch of IOs without them being contiguous in
the ring. The CQ ring is always contiguous, as completion events are
inherently unordered and can point to any io_uring_sqe.

Two new system calls are added for this:

io_uring_setup(entries, params)
        Sets up a context for doing async IO. On success, returns a file
        descriptor that the application can mmap to gain access to the
        SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags)
        Initiates IO against the rings mapped to this fd, or waits for
        them to complete, or both. The behavior is controlled by the
        parameters passed in. If 'to_submit' is non-zero, then we'll
        try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
        kernel will wait for 'min_complete' events, if they aren't
        already available.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
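
To make the SQ ring indirection described above concrete, here is a small single-threaded sketch in plain C. It is not the kernel code: the demo_* names and the trimmed-down sqe are hypothetical, and the memory barriers that a really shared ring needs (smp_wmb()/smp_rmb() on the kernel side, matching barriers in the application) are left out on purpose, since producer and consumer run in the same thread here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ENTRIES 4u			/* power of two, like the real ring */
#define DEMO_MASK    (DEMO_ENTRIES - 1)

/* Hypothetical, trimmed-down submission entry. */
struct demo_sqe {
	uint64_t user_data;
};

struct demo_sq {
	uint32_t head;			/* consumer position (kernel side) */
	uint32_t tail;			/* producer position (application side) */
	uint32_t array[DEMO_ENTRIES];	/* indexes into sqes[] */
	struct demo_sqe sqes[DEMO_ENTRIES];
};

/* Producer: publish the sqe stored in slot 'sqe_index'. */
static int demo_sq_push(struct demo_sq *sq, uint32_t sqe_index)
{
	if (sq->tail - sq->head == DEMO_ENTRIES)
		return -1;		/* ring is full */
	sq->array[sq->tail & DEMO_MASK] = sqe_index;
	sq->tail++;
	return 0;
}

/*
 * Consumer: fetch the next submitted sqe. Returns NULL if the ring is
 * empty, or if the published index was out of range (the entry is then
 * consumed and dropped).
 */
static const struct demo_sqe *demo_sq_pop(struct demo_sq *sq)
{
	uint32_t index;

	if (sq->head == sq->tail)
		return NULL;
	index = sq->array[sq->head & DEMO_MASK];
	sq->head++;
	if (index >= DEMO_ENTRIES)
		return NULL;
	return &sq->sqes[index];
}

/*
 * Submits sqe slots 3 and 1, in that order, and returns the user_data
 * values in the order the consumer sees them, packed as
 * (first << 8) | second.
 */
static uint64_t demo_sq_selfcheck(void)
{
	struct demo_sq sq = { 0 };
	const struct demo_sqe *a, *b;

	sq.sqes[1].user_data = 0x11;
	sq.sqes[3].user_data = 0x33;
	demo_sq_push(&sq, 3);
	demo_sq_push(&sq, 1);

	a = demo_sq_pop(&sq);
	b = demo_sq_pop(&sq);
	assert(a && b && demo_sq_pop(&sq) == NULL);
	return (a->user_data << 8) | b->user_data;
}
```

The point of the extra array is visible in demo_sq_selfcheck(): the application can fill sqe slots in any order and at any time, and only the order of the indexes it publishes decides the submission order.
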
Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/x86/entry/syscalls/syscall_32.tbl | 2 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 977 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 97 +++ init/Kconfig | 9 + kernel/sys_ni.c | 2 + 8 files changed, 1095 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 3cf7b533b3d1..194e79c0032e 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,3 +398,5 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq +387 i386 io_uring_setup sys_io_uring_setup __ia32_compat_sys_io_uring_setup +388 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile 
b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..148eb3af7dc4 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,977 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. + * + * Copyright (C) 2019 Jens Axboe + */ +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/syscalls.h> +#include <linux/compat.h> +#include <linux/refcount.h> +#include <linux/uio.h> + +#include <linux/sched/signal.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/mmu_context.h> +#include <linux/percpu.h> +#include <linux/slab.h> +#include <linux/workqueue.h> +#include <linux/blkdev.h> +#include <linux/anon_inodes.h> +#include <linux/sched/mm.h> + +#include <linux/uaccess.h> +#include <linux/nospec.h> + +#include <uapi/linux/io_uring.h> + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + bool compat; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + unsigned sq_thread_cpu; + struct io_uring_sqe 
*sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + /* IO offload */ + struct workqueue_struct *sqo_wq; + struct mm_struct *sqo_mm; + struct files_struct *sqo_files; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +struct io_work { + struct work_struct work; + struct sqe_submit submit; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct io_work work; + }; + + struct io_ring_ctx *ctx; + struct list_head list; + unsigned long flags; +#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ + u64 user_data; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *req_cachep; + +static const struct file_operations io_uring_fops; + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + init_completion(&ctx->ctx_done); + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return 
&ring->cqes[tail & ctx->cq_mask]; +} + +static void __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->user_data = ki_user_data; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + __io_cqring_fill_event(ctx, ki_user_data, res, ev_flags); + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); + if (!req) + return NULL; + + req->ctx = ctx; + INIT_LIST_HEAD(&req->list); + req->flags = 0; + return req; +} + +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_req(struct io_kiocb *req) +{ + io_ring_drop_ctx_refs(req->ctx, 1); + kmem_cache_free(req_cachep, req); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. 
+ */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_complete_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(req->ctx, req->user_data, res, 0); + io_free_req(req); +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct kiocb *kiocb = &req->rw; + int ret; + + kiocb->ki_filp = fget(sqe->fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + kiocb->ki_pos = sqe->off; + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + kiocb->ki_ioprio = sqe->ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (force_nonblock) { + kiocb->ki_flags |= IOCB_NOWAIT; + req->flags |= REQ_F_FORCE_NONBLOCK; + } + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + kiocb->ki_complete = io_complete_rw; + return 0; +out_fput: + fput(kiocb->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. 
+ */ + ret = -EINTR; + /*FALLTHRU*/ + default: + kiocb->ki_complete(kiocb, ret, 0); + } +} + +static int io_import_iovec(struct io_ring_ctx *ctx, int rw, + const struct io_uring_sqe *sqe, + struct iovec **iovec, struct iov_iter *iter) +{ + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + +#ifdef CONFIG_COMPAT + if (ctx->compat) + return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV, + iovec, iter); +#endif + return import_iovec(rw, buf, sqe->len, UIO_FASTIOV, iovec, iter); +} + +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + if (!ret) { + ssize_t ret2; + + /* Catch -EAGAIN return for forced non-blocking submission */ + ret2 = call_read_iter(file, kiocb, &iter); + if (!force_nonblock || ret2 != -EAGAIN) + io_rw_done(kiocb, ret2); + else + ret = -EAGAIN; + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EAGAIN; + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + goto out_fput; + + ret = -EBADF; + if 
(unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = io_import_iovec(req->ctx, WRITE, sqe, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, + iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. + */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + kiocb->ki_flags |= IOCB_WRITE; + io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +/* + * IORING_OP_NOP just posts a completion event, nothing else. 
+ */ +static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_ring_ctx *ctx = req->ctx; + + __io_cqring_fill_event(ctx, sqe->user_data, 0, 0); + io_free_req(req); + return 0; +} + +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s, bool force_nonblock) +{ + const struct io_uring_sqe *sqe = s->sqe; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags)) + return -EINVAL; + + if (unlikely(s->index >= ctx->sq_entries)) + return -EINVAL; + req->user_data = sqe->user_data; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_NOP: + ret = io_nop(req, sqe); + break; + case IORING_OP_READV: + ret = io_read(req, sqe, force_nonblock); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe, force_nonblock); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work.work); + struct io_ring_ctx *ctx = req->ctx; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + /* + * Ensure we clear previously set flags. Even if NOWAIT was originally + * set, it's pointless now that we're in an async context. 
+ */ + req->rw.ki_flags &= ~IOCB_NOWAIT; + req->flags &= ~REQ_F_FORCE_NONBLOCK; + + old_files = current->files; + current->files = ctx->sqo_files; + + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + + use_mm(ctx->sqo_mm); + set_fs(USER_DS); + + ret = __io_submit_sqe(ctx, req, &req->work.submit, false); + + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); +err: + if (ret) { + io_fill_cq_error(ctx, &req->work.submit, ret); + io_free_req(req); + } + current->files = old_files; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_kiocb *req; + ssize_t ret; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = __io_submit_sqe(ctx, req, s, true); + if (ret == -EAGAIN) { + memcpy(&req->work.submit, s, sizeof(*s)); + INIT_WORK(&req->work.work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work.work); + ret = 0; + } + if (ret) + io_free_req(req); + + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } 
+ + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static int io_sq_offload_start(struct io_ring_ctx *ctx) +{ + int ret; + + ctx->sqo_mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. 
+ */ + ret = -EBADF; + ctx->sqo_files = current->files; + if (!ctx->sqo_files) + goto err; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + min(ctx->sq_entries - 1, 2 * num_online_cpus())); + if (!ctx->sqo_wq) { + ret = -ENOMEM; + goto err; + } + + return 0; +err: + if (ctx->sqo_files) + ctx->sqo_files = NULL; + ctx->sqo_mm = NULL; + return ret; +} + +static void io_sq_offload_stop(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) { + destroy_workqueue(ctx->sqo_wq); + ctx->sqo_wq = NULL; + } +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_sq_offload_stop(ctx); + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + mutex_lock(&ctx->uring_lock); + percpu_ref_kill(&ctx->refs); + mutex_unlock(&ctx->uring_lock); + + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_uring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + 
page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_uring_fops = { + .release = io_uring_release, + .mmap = io_uring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + 
cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p, + bool compat) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. 
+ */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + ctx->compat = compat; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_uring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an io_uring context, and returns the fd. Applications ask for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. + */ +static long io_uring_setup(u32 entries, struct io_uring_params __user *params, + bool compat) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p, compat); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params, false); +} + +#ifdef CONFIG_COMPAT +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params, true); +} +#endif + +static int __init io_uring_init(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +} +__initcall(io_uring_init); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..542757a4c898 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include <linux/types.h> #include
<linux/aio_abi.h> @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..a1ebaa09e1b8 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,97 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include <linux/fs.h> +#include <linux/types.h> + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + union { + void *addr; /* buffer or iovecs */ + __u64 __pad; + }; + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 user_data; /* data to be passed back at completion time */ + __u64 __pad2[3]; +}; + +#define IORING_OP_NOP 0 +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 
0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..ce7bd7af9312 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and complete IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 -- To unsubscribe, send a message with 'unsubscribe linux-aio' in the body to majordomo@kvack.org.
* [PATCH 05/16] Add io_uring IO interface 2019-01-15 2:55 ` [PATCH 05/16] Add io_uring IO interface Jens Axboe @ 2019-01-15 2:55 ` Jens Axboe 2019-01-15 16:51 ` Jonathan Corbet 2019-01-16 10:41 ` Arnd Bergmann 2 siblings, 0 replies; 62+ messages in thread
From: Jens Axboe @ 2019-01-15 2:55 UTC (permalink / raw)
To: linux-fsdevel, linux-aio, linux-block, linux-arch
Cc: hch, jmoyer, avi, Jens Axboe

The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an array of indexes into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered and can refer to any io_uring_sqe.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available.

With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur.
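
To make the SQ ring indirection described above a bit more concrete, here is a minimal userspace sketch. It is not part of the patch: the demo_* names and the four-entry ring are made up for illustration, and it models a single thread only, so it leaves out the memory barriers that a ring shared with the kernel would need. The consumer side is loosely modelled on io_peek_sqring() and io_inc_sqring() in fs/io_uring.c from this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ENTRIES 4u                 /* hypothetical size, must be a power of two */
#define DEMO_MASK    (DEMO_ENTRIES - 1)

/* Hypothetical stand-in for the shared SQ ring; not the kernel's layout. */
struct demo_sq {
	uint32_t head;                  /* consumer index, free-running */
	uint32_t tail;                  /* producer index, free-running */
	uint32_t dropped;               /* invalid entries skipped by the consumer */
	uint32_t array[DEMO_ENTRIES];   /* indexes into a separate sqe array */
};

/* Producer side: publish the sqe slot 'sqe_index'. Returns false if full. */
static bool demo_sq_push(struct demo_sq *sq, uint32_t sqe_index)
{
	if (sq->tail - sq->head == DEMO_ENTRIES)
		return false;
	sq->array[sq->tail & DEMO_MASK] = sqe_index;
	sq->tail++;
	return true;
}

/*
 * Consumer side: fetch the next sqe index. An out-of-range index is
 * consumed, counted in 'dropped', and reported as a failure.
 */
static bool demo_sq_pop(struct demo_sq *sq, uint32_t *sqe_index)
{
	uint32_t idx;

	if (sq->head == sq->tail)
		return false;           /* ring is empty */
	idx = sq->array[sq->head & DEMO_MASK];
	sq->head++;
	if (idx >= DEMO_ENTRIES) {
		sq->dropped++;
		return false;
	}
	*sqe_index = idx;
	return true;
}
```

Because the ring holds indexes rather than the sqes themselves, pushing slot 3 followed by slot 0 hands them to the consumer in that order, even though the slots are not adjacent in the sqe array.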
Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe <axboe@kernel.dk> --- arch/x86/entry/syscalls/syscall_32.tbl | 2 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 977 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 97 +++ init/Kconfig | 9 + kernel/sys_ni.c | 2 + 8 files changed, 1095 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 3cf7b533b3d1..194e79c0032e 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -398,3 +398,5 @@ 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents 386 i386 rseq sys_rseq __ia32_sys_rseq +387 i386 io_uring_setup sys_io_uring_setup __ia32_compat_sys_io_uring_setup +388 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile 
b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..148eb3af7dc4 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,977 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. + * + * Copyright (C) 2019 Jens Axboe + */ +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/syscalls.h> +#include <linux/compat.h> +#include <linux/refcount.h> +#include <linux/uio.h> + +#include <linux/sched/signal.h> +#include <linux/fs.h> +#include <linux/file.h> +#include <linux/fdtable.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/mmu_context.h> +#include <linux/percpu.h> +#include <linux/slab.h> +#include <linux/workqueue.h> +#include <linux/blkdev.h> +#include <linux/anon_inodes.h> +#include <linux/sched/mm.h> + +#include <linux/uaccess.h> +#include <linux/nospec.h> + +#include <uapi/linux/io_uring.h> + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + bool compat; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + unsigned sq_thread_cpu; + struct io_uring_sqe 
*sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + /* IO offload */ + struct workqueue_struct *sqo_wq; + struct mm_struct *sqo_mm; + struct files_struct *sqo_files; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +struct io_work { + struct work_struct work; + struct sqe_submit submit; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct io_work work; + }; + + struct io_ring_ctx *ctx; + struct list_head list; + unsigned long flags; +#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ + u64 user_data; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *req_cachep; + +static const struct file_operations io_uring_fops; + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + init_completion(&ctx->ctx_done); + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return 
&ring->cqes[tail & ctx->cq_mask]; +} + +static void __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. + */ + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->user_data = ki_user_data; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + __io_cqring_fill_event(ctx, ki_user_data, res, ev_flags); + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); + if (!req) + return NULL; + + req->ctx = ctx; + INIT_LIST_HEAD(&req->list); + req->flags = 0; + return req; +} + +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_req(struct io_kiocb *req) +{ + io_ring_drop_ctx_refs(req->ctx, 1); + kmem_cache_free(req_cachep, req); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread.
+ */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_complete_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(req->ctx, req->user_data, res, 0); + io_free_req(req); +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct kiocb *kiocb = &req->rw; + int ret; + + kiocb->ki_filp = fget(sqe->fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + kiocb->ki_pos = sqe->off; + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + kiocb->ki_ioprio = sqe->ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (force_nonblock) { + kiocb->ki_flags |= IOCB_NOWAIT; + req->flags |= REQ_F_FORCE_NONBLOCK; + } + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + kiocb->ki_complete = io_complete_rw; + return 0; +out_fput: + fput(kiocb->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. 
+ */ + ret = -EINTR; + /*FALLTHRU*/ + default: + kiocb->ki_complete(kiocb, ret, 0); + } +} + +static int io_import_iovec(struct io_ring_ctx *ctx, int rw, + const struct io_uring_sqe *sqe, + struct iovec **iovec, struct iov_iter *iter) +{ + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + +#ifdef CONFIG_COMPAT + if (ctx->compat) + return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV, + iovec, iter); +#endif + return import_iovec(rw, buf, sqe->len, UIO_FASTIOV, iovec, iter); +} + +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + if (!ret) { + ssize_t ret2; + + /* Catch -EAGAIN return for forced non-blocking submission */ + ret2 = call_read_iter(file, kiocb, &iter); + if (!force_nonblock || ret2 != -EAGAIN) + io_rw_done(kiocb, ret2); + else + ret = -EAGAIN; + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EAGAIN; + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + goto out_fput; + + ret = -EBADF; + if 
(unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = io_import_iovec(req->ctx, WRITE, sqe, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, + iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. + */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + kiocb->ki_flags |= IOCB_WRITE; + io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +/* + * IORING_OP_NOP just posts a completion event, nothing else. 
+ */ +static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_ring_ctx *ctx = req->ctx; + + io_cqring_fill_event(ctx, sqe->user_data, 0, 0); + io_free_req(req); + return 0; +} + +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s, bool force_nonblock) +{ + const struct io_uring_sqe *sqe = s->sqe; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags)) + return -EINVAL; + + if (unlikely(s->index >= ctx->sq_entries)) + return -EINVAL; + req->user_data = sqe->user_data; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_NOP: + ret = io_nop(req, sqe); + break; + case IORING_OP_READV: + ret = io_read(req, sqe, force_nonblock); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe, force_nonblock); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work.work); + struct io_ring_ctx *ctx = req->ctx; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + /* + * Ensure we clear previously set flags. Even if NOWAIT was originally + * set, it's pointless now that we're in an async context.
+ */ + req->rw.ki_flags &= ~IOCB_NOWAIT; + req->flags &= ~REQ_F_FORCE_NONBLOCK; + + old_files = current->files; + current->files = ctx->sqo_files; + + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + + use_mm(ctx->sqo_mm); + set_fs(USER_DS); + + ret = __io_submit_sqe(ctx, req, &req->work.submit, false); + + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); +err: + if (ret) { + io_fill_cq_error(ctx, &req->work.submit, ret); + io_free_req(req); + } + current->files = old_files; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_kiocb *req; + ssize_t ret; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = __io_submit_sqe(ctx, req, s, true); + if (ret == -EAGAIN) { + memcpy(&req->work.submit, s, sizeof(*s)); + INIT_WORK(&req->work.work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work.work); + ret = 0; + } + if (ret) + io_free_req(req); + + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } 
+ + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static int io_sq_offload_start(struct io_ring_ctx *ctx) +{ + int ret; + + ctx->sqo_mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. 
+ */ + ret = -EBADF; + ctx->sqo_files = current->files; + if (!ctx->sqo_files) + goto err; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + min(ctx->sq_entries - 1, 2 * num_online_cpus())); + if (!ctx->sqo_wq) { + ret = -ENOMEM; + goto err; + } + + return 0; +err: + if (ctx->sqo_files) + ctx->sqo_files = NULL; + ctx->sqo_mm = NULL; + return ret; +} + +static void io_sq_offload_stop(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) { + destroy_workqueue(ctx->sqo_wq); + ctx->sqo_wq = NULL; + } +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_sq_offload_stop(ctx); + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + mutex_lock(&ctx->uring_lock); + percpu_ref_kill(&ctx->refs); + mutex_unlock(&ctx->uring_lock); + + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_uring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + 
page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_uring_fops = { + .release = io_uring_release, + .mmap = io_uring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + 
cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p, + bool compat) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. 
+ */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + ctx->compat = compat; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_uring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications asks for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. + */ +static long io_uring_setup(u32 entries, struct io_uring_params __user *params, + bool compat) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p, compat); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params, false); +} + +#ifdef CONFIG_COMPAT +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params, true); +} +#endif + +static int __init io_uring_init(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_init); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..542757a4c898 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include <linux/types.h> #include 
<linux/aio_abi.h> @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..a1ebaa09e1b8 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,97 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include <linux/fs.h> +#include <linux/types.h> + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + union { + void *addr; /* buffer or iovecs */ + __u64 __pad; + }; + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 user_data; /* data to be passed back at completion time */ + __u64 __pad2[3]; +}; + +#define IORING_OP_NOP 0 +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 
0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..ce7bd7af9312 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and completion IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ -- 2.17.1 ^ permalink raw reply related [flat|nested] 62+ messages in thread
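For readers following the ring sizing in io_uring_create() and the index masking in io_peek_sqring() above, here is a minimal userspace sketch of that arithmetic. It is not part of the patch: roundup_pow2(), struct ring_sizes, ring_sizes_for() and ring_slot() are hypothetical names chosen for illustration, and roundup_pow2() is only a stand-in for the kernel's roundup_pow_of_two().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in for roundup_pow_of_two(). Limited to 2^30 in this
 * sketch so that doubling the result for the CQ ring cannot overflow.
 */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	assert(n >= 1 && n <= (UINT32_C(1) << 30));
	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical container for the values the patch stores in ctx/params. */
struct ring_sizes {
	uint32_t sq_entries;
	uint32_t sq_mask;
	uint32_t cq_entries;
	uint32_t cq_mask;
};

/* Mirrors the sizing in io_uring_create(): the CQ ring is twice the SQ ring. */
static struct ring_sizes ring_sizes_for(uint32_t entries)
{
	struct ring_sizes s;

	s.sq_entries = roundup_pow2(entries);
	s.sq_mask = s.sq_entries - 1;
	s.cq_entries = 2 * s.sq_entries;
	s.cq_mask = s.cq_entries - 1;
	return s;
}

/*
 * head and tail are free-running counters; because the ring size is a
 * power of two, masking yields the slot, as in "head & ctx->sq_mask".
 */
static uint32_t ring_slot(uint32_t counter, uint32_t mask)
{
	return counter & mask;
}
```

With these helpers, a request for 100 entries yields a 128-entry SQ ring and a 256-entry CQ ring, and the counters can wrap freely since only the masked value is used as an index.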
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-15 16:51 ` Jonathan Corbet @ 2019-01-15 16:51 ` Jonathan Corbet 2019-01-15 16:55 ` Jens Axboe 1 sibling, 0 replies; 62+ messages in thread From: Jonathan Corbet @ 2019-01-15 16:51 UTC (permalink / raw) To: Jens Axboe Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On Mon, 14 Jan 2019 19:55:20 -0700 Jens Axboe <axboe@kernel.dk> wrote: So the [0/16] cover letter seems to have gone astray this time? Anyway, a couple of minor notes/questions: > The submission queue (SQ) and completion queue (CQ) rings are shared > between the application and the kernel. This eliminates the need to > copy data back and forth to submit and complete IO. > > IO submissions use the io_uring_sqe data structure, and completions > are generated in the form of io_uring_sqe data structures. The SQ > ring is an index into the io_uring_sqe array, which makes it possible > to submit a batch of IOs without them being contiguous in the ring. > The CQ ring is always contiguous, as completion events are inherently > unordered and can point to any io_uring_iocb. > > Two new system calls are added for this: > > io_uring_setup(entries, iovecs, params) > Sets up a context for doing async IO. On success, returns a file > descriptor that the application can mmap to gain access to the > SQ ring, CQ ring, and io_uring_iocbs. Looking at the code, it would appear that the "iovecs" parameter doesn't actually exist. > io_uring_enter(fd, to_submit, min_complete, flags) > Initiates IO against the rings mapped to this fd, or waits for > them to complete, or both The behavior is controlled by the > parameters passed in. If 'min_complete' is non-zero, then we'll > try and submit new IO. If IORING_ENTER_GETEVENTS is set, the > kernel will wait for 'min_complete' events, if they aren't > already available. I feel like I'm missing something here. Rather than have the IORING_ENTER_GETEVENTS flag, why not just wait if min_complete > 0 ? 
Thanks, jon ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-15 16:55 ` Jens Axboe @ 2019-01-15 16:55 ` Jens Axboe 2019-01-15 17:26 ` Jens Axboe 1 sibling, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-15 16:55 UTC (permalink / raw) To: Jonathan Corbet Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On 1/15/19 9:51 AM, Jonathan Corbet wrote: > On Mon, 14 Jan 2019 19:55:20 -0700 > Jens Axboe <axboe@kernel.dk> wrote: > > So the [0/16] cover letter seems to have gone astray this time? It did go out, but I forgot to add a Subject line to it... https://marc.info/?l=linux-block&m=154752095709422&w=2 >> The submission queue (SQ) and completion queue (CQ) rings are shared >> between the application and the kernel. This eliminates the need to >> copy data back and forth to submit and complete IO. >> >> IO submissions use the io_uring_sqe data structure, and completions >> are generated in the form of io_uring_sqe data structures. The SQ >> ring is an index into the io_uring_sqe array, which makes it possible >> to submit a batch of IOs without them being contiguous in the ring. >> The CQ ring is always contiguous, as completion events are inherently >> unordered and can point to any io_uring_iocb. >> >> Two new system calls are added for this: >> >> io_uring_setup(entries, iovecs, params) >> Sets up a context for doing async IO. On success, returns a file >> descriptor that the application can mmap to gain access to the >> SQ ring, CQ ring, and io_uring_iocbs. > > Looking at the code, it would appear that the "iovecs" parameter doesn't > actually exist. Indeed, need to update that commit message. and io_uring_iocbs should now be io_uring_sqes. The iovec/file registration is done through io_uring_register(2). >> io_uring_enter(fd, to_submit, min_complete, flags) >> Initiates IO against the rings mapped to this fd, or waits for >> them to complete, or both The behavior is controlled by the >> parameters passed in. 
If 'min_complete' is non-zero, then we'll >> try and submit new IO. If IORING_ENTER_GETEVENTS is set, the >> kernel will wait for 'min_complete' events, if they aren't >> already available. > > I feel like I'm missing something here. Rather than have the > IORING_ENTER_GETEVENTS flag, why not just wait if min_complete > 0 ? For polled IO, it's useful to be able to check if we have events that can be readily reaped. If min_complete > 0, then you're asking the interface to wait/poll for these events. IORING_ENTER_GETEVENTS + min_complete == 0 is a valid combination to just reap events that are already completed. -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
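The interaction between IORING_ENTER_GETEVENTS and min_complete described in this reply can be modelled as a small pure function. This is only a sketch of the decision made by __io_uring_enter() and io_cqring_wait() as posted in patch 05/16, not kernel code: enter_would_sleep() and its parameters submitted and cq_pending are hypothetical names, and the flag value is copied from the uapi header in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define IORING_ENTER_GETEVENTS (1U << 0)

/*
 * Returns true if, under the logic in the posted patch, the caller would
 * go to sleep waiting for completions.
 *
 *   submitted   number of sqes actually submitted by this call
 *   cq_pending  number of completions already sitting in the CQ ring
 */
static bool enter_would_sleep(uint32_t flags, uint32_t to_submit,
			      uint32_t submitted, uint32_t min_complete,
			      uint32_t cq_pending)
{
	/* Without the flag, the completion side is never entered. */
	if (!(flags & IORING_ENTER_GETEVENTS))
		return false;

	/* Asked to submit but nothing went in: do not wait for events. */
	if (to_submit && !submitted)
		min_complete = 0;

	/* Events are already available; the application can reap them. */
	if (cq_pending)
		return false;

	/* min_complete == 0 means "just check", never "wait". */
	return min_complete != 0;
}
```

In this model, IORING_ENTER_GETEVENTS with min_complete == 0 never sleeps, which matches the "just reap events that are already completed" combination mentioned above.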
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-15 17:26 ` Jens Axboe @ 2019-01-15 17:26 ` Jens Axboe 0 siblings, 0 replies; 62+ messages in thread From: Jens Axboe @ 2019-01-15 17:26 UTC (permalink / raw) To: Jonathan Corbet Cc: linux-fsdevel, linux-aio, linux-block, linux-arch, hch, jmoyer, avi On 1/15/19 9:55 AM, Jens Axboe wrote: > On 1/15/19 9:51 AM, Jonathan Corbet wrote: >> On Mon, 14 Jan 2019 19:55:20 -0700 >> Jens Axboe <axboe@kernel.dk> wrote: >> >> So the [0/16] cover letter seems to have gone astray this time? > > It did go out, but I forgot to add a Subject line to it... > > https://marc.info/?l=linux-block&m=154752095709422&w=2 > > >>> The submission queue (SQ) and completion queue (CQ) rings are shared >>> between the application and the kernel. This eliminates the need to >>> copy data back and forth to submit and complete IO. >>> >>> IO submissions use the io_uring_sqe data structure, and completions >>> are generated in the form of io_uring_sqe data structures. The SQ >>> ring is an index into the io_uring_sqe array, which makes it possible >>> to submit a batch of IOs without them being contiguous in the ring. >>> The CQ ring is always contiguous, as completion events are inherently >>> unordered and can point to any io_uring_iocb. >>> >>> Two new system calls are added for this: >>> >>> io_uring_setup(entries, iovecs, params) >>> Sets up a context for doing async IO. On success, returns a file >>> descriptor that the application can mmap to gain access to the >>> SQ ring, CQ ring, and io_uring_iocbs. >> >> Looking at the code, it would appear that the "iovecs" parameter doesn't >> actually exist. > > Indeed, need to update that commit message. and io_uring_iocbs should > now be io_uring_sqes. > > The iovec/file registration is done through io_uring_register(2). Updated commit message which covers both this one, and the IORING_ENTER_GETEVENTS below. 
http://git.kernel.dk/cgit/linux-block/commit/?h=io_uring&id=d14a06629baef2b0701dbd01ac9c9066f73065ec -- Jens Axboe ^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 05/16] Add io_uring IO interface 2019-01-16 10:41 ` Arnd Bergmann @ 2019-01-16 10:41 ` Arnd Bergmann 2019-01-16 11:00 ` Arnd Bergmann 2019-01-16 15:12 ` Jens Axboe 2 siblings, 0 replies; 62+ messages in thread From: Arnd Bergmann @ 2019-01-16 10:41 UTC (permalink / raw) To: Jens Axboe Cc: Linux FS-devel Mailing List, linux-aio, linux-block, linux-arch, Christoph Hellwig, Jeff Moyer, Avi Kivity On Tue, Jan 15, 2019 at 3:55 AM Jens Axboe <axboe@kernel.dk> wrote: > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 3cf7b533b3d1..194e79c0032e 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -398,3 +398,5 @@ > 384 i386 arch_prctl sys_arch_prctl __ia32_compat_sys_arch_prctl > 385 i386 io_pgetevents sys_io_pgetevents __ia32_compat_sys_io_pgetevents > 386 i386 rseq sys_rseq __ia32_sys_rseq > +387 i386 io_uring_setup sys_io_uring_setup __ia32_compat_sys_io_uring_setup > +388 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index f0b1709a5ffb..453ff7a79002 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -343,6 +343,8 @@ > 332 common statx __x64_sys_statx > 333 common io_pgetevents __x64_sys_io_pgetevents > 334 common rseq __x64_sys_rseq > +335 common io_uring_setup __x64_sys_io_uring_setup > +336 common io_uring_enter __x64_sys_io_uring_enter In my series for the y2038 system calls, I'm trying to move to having the same numbers across all architectures. Unfortunately, that clashes with newly assigned numbers here, so one of us needs to pick new numbers. If my series gets merged without other changes to the numbers, the next available numbers on all architectures become 424 and 425. Could you use those here? 
> +SYSCALL_DEFINE2(io_uring_setup, u32, entries, > + struct io_uring_params __user *, params) > +{ > + return io_uring_setup(entries, params, false); > +} > + > +#ifdef CONFIG_COMPAT > +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries, > + struct io_uring_params __user *, params) > +{ > + return io_uring_setup(entries, params, true); > +} > +#endif The compat syscall has the same calling conventions as the native one here, so I think you can just use that directly. > +/* > + * IO submission data structure (Submission Queue Entry) > + */ > +struct io_uring_sqe { > + __u8 opcode; /* type of operation for this sqe */ > + __u8 flags; /* as of now unused */ > + __u16 ioprio; /* ioprio for the request */ > + __s32 fd; /* file descriptor to do IO on */ > + __u64 off; /* offset into file */ > + union { > + void *addr; /* buffer or iovecs */ > + __u64 __pad; > + }; It seems a bit unfortunate to keep the pointer field only almost compatible between 32-bit and 64-bit big-endian architectures, as that requires an in_compat_syscall() check whenever we access the pointer from the kernel. Could you use a __u64 field to store the pointer itself instead? > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index ab9d0e3c6d50..ee5e523564bb 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); > COND_SYSCALL(io_pgetevents); > COND_SYSCALL_COMPAT(io_getevents); > COND_SYSCALL_COMPAT(io_pgetevents); > +COND_SYSCALL(io_uring_setup); > +COND_SYSCALL(io_uring_enter); Unless you remove the compat_sys_io_uring_setup() definition, this should also have a corresponding COND_SYSCALL_COMPAT() entry. Arnd ^ permalink raw reply [flat|nested] 62+ messages in thread
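The "only almost compatible" remark about the addr/__pad union can be illustrated with a small userspace sketch. Here a uint32_t stands in for a 32-bit userspace pointer; union addr_slot32, slot_as_u64() and host_is_little_endian() are hypothetical names used only for this illustration and do not appear in the patch.

```c
#include <stdint.h>

/*
 * Stand-in for "union { void *addr; __u64 __pad; }" as seen by a 32-bit
 * process: the pointer occupies only the first four bytes of the slot.
 */
union addr_slot32 {
	uint32_t addr32;
	uint64_t pad;
};

/* Store a 32-bit "pointer" and read the whole slot back as 64 bits. */
static uint64_t slot_as_u64(uint32_t p)
{
	union addr_slot32 u;

	u.pad = 0;
	u.addr32 = p;
	return u.pad;
}

/* Runtime byte-order probe so the sketch behaves the same on any host. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}
```

On a little-endian host the 64-bit read returns the pointer value unchanged, while on a big-endian host the same bytes read back as the value shifted into the upper 32 bits. That difference is why a 64-bit kernel would need to know whether the submitter was a compat task before interpreting the field.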
* Re: [PATCH 05/16] Add io_uring IO interface
  2019-01-16 10:41 ` Arnd Bergmann
  2019-01-16 10:41   ` Arnd Bergmann
@ 2019-01-16 11:00   ` Arnd Bergmann
  2019-01-16 11:00     ` Arnd Bergmann
  2019-01-16 15:12   ` Jens Axboe
  2 siblings, 1 reply; 62+ messages in thread
From: Arnd Bergmann @ 2019-01-16 11:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux FS-devel Mailing List, linux-aio, linux-block, linux-arch,
      Christoph Hellwig, Jeff Moyer, Avi Kivity

On Wed, Jan 16, 2019 at 11:41 AM Arnd Bergmann <arnd@arndb.de> wrote:
> > +/*
> > + * IO submission data structure (Submission Queue Entry)
> > + */
> > +struct io_uring_sqe {
> > +        __u8    opcode;         /* type of operation for this sqe */
> > +        __u8    flags;          /* as of now unused */
> > +        __u16   ioprio;         /* ioprio for the request */
> > +        __s32   fd;             /* file descriptor to do IO on */
> > +        __u64   off;            /* offset into file */
> > +        union {
> > +                void    *addr;  /* buffer or iovecs */
> > +                __u64   __pad;
> > +        };
>
> It seems a bit unfortunate to keep the pointer field only
> almost compatible between 32-bit and 64-bit big-endian
> architectures, as that requires an in_compat_syscall()
> check whenever we access the pointer from the kernel.
>
> Could you use a __u64 field to store the pointer itself
> instead?

To clarify: Using the u64_to_user_ptr() helper function, you can
interpret a __u64 field in a uapi data structure directly as a
pointer. The user space side needs a similar wrapper though.

      Arnd

^ permalink raw reply [flat|nested] 62+ messages in thread
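The user-space wrapper mentioned above could look roughly like the
sketch below. It is only an illustration under stated assumptions:
struct example_sqe and the ptr_to_u64(), u64_to_ptr() and
example_sqe_set_buf() helpers are hypothetical names, not part of the
posted patch or of liburing, and the struct mirrors only the fields
quoted in this mail, with addr changed to a fixed-width integer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the quoted part of io_uring_sqe, with the
 * buffer pointer carried in a fixed-width integer instead of a union.
 */
struct example_sqe {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t ioprio;
	int32_t  fd;
	uint64_t off;
	uint64_t addr;	/* user pointer stored as a 64-bit integer */
};

/* Every field is fixed-width, so the size does not depend on pointer width. */
static_assert(sizeof(struct example_sqe) == 24, "unexpected layout");

/* Widen a pointer to 64 bits; going through uintptr_t avoids cast warnings. */
static inline uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Inverse of ptr_to_u64() for values that originated in this process. */
static inline void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}

/* Hypothetical helper: record the IO buffer address in the entry. */
static inline void example_sqe_set_buf(struct example_sqe *sqe, void *buf)
{
	sqe->addr = ptr_to_u64(buf);
}
```

With this shape a 32-bit and a 64-bit application fill in the same
bytes at the same offsets, which is the property the __u64 suggestion
is after.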
* Re: [PATCH 05/16] Add io_uring IO interface
  2019-01-16 10:41 ` Arnd Bergmann
  2019-01-16 10:41   ` Arnd Bergmann
  2019-01-16 11:00   ` Arnd Bergmann
@ 2019-01-16 15:12   ` Jens Axboe
  2019-01-16 15:12     ` Jens Axboe
  2019-01-16 15:16     ` Arnd Bergmann
  2 siblings, 2 replies; 62+ messages in thread
From: Jens Axboe @ 2019-01-16 15:12 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linux FS-devel Mailing List, linux-aio, linux-block, linux-arch,
      Christoph Hellwig, Jeff Moyer, Avi Kivity

On 1/16/19 3:41 AM, Arnd Bergmann wrote:
> On Tue, Jan 15, 2019 at 3:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>> index 3cf7b533b3d1..194e79c0032e 100644
>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>> @@ -398,3 +398,5 @@
>>  384  i386  arch_prctl      sys_arch_prctl      __ia32_compat_sys_arch_prctl
>>  385  i386  io_pgetevents   sys_io_pgetevents   __ia32_compat_sys_io_pgetevents
>>  386  i386  rseq            sys_rseq            __ia32_sys_rseq
>> +387  i386  io_uring_setup  sys_io_uring_setup  __ia32_compat_sys_io_uring_setup
>> +388  i386  io_uring_enter  sys_io_uring_enter  __ia32_sys_io_uring_enter
>> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
>> index f0b1709a5ffb..453ff7a79002 100644
>> --- a/arch/x86/entry/syscalls/syscall_64.tbl
>> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
>> @@ -343,6 +343,8 @@
>>  332  common  statx           __x64_sys_statx
>>  333  common  io_pgetevents   __x64_sys_io_pgetevents
>>  334  common  rseq            __x64_sys_rseq
>> +335  common  io_uring_setup  __x64_sys_io_uring_setup
>> +336  common  io_uring_enter  __x64_sys_io_uring_enter
>
> In my series for the y2038 system calls, I'm trying to move to having the
> same numbers across all architectures. Unfortunately, that clashes
> with newly assigned numbers here, so one of us needs to pick new
> numbers.
>
> If my series gets merged without other changes to the numbers, the next
> available numbers on all architectures become 424 and 425.
>
> Could you use those here?

Yeah that's totally fine, I don't really care what the numbers end up
being, that side isn't fixed for me.

>> +SYSCALL_DEFINE2(io_uring_setup, u32, entries,
>> +                struct io_uring_params __user *, params)
>> +{
>> +        return io_uring_setup(entries, params, false);
>> +}
>> +
>> +#ifdef CONFIG_COMPAT
>> +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries,
>> +                       struct io_uring_params __user *, params)
>> +{
>> +        return io_uring_setup(entries, params, true);
>> +}
>> +#endif
>
> The compat syscall has the same calling conventions as the
> native one here, so I think you can just use that directly.

Not sure I understand what you mean here. I need to know if it's the
compat one, hence 'true' vs 'false', so I know what the size of the user
pointers/structs are.

>> +/*
>> + * IO submission data structure (Submission Queue Entry)
>> + */
>> +struct io_uring_sqe {
>> +        __u8    opcode;         /* type of operation for this sqe */
>> +        __u8    flags;          /* as of now unused */
>> +        __u16   ioprio;         /* ioprio for the request */
>> +        __s32   fd;             /* file descriptor to do IO on */
>> +        __u64   off;            /* offset into file */
>> +        union {
>> +                void    *addr;  /* buffer or iovecs */
>> +                __u64   __pad;
>> +        };
>
> It seems a bit unfortunate to keep the pointer field only
> almost compatible between 32-bit and 64-bit big-endian
> architectures, as that requires an in_compat_syscall()
> check whenever we access the pointer from the kernel.
>
> Could you use a __u64 field to store the pointer itself
> instead?

I feel like I'm missing something here, we'll still need the compat code
on the kernel side for 32-bit app on 64-bit kernel, so what would we
solve by making this an __u64?

>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index ab9d0e3c6d50..ee5e523564bb 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents);
>>  COND_SYSCALL(io_pgetevents);
>>  COND_SYSCALL_COMPAT(io_getevents);
>>  COND_SYSCALL_COMPAT(io_pgetevents);
>> +COND_SYSCALL(io_uring_setup);
>> +COND_SYSCALL(io_uring_enter);
>
> Unless you remove the compat_sys_io_uring_setup() definition,
> this should also have a corresponding COND_SYSCALL_COMPAT()
> entry.

Gotcha, thanks! I'll make that change.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: [PATCH 05/16] Add io_uring IO interface
  2019-01-16 15:12 ` Jens Axboe
  2019-01-16 15:12   ` Jens Axboe
@ 2019-01-16 15:16   ` Arnd Bergmann
  2019-01-16 15:16     ` Arnd Bergmann
  2019-01-16 15:25     ` Jens Axboe
  1 sibling, 2 replies; 62+ messages in thread
From: Arnd Bergmann @ 2019-01-16 15:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linux FS-devel Mailing List, linux-aio, linux-block, linux-arch,
      Christoph Hellwig, Jeff Moyer, Avi Kivity

On Wed, Jan 16, 2019 at 4:12 PM Jens Axboe <axboe@kernel.dk> wrote:
> On 1/16/19 3:41 AM, Arnd Bergmann wrote:
> > On Tue, Jan 15, 2019 at 3:55 AM Jens Axboe <axboe@kernel.dk> wrote:
> >> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> >> index 3cf7b533b3d1..194e79c0032e 100644
> >> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> >> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> >> +SYSCALL_DEFINE2(io_uring_setup, u32, entries,
> >> +                struct io_uring_params __user *, params)
> >> +{
> >> +        return io_uring_setup(entries, params, false);
> >> +}
> >> +
> >> +#ifdef CONFIG_COMPAT
> >> +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries,
> >> +                       struct io_uring_params __user *, params)
> >> +{
> >> +        return io_uring_setup(entries, params, true);
> >> +}
> >> +#endif
> >
> > The compat syscall has the same calling conventions as the
> > native one here, so I think you can just use that directly.
>
> Not sure I understand what you mean here. I need to know if it's the
> compat one, hence 'true' vs 'false', so I know what the size of the user
> pointers/structs are.

My mistake, I missed the true/false difference between the two
functions.

> >> +/*
> >> + * IO submission data structure (Submission Queue Entry)
> >> + */
> >> +struct io_uring_sqe {
> >> +        __u8    opcode;         /* type of operation for this sqe */
> >> +        __u8    flags;          /* as of now unused */
> >> +        __u16   ioprio;         /* ioprio for the request */
> >> +        __s32   fd;             /* file descriptor to do IO on */
> >> +        __u64   off;            /* offset into file */
> >> +        union {
> >> +                void    *addr;  /* buffer or iovecs */
> >> +                __u64   __pad;
> >> +        };
> >
> > It seems a bit unfortunate to keep the pointer field only
> > almost compatible between 32-bit and 64-bit big-endian
> > architectures, as that requires an in_compat_syscall()
> > check whenever we access the pointer from the kernel.
> >
> > Could you use a __u64 field to store the pointer itself
> > instead?
>
> I feel like I'm missing something here, we'll still need the compat code
> on the kernel side for 32-bit app on 64-bit kernel, so what would we
> solve by making this an __u64?

It means you don't have to define a compat_io_uring_sqe
structure with a compat_uptr_t member in it.

      Arnd

^ permalink raw reply [flat|nested] 62+ messages in thread
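A small user-space sketch of why the union is only "almost" compatible
may help here. It rests on assumptions: uint32_t stands in for a 32-bit
user pointer, and union compat_addr plus both functions are
hypothetical names used purely for illustration. A 32-bit application
writes its pointer into the first four bytes of the 8-byte slot; the
sketch shows what a 64-bit reader then sees in the whole slot.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the addr/__pad union as a 32-bit app sees it. */
union compat_addr {
	uint32_t addr32;	/* stand-in for a 32-bit user pointer */
	uint64_t pad;		/* the full 8-byte slot */
};

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/*
 * Store a 32-bit "pointer" the way a 32-bit application would, then read
 * the whole slot back as a 64-bit value, as a 64-bit reader would.
 */
static uint64_t read_slot_as_u64(uint32_t user_ptr)
{
	union compat_addr u;

	u.pad = 0;		/* zero the slot so the unused half is defined */
	u.addr32 = user_ptr;	/* occupies the first 4 bytes of the slot */
	return u.pad;
}
```

On a little-endian host read_slot_as_u64(0x1000) gives back 0x1000,
while on a big-endian host the value ends up in the upper half of the
slot, so the two layouts disagree. A single __u64 field written as an
integer by every application avoids that difference.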
* Re: [PATCH 05/16] Add io_uring IO interface
  2019-01-16 15:16 ` Arnd Bergmann
  2019-01-16 15:16   ` Arnd Bergmann
@ 2019-01-16 15:25   ` Jens Axboe
  2019-01-16 15:25     ` Jens Axboe
  1 sibling, 1 reply; 62+ messages in thread
From: Jens Axboe @ 2019-01-16 15:25 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linux FS-devel Mailing List, linux-aio, linux-block, linux-arch,
      Christoph Hellwig, Jeff Moyer, Avi Kivity

On 1/16/19 8:16 AM, Arnd Bergmann wrote:
> On Wed, Jan 16, 2019 at 4:12 PM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/16/19 3:41 AM, Arnd Bergmann wrote:
>>> On Tue, Jan 15, 2019 at 3:55 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
>>>> index 3cf7b533b3d1..194e79c0032e 100644
>>>> --- a/arch/x86/entry/syscalls/syscall_32.tbl
>>>> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
>>>> +SYSCALL_DEFINE2(io_uring_setup, u32, entries,
>>>> +                struct io_uring_params __user *, params)
>>>> +{
>>>> +        return io_uring_setup(entries, params, false);
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_COMPAT
>>>> +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries,
>>>> +                       struct io_uring_params __user *, params)
>>>> +{
>>>> +        return io_uring_setup(entries, params, true);
>>>> +}
>>>> +#endif
>>>
>>> The compat syscall has the same calling conventions as the
>>> native one here, so I think you can just use that directly.
>>
>> Not sure I understand what you mean here. I need to know if it's the
>> compat one, hence 'true' vs 'false', so I know what the size of the user
>> pointers/structs are.
>
> My mistake, I missed the true/false difference between the two
> functions.
>
>>>> +/*
>>>> + * IO submission data structure (Submission Queue Entry)
>>>> + */
>>>> +struct io_uring_sqe {
>>>> +        __u8    opcode;         /* type of operation for this sqe */
>>>> +        __u8    flags;          /* as of now unused */
>>>> +        __u16   ioprio;         /* ioprio for the request */
>>>> +        __s32   fd;             /* file descriptor to do IO on */
>>>> +        __u64   off;            /* offset into file */
>>>> +        union {
>>>> +                void    *addr;  /* buffer or iovecs */
>>>> +                __u64   __pad;
>>>> +        };
>>>
>>> It seems a bit unfortunate to keep the pointer field only
>>> almost compatible between 32-bit and 64-bit big-endian
>>> architectures, as that requires an in_compat_syscall()
>>> check whenever we access the pointer from the kernel.
>>>
>>> Could you use a __u64 field to store the pointer itself
>>> instead?
>>
>> I feel like I'm missing something here, we'll still need the compat code
>> on the kernel side for 32-bit app on 64-bit kernel, so what would we
>> solve by making this an __u64?
>
> It means you don't have to define a compat_io_uring_sqe
> structure with a compat_uptr_t member in it.

Yeah, I finally got it, I'm making the change. Thanks!

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 62+ messages in thread