* [GIT PULL] io_uring changes for 5.9-rc1
@ 2020-08-02 21:41 Jens Axboe
  2020-08-03 20:48 ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread

From: Jens Axboe @ 2020-08-02 21:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 9735 bytes --]

Hi Linus,

Lots of cleanups in here, hardening the code and/or making it easier to
read, and fixing bugs, but a core feature/change too: adding support for
real async buffered reads. With the latter in place, we just need async
buffered write support and we're done relying on kthreads for the fast
path. In detail:

- Cleanup how memory accounting is done on ring setup/free (Bijan)

- sq array offset calculation fixup (Dmitry)

- Consistently handle blocking off O_DIRECT submission path (me)

- Support proper async buffered reads, instead of relying on kthread
  offload for that. This uses the page waitqueue to drive retries from
  task_work, like we handle poll based retry. (me)

- IO completion optimizations (me)

- Fix race with accounting and ring fd install (me)

- Support EPOLLEXCLUSIVE (Jiufei)

- Get rid of the io_kiocb unionizing, made possible by shrinking other
  bits (Pavel)

- Completion side cleanups (Pavel)

- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

- Request environment grabbing cleanups (Pavel)

- File and socket read/write cleanups (Pavel)

- Improve kiocb_set_rw_flags() (Pavel)

- Tons of fixes and cleanups (Pavel)

- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)

This will throw a few merge conflicts. One is due to the IOCB_NOIO
addition that happened late in the 5.8-rc cycle, the other is due to a
change in for-5.9/block. Both are trivial to fix up; I'm attaching my
merge resolution from when I pulled it in locally.

Please pull!
The following changes since commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5:

  io_uring: fix lockup in io_fail_links() (2020-07-24 12:51:33 -0600)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/for-5.9/io_uring-20200802

for you to fetch changes up to fa15bafb71fd7a4d6018dae87cfaf890fd4ab47f:

  io_uring: flip if handling after io_setup_async_rw (2020-08-01 11:02:57 -0600)

----------------------------------------------------------------
for-5.9/io_uring-20200802
----------------------------------------------------------------
Bijan Mottahedeh (4):
      io_uring: add wrappers for memory accounting
      io_uring: rename ctx->account_mem field
      io_uring: report pinned memory usage
      io_uring: separate reporting of ring pages from registered pages

Dan Carpenter (1):
      io_uring: fix a use after free in io_async_task_func()

Dmitry Vyukov (1):
      io_uring: fix sq array offset calculation

Jens Axboe (31):
      block: provide plug based way of signaling forced no-wait semantics
      io_uring: always plug for any number of IOs
      io_uring: catch -EIO from buffered issue request failure
      io_uring: re-issue block requests that failed because of resources
      mm: allow read-ahead with IOCB_NOWAIT set
      mm: abstract out wake_page_match() from wake_page_function()
      mm: add support for async page locking
      mm: support async buffered reads in generic_file_buffered_read()
      fs: add FMODE_BUF_RASYNC
      block: flag block devices as supporting IOCB_WAITQ
      xfs: flag files as supporting buffered async reads
      btrfs: flag files as supporting buffered async reads
      mm: add kiocb_wait_page_queue_init() helper
      io_uring: support true async buffered reads, if file provides it
      Merge branch 'async-buffered.8' into for-5.9/io_uring
      io_uring: provide generic io_req_complete() helper
      io_uring: add 'io_comp_state' to struct io_submit_state
      io_uring: pass down completion state on the issue side
      io_uring: pass in completion state to appropriate issue side handlers
      io_uring: enable READ/WRITE to use deferred completions
      io_uring: use task_work for links if possible
      Merge branch 'io_uring-5.8' into for-5.9/io_uring
      io_uring: clean up io_kill_linked_timeout() locking
      Merge branch 'io_uring-5.8' into for-5.9/io_uring
      io_uring: abstract out task work running
      io_uring: use new io_req_task_work_add() helper throughout
      io_uring: only call kfree() for a non-zero pointer
      io_uring: get rid of __req_need_defer()
      io_uring: remove dead 'ctx' argument and move forward declaration
      Merge branch 'io_uring-5.8' into for-5.9/io_uring
      io_uring: don't touch 'ctx' after installing file descriptor

Jiufei Xue (2):
      io_uring: change the poll type to be 32-bits
      io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior

Pavel Begunkov (90):
      io_uring: remove setting REQ_F_MUST_PUNT in rw
      io_uring: remove REQ_F_MUST_PUNT
      io_uring: set @poll->file after @poll init
      io_uring: kill NULL checks for submit state
      io_uring: fix NULL-mm for linked reqs
      io-wq: compact io-wq flags numbers
      io-wq: return next work from ->do_work() directly
      io_uring: fix req->work corruption
      io_uring: fix punting req w/o grabbed env
      io_uring: fix feeding io-wq with uninit reqs
      io_uring: don't mark link's head for_async
      io_uring: fix missing io_grab_files()
      io_uring: fix refs underflow in io_iopoll_queue()
      io_uring: remove inflight batching in free_many()
      io_uring: dismantle req early and remove need_iter
      io_uring: batch-free linked requests as well
      io_uring: cosmetic changes for batch free
      io_uring: kill REQ_F_LINK_NEXT
      io_uring: clean up req->result setting by rw
      io_uring: do task_work_run() during iopoll
      io_uring: fix iopoll -EAGAIN handling
      io_uring: fix missing wake_up io_rw_reissue()
      io_uring: deduplicate freeing linked timeouts
      io_uring: replace find_next() out param with ret
      io_uring: kill REQ_F_TIMEOUT
      io_uring: kill REQ_F_TIMEOUT_NOSEQ
      io_uring: fix potential use after free on fallback request free
      io_uring: don't pass def into io_req_work_grab_env
      io_uring: do init work in grab_env()
      io_uring: factor out grab_env() from defer_prep()
      io_uring: do grab_env() just before punting
      io_uring: don't fail iopoll requeue without ->mm
      io_uring: fix NULL mm in io_poll_task_func()
      io_uring: simplify io_async_task_func()
      io_uring: optimise io_req_find_next() fast check
      io_uring: fix missing ->mm on exit
      io_uring: fix mis-refcounting linked timeouts
      io_uring: keep queue_sqe()'s fail path separately
      io_uring: fix lost cqe->flags
      io_uring: don't delay iopoll'ed req completion
      io_uring: fix stopping iopoll'ing too early
      io_uring: briefly loose locks while reaping events
      io_uring: partially inline io_iopoll_getevents()
      io_uring: remove nr_events arg from iopoll_check()
      io_uring: don't burn CPU for iopoll on exit
      io_uring: rename sr->msg into umsg
      io_uring: use more specific type in rcv/snd msg cp
      io_uring: extract io_sendmsg_copy_hdr()
      io_uring: replace rw->task_work with rq->task_work
      io_uring: simplify io_req_map_rw()
      io_uring: add a helper for async rw iovec prep
      io_uring: follow **iovec idiom in io_import_iovec
      io_uring: share completion list w/ per-op space
      io_uring: rename ctx->poll into ctx->iopoll
      io_uring: use inflight_entry list for iopoll'ing
      io_uring: use completion list for CQ overflow
      io_uring: add req->timeout.list
      io_uring: remove init for unused list
      io_uring: use non-intrusive list for defer
      io_uring: remove sequence from io_kiocb
      io_uring: place cflags into completion data
      io_uring: inline io_req_work_grab_env()
      io_uring: remove empty cleanup of OP_OPEN* reqs
      io_uring: alloc ->io in io_req_defer_prep()
      io_uring/io-wq: move RLIMIT_FSIZE to io-wq
      io_uring: simplify file ref tracking in submission state
      io_uring: indent left {send,recv}[msg]()
      io_uring: remove extra checks in send/recv
      io_uring: don't forget cflags in io_recv()
      io_uring: free selected-bufs if error'ed
      io_uring: move BUFFER_SELECT check into *recv[msg]
      io_uring: extract io_put_kbuf() helper
      io_uring: don't open-code recv kbuf managment
      io_uring: don't miscount pinned memory
      io_uring: return locked and pinned page accounting
      tasks: add put_task_struct_many()
      io_uring: batch put_task_struct()
      io_uring: don't do opcode prep twice
      io_uring: deduplicate io_grab_files() calls
      io_uring: mark ->work uninitialised after cleanup
      io_uring: fix missing io_queue_linked_timeout()
      io-wq: update hash bits
      io_uring: de-unionise io_kiocb
      io_uring: deduplicate __io_complete_rw()
      io_uring: fix racy overflow count reporting
      io_uring: fix stalled deferred requests
      io_uring: consolidate *_check_overflow accounting
      io_uring: get rid of atomic FAA for cq_timeouts
      fs: optimise kiocb_set_rw_flags()
      io_uring: flip if handling after io_setup_async_rw

Randy Dunlap (1):
      io_uring: fix function args for !CONFIG_NET

Xiaoguang Wang (1):
      io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works

 block/blk-core.c              |    6 +
 fs/block_dev.c                |    2 +-
 fs/btrfs/file.c               |    2 +-
 fs/io-wq.c                    |   14 +-
 fs/io-wq.h                    |   11 +-
 fs/io_uring.c                 | 2588 +++++++++++++++++++++++------------------
 fs/xfs/xfs_file.c             |    2 +-
 include/linux/blkdev.h        |    1 +
 include/linux/fs.h            |   26 +-
 include/linux/pagemap.h       |   75 ++
 include/linux/sched/task.h    |    6 +
 include/uapi/linux/io_uring.h |    4 +-
 mm/filemap.c                  |  110 +-
 tools/io_uring/liburing.h     |    6 +-
 14 files changed, 1658 insertions(+), 1195 deletions(-)

-- 
Jens Axboe

[-- Attachment #2: merge.txt --]
[-- Type: text/plain, Size: 3460 bytes --]

commit 32a5169a5562db6a09a2d85164e0079913ecc227
Merge: 5fb023fb414a fa15bafb71fd
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sun Aug 2 10:43:35 2020 -0600

    Merge branch 'for-5.9/io_uring' into test

    * for-5.9/io_uring: (127 commits)
      io_uring: flip if handling after io_setup_async_rw
      fs: optimise kiocb_set_rw_flags()
      io_uring: don't touch 'ctx' after installing file descriptor
      io_uring: get rid of atomic FAA for cq_timeouts
      io_uring: consolidate *_check_overflow accounting
      io_uring: fix stalled deferred requests
      io_uring: fix racy overflow count reporting
      io_uring: deduplicate __io_complete_rw()
      io_uring: de-unionise io_kiocb
      io-wq: update hash bits
      io_uring: fix missing io_queue_linked_timeout()
      io_uring: mark ->work uninitialised after cleanup
      io_uring: deduplicate io_grab_files() calls
      io_uring: don't do opcode prep twice
      io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
      io_uring: batch put_task_struct()
      tasks: add put_task_struct_many()
      io_uring: return locked and pinned page accounting
      io_uring: don't miscount pinned memory
      io_uring: don't open-code recv kbuf managment
      ...

    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --cc block/blk-core.c
index 93104c7470e8,62a4904db921..d9d632639bd1
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@@ -956,13 -952,30 +956,18 @@@ static inline blk_status_t blk_check_zo
  	return BLK_STS_OK;
  }
  
 -static noinline_for_stack bool
 -generic_make_request_checks(struct bio *bio)
 +static noinline_for_stack bool submit_bio_checks(struct bio *bio)
  {
- 	struct request_queue *q;
- 	int nr_sectors = bio_sectors(bio);
+ 	struct request_queue *q = bio->bi_disk->queue;
  	blk_status_t status = BLK_STS_IOERR;
+ 	struct blk_plug *plug;
- 	char b[BDEVNAME_SIZE];
  
  	might_sleep();
  
 -	q = bio->bi_disk->queue;
 -	if (unlikely(!q)) {
 -		printk(KERN_ERR
 -		       "generic_make_request: Trying to access "
 -		       "nonexistent block-device %s (%Lu)\n",
 -			bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
 -		goto end_io;
 -	}
 -
+ 	plug = blk_mq_plug(q, bio);
+ 	if (plug && plug->nowait)
+ 		bio->bi_opf |= REQ_NOWAIT;
+ 
  	/*
  	 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
  	 * if queue is not a request based queue.
diff --cc include/linux/fs.h
index 41cd993ec0f6,e535543d31d9..b7f1f1b7d691
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@@ -315,7 -318,8 +318,9 @@@ enum rw_hint
  #define IOCB_SYNC		(1 << 5)
  #define IOCB_WRITE		(1 << 6)
  #define IOCB_NOWAIT		(1 << 7)
+ /* iocb->ki_waitq is valid */
+ #define IOCB_WAITQ		(1 << 8)
 +#define IOCB_NOIO		(1 << 9)
  
  struct kiocb {
  	struct file *ki_filp;

diff --cc mm/filemap.c
index 385759c4ce4b,a5b1fa8f7ce4..4e39c1f4c7d9
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@@ -2028,8 -2044,6 +2044,8 @@@ find_page
  		page = find_get_page(mapping, index);
  		if (!page) {
- 			if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
++			if (iocb->ki_flags & IOCB_NOIO)
+ 				goto would_block;
  			page_cache_sync_readahead(mapping,
  					ra, filp,
  					index, last_index - index);
@@@ -2164,7 -2185,7 +2191,7 @@@ page_not_up_to_date_locked
  		}
  
  readpage:
- 		if (iocb->ki_flags & IOCB_NOIO) {
 -		if (iocb->ki_flags & IOCB_NOWAIT) {
++		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
  			unlock_page(page);
  			put_page(page);
  			goto would_block;

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-02 21:41 [GIT PULL] io_uring changes for 5.9-rc1 Jens Axboe
@ 2020-08-03 20:48 ` Linus Torvalds
  2020-08-03 20:53   ` Linus Torvalds
  2020-08-03 22:30   ` Jens Axboe
  0 siblings, 2 replies; 13+ messages in thread

From: Linus Torvalds @ 2020-08-03 20:48 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, linux-kernel

On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Lots of cleanups in here, hardening the code and/or making it easier to
> read and fixing bugs, but a core feature/change too adding support for
> real async buffered reads. With the latter in place, we just need
> buffered write async support and we're done relying on kthreads for the
> fast path. In detail:

That async buffered reads handling of the page locking flag is a mess,
and I'm really happy I committed my page locking scalability change
early, so that the conflicts there caught it.

Re-using the page bit waitqueue types and exporting them?

That part is fine, I guess, particularly since it came from the
wait_bit_key thing and has a smell of being generic.

Taking a random part of wake_page_function(), and calling it
"wake_page_match()" even though that's not at all what it does?

Not ok.

Adding random kiocb helper functions to a core header file, when they
are only used in one place, and when they only make sense in that one
place?

Not ok.

When the function is called "wake_page_match()", you'd expect it
matches the wake page information, wouldn't it?

Yeah, it did that. And then it also checked whether the bit we're
waiting on had been set again, because everybody ostensibly wanted
that. Except they don't any more, and that's not what the name really
implied anyway.

And kiocb_wait_page_queue_init() has absolutely zero business being in
<linux/filemap.h>. There are absolutely no valid uses of that thing
outside of the one place that calls it.

I tried to fix up the things I could.
That said, like a lot of io_uring code, this is some seriously opaque
code. You say you've done a lot of cleanups, but I'm not convinced
those cleanups are in any way offsetting adding yet another union (how
many bugs did the last one add?) and a magic flag of "use this part of
the union" now.

And I don't know what loads you use for testing that thing, or what
happens when the "lock_page_async()" case actually fails to lock, and
just calls back the io_async_buf_func() wakeup function when the page
has unlocked...

That function doesn't actually lock the page either, but does the task
work. I hope that work then knows to do the right thing, but it's
really opaque and hard to follow.

Anyway, I'm not entirely happy with doing these kinds of changes in
the merge resolution, but the alternative was to not do the pull at
all, and require you to do a lot of cleanups before I would pull it.
Maybe I should have done that.

So this is a slightly grumpy email about how io_uring is (a) still
making me very nervous about a very lackadaisical approach to things,
and having the codepaths so obscure that I'm not convinced it's not
horribly buggy. And (b) I fixed things up without really being able to
test them. I tested that the _normal_ paths still seem to work fine,
but..

I really think that whole thing needs a lot of comments, particularly
around the whole io_rw_should_retry() area.

A big and legible comment about how it will be caught by the
generic_file_buffered_read() page locking code, how the two cases
differ (it might get caught by the "I'm just waiting for it to be
unlocked" case, but it could *also* get caught by the "lock page now"
case), and how it continues the request.

As it is, it bounces between the generic code and very io_uring
specific code in strange and not easy to follow ways.

I've pushed out my merge of this thing, but you might also want to
take a look at commit 2a9127fcf229 ("mm: rewrite
wait_on_page_bit_common() logic").
In particular, note the comment about how there's no point in even
testing the page bit any more when you get woken up.

I left that

        if (test_bit(key->bit_nr, &key->page->flags))
                return -1;

logic in io_async_buf_func() (but it's not in "wake_page_match()" any
more), but I suspect it's bogus and pointless, for the same reason
that it isn't done for normal page waits now. Maybe it's better to
just queue the actual work regardless; it will then be caught in the
_real_ lock_page() or whatever it ends up doing - and if it only
really wants to see the "uptodate" bit being set, and was just waiting
for IO to finish, then it never really cared about the page lock bit
at all, it just wanted to be notified about IO being done.

So this was a really long email to tell you - again - that I'm not
happy with how fragile io_uring is, and how the code seems to be
almost intentionally written to *be* fragile. Complex and hard to
understand, and as a result it has had a fairly high rate of fairly
nasty bugs.

I'm hoping this isn't going to be yet another case of "nasty bugs
because of complexity and a total disregard for explaining what is
going on".

              Linus
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 20:48 ` Linus Torvalds
@ 2020-08-03 20:53   ` Linus Torvalds
  2020-08-03 21:10     ` Konstantin Ryabitsev
  0 siblings, 1 reply; 13+ messages in thread

From: Linus Torvalds @ 2020-08-03 20:53 UTC (permalink / raw)
To: Jens Axboe, Konstantin Ryabitsev; +Cc: io-uring, linux-kernel

On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I've pushed out my merge of this thing [..]

It seems I'm not the only one unhappy with the pull request.

For some reason I also don't see pr-tracker-bot being all happy and
excited about it. I wonder why.

              Linus
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 20:53 ` Linus Torvalds
@ 2020-08-03 21:10   ` Konstantin Ryabitsev
  2020-08-03 22:31     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread

From: Konstantin Ryabitsev @ 2020-08-03 21:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jens Axboe, io-uring, linux-kernel

On Mon, Aug 03, 2020 at 01:53:12PM -0700, Linus Torvalds wrote:
> On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I've pushed out my merge of this thing [..]
>
> It seems I'm not the only one unhappy with the pull request.
>
> For some reason I also don't see pr-tracker-bot being all happy and
> excited about it. I wonder why.

My guess is that it's because the body consists of two text/plain
MIME parts and Python returned the merge.txt part first, where we
didn't find what we were looking for.

I'll see if I can teach it to walk all text/plain parts looking for
magic git pull strings instead of giving up after the first one.

-K
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 21:10 ` Konstantin Ryabitsev
@ 2020-08-03 22:31   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread

From: Jens Axboe @ 2020-08-03 22:31 UTC (permalink / raw)
To: Konstantin Ryabitsev, Linus Torvalds; +Cc: io-uring, linux-kernel

On 8/3/20 3:10 PM, Konstantin Ryabitsev wrote:
> On Mon, Aug 03, 2020 at 01:53:12PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 3, 2020 at 1:48 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> I've pushed out my merge of this thing [..]
>>
>> It seems I'm not the only one unhappy with the pull request.
>>
>> For some reason I also don't see pr-tracker-bot being all happy and
>> excited about it. I wonder why.
>
> My guess it's because the body consists of two text/plain MIME-parts and
> Python returned the merge.txt part first, where we didn't find what we
> were looking for.
>
> I'll see if I can teach it to walk all text/plain parts looking for
> magic git pull strings instead of giving up after the first one.

Thanks, I was a bit puzzled on that one too, and this time it definitely
wasn't because the tag wasn't there.

In terms of attachments, I'm usually a fan of inlining, but it seemed
cleaner to me to attach the merge resolution, as there's already a ton
of other stuff in that email.

-- 
Jens Axboe
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 20:48 ` Linus Torvalds
  2020-08-03 20:53   ` Linus Torvalds
@ 2020-08-03 22:30   ` Jens Axboe
  2020-08-03 23:18     ` Jens Axboe
  1 sibling, 1 reply; 13+ messages in thread

From: Jens Axboe @ 2020-08-03 22:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, linux-kernel

On 8/3/20 2:48 PM, Linus Torvalds wrote:
> On Sun, Aug 2, 2020 at 2:41 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Lots of cleanups in here, hardening the code and/or making it easier to
>> read and fixing bugs, but a core feature/change too adding support for
>> real async buffered reads. With the latter in place, we just need
>> buffered write async support and we're done relying on kthreads for the
>> fast path. In detail:
>
> That async buffered reads handling of the page locking flag is a
> mess, and I'm really happy I committed my page locking scalability
> change early, so that the conflicts there caught it.
>
> Re-using the page bit waitqueue types and exporting them?
>
> That part is fine, I guess, particularly since it came from the
> wait_bit_key thing and has a smell of being generic.
>
> Taking a random part of wake_page_function(), and calling it
> "wake_page_match()" even though that's not at all what it does?
>
> Not ok.

OK, I actually thought it was kind of nice and better than having that
code duplicated in two spots.

> Adding random kiocb helper functions to a core header file, when they
> are only used in one place, and when they only make sense in that one
> place?
>
> Not ok.

I'll move that into io_uring instead.

> When the function is called "wake_page_match()", you'd expect it
> matches the wake page information, wouldn't it?
>
> Yeah, it did that. And then it also checked whether the bit we're
> waiting on had been set again, because everybody ostensibly wanted
> that. Except they don't any more, and that's not what the name really
> implied anyway.
>
> And kiocb_wait_page_queue_init() has absolutely zero business being in
> <linux/filemap.h>. There are absolutely no valid uses of that thing
> outside of the one place that calls it.
>
> I tried to fix up the things I could.

Thanks! As mentioned, I'll prep a cleanup patch that moves
kiocb_wait_page_queue_init() out of there.

> That said, like a lot of io_uring code, this is some seriously opaque
> code. You say you've done a lot of cleanups, but I'm not convinced
> those cleanups are in any way offsetting adding yet another union (how
> many bugs did the last one add?) and a magic flag of "use this part of
> the union" now.

I had to think a bit about what you are referring to here, but I guess
it's the iocb part. And yes, it's not ideal, but until we support polled
IO with buffered IO, there's no overlap in the use case. And I don't see
us ever doing that.

> And I don't know what loads you use for testing that thing, or what
> happens when the "lock_page_async()" case actually fails to lock, and
> just calls back the io_async_buf_func() wakeup function when the page
> has unlocked...
>
> That function doesn't actually lock the page either, but does the task
> work. I hope that work then knows to do the right thing, but it's
> really opaque and hard to follow.

The task work retries the whole thing, so it'll go through the normal
page cache read path again with all that that entails. We only really
ever use the callback to tell us when it's a good idea to retry again,
there's no other retained state there at all. I didn't realize that
part wasn't straightforward, so I'll add some comments as well
explaining how that code flow works.

It's seen a good amount of testing, both from myself and also from
others. The postgres IO rewrite has been putting it through its paces,
and outside of a few initial issues months ago, it's been rock solid.
> Anyway, I'm not entirely happy with doing these kinds of changes in
> the merge resolution, but the alternative was to not do the pull at
> all, and require you to do a lot of cleanups before I would pull it.
> Maybe I should have done that.
>
> So this is a slightly grumpy email about how io_uring is (a) still
> making me very nervous about a very lackadaisical approach to things,
> and having the codepaths so obscure that I'm not convinced it's not
> horribly buggy. And (b) I fixed things up without really being able to
> test them. I tested that the _normal_ paths still seem to work fine,
> but..

I need to do a better job at commenting these parts, obviously. And
while nothing is perfect, and we're definitely not perfect yet, the
general trend is definitely strongly towards getting rid of odd states
through flags and unifying more of the code, with tons of
fixes/cleanups that make things easier to read and verify...

> I really think that whole thing needs a lot of comments, particularly
> around the whole io_rw_should_retry() area.
>
> A big and legible comment about how it will be caught by the
> generic_file_buffered_read() page locking code, how the two cases
> differ (it might get caught by the "I'm just waiting for it to be
> unlocked" case, but it could *also* get caught by the "lock page now"
> case), and how it continues the request.

Noted, I'll write that up.

> As it is, it bounces between the generic code and very io_uring
> specific code in strange and not easy to follow ways.
>
> I've pushed out my merge of this thing, but you might also want to
> take a look at commit 2a9127fcf229 ("mm: rewrite
> wait_on_page_bit_common() logic"). In particular, the comment about
> how there's no point in even testing the page bit any more when you
> get woken up.
>
> I left that
>
>         if (test_bit(key->bit_nr, &key->page->flags))
>                 return -1;
>
> logic in io_async_buf_func() (but it's not in "wake_page_match()" any
> more), but I suspect it's bogus and pointless, for the same reason
> that it isn't done for normal page waits now. Maybe it's better to
> just queue the actual work regardless, it will then be caught in the
> _real_ lock_page() or whatever it ends up doing - and if it only
> really wants to see the "uptodate" bit being set, and was just waiting
> for IO to finish, then it never really cared about the page lock bit
> at all, it just wanted to be notified about IO being done.

I did just notice your rewrite commit, and I'll adjust accordingly and
test it with that too.

> So this was a really long email to tell you - again - that I'm not
> happy with how fragile io_uring is, and how the code seems to be
> almost intentionally written to *be* fragile. Complex and hard to
> understand, and as a result it has had a fairly high rate of fairly
> nasty bugs.
>
> I'm hoping this isn't going to be yet another case of "nasty bugs
> because of complexity and a total disregard for explaining what is
> going on".

Outside of the review from Johannes, lots of other people did look over
the async buffered bits, and Andrew as well said it looked good to him.
So while the task_work retry apparently isn't as obvious as I had
hoped, it's definitely not fragile or intentionally trying to be
obtuse. I'll make a few adjustments based on your feedback, and add a
patch with some comments as well. Hopefully that'll make the end result
easier to follow.

-- 
Jens Axboe
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 22:30 ` Jens Axboe
@ 2020-08-03 23:18   ` Jens Axboe
  2020-08-03 23:31     ` Jens Axboe
  2020-08-03 23:34     ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread

From: Jens Axboe @ 2020-08-03 23:18 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, linux-kernel

On 8/3/20 4:30 PM, Jens Axboe wrote:
>> Adding random kiocb helper functions to a core header file, when they
>> are only used in one place, and when they only make sense in that one
>> place?
>>
>> Not ok.
>
> I'll move that into io_uring instead.

I see that you handled most of the complaints already, so thanks for
that. I've run some basic testing with master and it works for me,
and I'm running some more testing on production too.

I took a look at the rewrite you queued up, and made a matching change
on the io_uring side:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=419ebeb6f2d0d56f6b2844c0f77034d1048e37e9

and also queued a documentation patch for the retry logic and the
callback handler:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8

For the latter, let me know if you're happy with the explanation, or if
you want other parts documented more thoroughly too. I'll make a pass
through the file in any case once I've flushed out the other branches
for this merge window.

-- 
Jens Axboe
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:18 ` Jens Axboe
@ 2020-08-03 23:31   ` Jens Axboe
  2020-08-03 23:49     ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread

From: Jens Axboe @ 2020-08-03 23:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: io-uring, linux-kernel

On 8/3/20 5:18 PM, Jens Axboe wrote:
> On 8/3/20 4:30 PM, Jens Axboe wrote:
>>> Adding random kiocb helper functions to a core header file, when they
>>> are only used in one place, and when they only make sense in that one
>>> place?
>>>
>>> Not ok.
>>
>> I'll move that into io_uring instead.
>
> I see that you handled most of the complaints already, so thanks for
> that. I've run some basic testing with master and it works for me,
> running some more testing on production too.
>
> I took a look at the rewrite you queued up, and made a matching change
> on the io_uring side:
>
> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=419ebeb6f2d0d56f6b2844c0f77034d1048e37e9

Updated to honor the exclusive return value as well:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=f6fd3784c9f7d3309507fdb6dcc818f54467bf3e

-- 
Jens Axboe
* Re: [GIT PULL] io_uring changes for 5.9-rc1
  2020-08-03 23:31 ` Jens Axboe
@ 2020-08-03 23:49   ` Linus Torvalds
  2020-08-03 23:56     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread

From: Linus Torvalds @ 2020-08-03 23:49 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, linux-kernel

On Mon, Aug 3, 2020 at 4:31 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Updated to honor exclusive return value as well:

See my previous email. You're just adding code that makes no sense,
because your wait entry fundamentally isn't an exclusive one.

So all that code is a no-op and only makes it more confusing to read.

Your wakeup handler has _nothing_ to do with the generic
wake_page_function(). There is _zero_ overlap. Your wakeup handler
gets called only for the wait entries _you_ created.

Trying to use the wakeup logic from wake_page_function() makes no
sense, because the rules for wake_page_function() are entirely
different. Yes, they are called for the same thing (somebody unlocked
a page and is waking up waiters), but it's using a completely
different sleeping logic.

See? When wake_page_function() does that

        wait->flags |= WQ_FLAG_WOKEN;

and does something different (and returns different values) depending
on whether WQ_FLAG_EXCLUSIVE was set, that is all because the
wait_on_page_bit_common() entry set up that wait entry (on its stack)
with those exact rules in mind.

So the wakeup function is 1:1 tied to the code that registers the wait
entry. wait_on_page_bit_common() has one set of rules, that are then
honored by the wakeup function it uses. But those rules have _zero_
impact on your use. You can have - and you *do* have - different sets
of rules.

For example, none of your wakeups are ever exclusive. All you do is
make a work runnable - that doesn't mean that other people shouldn't
do other things when they get a "page was unlocked" wakeup
notification.

Also, for you "list_del_init()" is fine, because you never do the
unlocked "list_empty_careful()" on that wait entry.
All the waitqueue operations run under the queue head lock. So what I think you _should_ do is just something like this: diff --git a/fs/io_uring.c b/fs/io_uring.c index 2a3af95be4ca..1e243f99643b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2965,10 +2965,10 @@ static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode, if (!wake_page_match(wpq, key)) return 0; - /* Stop waking things up if the page is locked again */ - if (test_bit(key->bit_nr, &key->page->flags)) - return -1; - + /* + * Somebody unlocked the page. Unqueue the wait entry + * and run the task_work + */ list_del_init(&wait->entry); init_task_work(&req->task_work, io_req_task_submit); because that matches what you're actually doing. There's no reason to stop waking up others because the page is locked, because you don't know what others want. And there's never any reason for the exclusive thing, b3ecause none of what you do guarantees that you take exclusive ownership of the page lock. Running the work *may* end up doing a "lock_page()", but you don't actually guarantee that. Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [GIT PULL] io_uring changes for 5.9-rc1 2020-08-03 23:49 ` Linus Torvalds @ 2020-08-03 23:56 ` Jens Axboe 2020-08-04 0:11 ` Linus Torvalds 0 siblings, 1 reply; 13+ messages in thread From: Jens Axboe @ 2020-08-03 23:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-kernel On 8/3/20 5:49 PM, Linus Torvalds wrote: > On Mon, Aug 3, 2020 at 4:31 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> Updated to honor exclusive return value as well: > > See my previous email, You're just adding code that makes no sense, > because your wait entry fundamentally isn't an exclusive one. Right, I get that now, it's just dead code for my use case. It was sent out before your previous email. > So all that code is a no-op and only makes it more confusing to read. > > Your wakeup handler has _nothing_ to do with the generic > wake_page_function(). There is _zero_ overlap. Your wakeup handler > gets called only for the wait entries _you_ created. > > Trying to use the wakeup logic from wake_page_function() makes no > sense, because the rules for wake_page_function() are entirely > different. Yes, they are called for the same thing (somebody unlocked > a page and is waking up waiters), but it's using a completely > different sleeping logic. > > See? When wake_page_function() does that > > wait->flags |= WQ_FLAG_WOKEN; > > and does something different (and returns different values) depending > on whether WQ_FLAG_EXCLUSIVE was set, that is all because > wait_on_page_bit_common() entry set yo that wait entry (on its stack) > with those exact rules in mind. > > So the wakeup function is 1:1 tied to the code that registers the wait > entry. wait_on_page_bit_common() has one set of rules, that are then > honored by the wakeup function it uses. But those rules have _zero_ > impact on your use. You can have - and you *do* have - different sets > of rules. > > For example, none of your wakeups are ever exclusive. 
All you do is > make a work runnable - that doesn't mean that other people shouldn't > do other things when they get a "page was unlocked" wakeup > notification. > > Also, for you "list_del_init()" is fine, because you never do the > unlocked "list_empty_careful()" on that wait entry. All the waitqueue > operations run under the queue head lock. > > So what I think you _should_ do is just something like this:
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2a3af95be4ca..1e243f99643b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2965,10 +2965,10 @@ static int io_async_buf_func(struct wait_queue_entry *wait, unsigned mode,
> 	if (!wake_page_match(wpq, key))
> 		return 0;
>
> -	/* Stop waking things up if the page is locked again */
> -	if (test_bit(key->bit_nr, &key->page->flags))
> -		return -1;
> -
> +	/*
> +	 * Somebody unlocked the page. Unqueue the wait entry
> +	 * and run the task_work
> +	 */
> 	list_del_init(&wait->entry);
>
> 	init_task_work(&req->task_work, io_req_task_submit);
>
> because that matches what you're actually doing. > > There's no reason to stop waking up others because the page is locked, > because you don't know what others want. > > And there's never any reason for the exclusive thing, because none of > what you do guarantees that you take exclusive ownership of the page > lock. Running the work *may* end up doing a "lock_page()", but you > don't actually guarantee that.

What I ended up with after the last email was just removing the test bit:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=cbd287c09351f1d3a4b3cb9167a2616a11390d32

and I clarified the comments on the io_async_buf_func() to add more hints on how everything is triggered instead of just a vague "handler" reference:

https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=c1dd91d16246b168b80af9b64c5cc35a66410455

-- Jens Axboe
* Re: [GIT PULL] io_uring changes for 5.9-rc1 2020-08-03 23:56 ` Jens Axboe @ 2020-08-04 0:11 ` Linus Torvalds 0 siblings, 0 replies; 13+ messages in thread From: Linus Torvalds @ 2020-08-04 0:11 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, linux-kernel On Mon, Aug 3, 2020 at 4:56 PM Jens Axboe <axboe@kernel.dk> wrote: > > What I ended up with after the last email was just removing the test > bit: > > https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=cbd287c09351f1d3a4b3cb9167a2616a11390d32 > > and I clarified the comments on the io_async_buf_func() to add more > hints on how everything is triggered instead of just a vague "handler" > reference: > > https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=c1dd91d16246b168b80af9b64c5cc35a66410455 These both look sensible to me now. Linus
* Re: [GIT PULL] io_uring changes for 5.9-rc1 2020-08-03 23:18 ` Jens Axboe 2020-08-03 23:31 ` Jens Axboe @ 2020-08-03 23:34 ` Linus Torvalds 2020-08-03 23:43 ` Jens Axboe 1 sibling, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2020-08-03 23:34 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, linux-kernel On Mon, Aug 3, 2020 at 4:18 PM Jens Axboe <axboe@kernel.dk> wrote: > > > I took a look at the rewrite you queued up, and made a matching change > on the io_uring side: Oh, no, you made it worse. Now you're tying your odd wakeup routine to entirely irrelevant things that can't even happen to you. That io_async_buf_func() will never be called for any entry that isn't your own, so testing

	wait->flags & WQ_FLAG_EXCLUSIVE

is completely pointless, because you never set that flag. And similarly, for you to then do

	wait->flags |= WQ_FLAG_WOKEN;

is equally pointless, because the only thing that cares and looks at that wait entry is you, and you don't care about the WOKEN flag. So that patch shows a fundamental misunderstanding of how the waitqueues actually work. Which is kind of my _point_. The io_uring code that hooked into the page wait queues really looks like complete cut-and-paste voodoo programming. It needs comments. It's hard to follow. Even somebody like me, who actually knows how the page wait queues really work, has a really hard time following how io_uring initializing a wait-queue entry and pointing to it in the io ctx then interacts with the (later) generic file reading path, and how it then calls back at unlock time to the io_uring callback _if_ the page was locked. And that patch you point to makes me 100% sure you don't quite understand the code either. So when you do

	/*
	 * Only test the bit if it's an exclusive wait, as we know the
	 * bit is cleared for non-exclusive waits. Also see mm/filemap.c
	 */
	if ((wait->flags & WQ_FLAG_EXCLUSIVE) &&
	    test_and_set_bit(key->bit_nr, &key->page->flags))
		return -1;

the first test guarantees that the second test is never done, which is good, because if it *had* been done, you'd have taken the lock and nothing you have actually expects that. So the fix is to just remove those lines entirely. If somebody unlocked the page you care about, and did a wakeup on that page and bit, then you know you should start the async worker. No amount of testing bits matters at all. And similarly, the

	wait->flags |= WQ_FLAG_WOKEN;

is a no-op because nothing tests that WQ_FLAG_WOKEN bit. That wait entry is _your_ wait entry. It's not the wait entry of some normal page locker - those use wake_page_function(). Now *if* you had workers that actually expected to be woken up with the page lock already held, and owning it, then that kind of WQ_FLAG_EXCLUSIVE and WQ_FLAG_WOKEN logic would be a good idea. But that's not what you have. > and also queued a documentation patch for the retry logic and the > callback handler: > > https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8 Better. Although I find the first comment a bit misleading. You say

	/* Invoked from our "page is now unlocked" handler when someone ..

but that's not really the case. The function gets called by whoever unlocks the page after you've registered that page wait entry through lock_page_async(). So there's no "our handler" anywhere, which I find misleading and confusing in the comment. Linus
* Re: [GIT PULL] io_uring changes for 5.9-rc1 2020-08-03 23:34 ` Linus Torvalds @ 2020-08-03 23:43 ` Jens Axboe 0 siblings, 0 replies; 13+ messages in thread From: Jens Axboe @ 2020-08-03 23:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: io-uring, linux-kernel On 8/3/20 5:34 PM, Linus Torvalds wrote: > On Mon, Aug 3, 2020 at 4:18 PM Jens Axboe <axboe@kernel.dk> wrote: >> >> >> I took a look at the rewrite you queued up, and made a matching change >> on the io_uring side: > > Oh, no, you made it worse. > > Now you're tying your odd wakeup routine to entirely irrelevant things > that can't even happen to you. > > That io_async_buf_func() will never be called for any entry that isn't > your own, so testing > > wait->flags & WQ_FLAG_EXCLUSIVE > > is completely pointless, because you never set that flag. And > similarly, for you to then do > > wait->flags |= WQ_FLAG_WOKEN; > > is equally pointless, because the only thing that cares and looks at > that wait entry is you, and you don't care about the WOKEN flag. > > So that patch shows a fundamental misunderstanding of how the > waitqueues actually work. > > Which is kind of my _point_. The io_uring code that hooked into the > page wait queues really looks like complete cut-and-paste voodoo > programming. > > It needs comments. It's hard to follow. Even somebody like me, who > actually knows how the page wait queues really work, have a really > hard time following how io_uring initializing a wait-queue entry and > pointing to it in the io ctx then interacts with the (later) generic > file reading path, and how it then calls back at unlock time to the > io_uring callback _if_ the page was locked. > > And that patch you point to makes me 100% sure you don't quite > understand the code either. > > So when you do > > /* > * Only test the bit if it's an exclusive wait, as we know the > * bit is cleared for non-exclusive waits. 
Also see mm/filemap.c > */ > if ((wait->flags & WQ_FLAG_EXCLUSIVE) && > test_and_set_bit(key->bit_nr, &key->page->flags)) > return -1; > > the first test guarantees that the second test is never done, which is > good, because if it *had* been done, you'd have taken the lock and > nothing you have actually expects that. > > So the fix is to just remove those lines entirely. If somebody > unlocked the page you care about, and did a wakeup on that page and > bit, then you know you should start the async worker. No amount of > testing bits matters at all. > > And similarly, the > > wait->flags |= WQ_FLAG_WOKEN; > > is a no-op because nothing tests that WQ_FLAG_WOKEN bit. That wait > entry is _your_ wait entry. It's not the wait entry of some normal > page locker - those use wake_page_function(). > > Now *if* you had workers that actually expected to be woken up with > the page lock already held, and owning it, then that kind of > WQ_FLAG_EXCLUSIVE and WQ_FLAG_WOKEN logic would be a good idea. But > that's not what you have. Yes, looks like I was a bit too trigger happy without grokking the whole thing, and got it mixed up with the broader, more generic waitqueue cases. Thanks for clueing me in; I've updated the patch so the use case is in line with only what io_uring is doing here. >> and also queued a documentation patch for the retry logic and the >> callback handler: >> >> https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.9&id=9541a9d4791c2d31ba74b92666edd3f1efd936a8 > Better. Although I find the first comment a bit misleading. > > You say > > /* Invoked from our "page is now unlocked" handler when someone .. > > but that's not really the case. The function gets called by whoever > unlocks the page after you've registered that page wait entry through > lock_page_async(). > > So there's no "our handler" anywhere, which I find misleading and > confusing in the comment.
The 'handler' refers to the io_uring waitqueue callback, but I should probably spell that out. I'll adjust it. -- Jens Axboe
end of thread, other threads:[~2020-08-04 0:12 UTC | newest] Thread overview: 13+ messages -- 2020-08-02 21:41 [GIT PULL] io_uring changes for 5.9-rc1 Jens Axboe 2020-08-03 20:48 ` Linus Torvalds 2020-08-03 20:53 ` Linus Torvalds 2020-08-03 21:10 ` Konstantin Ryabitsev 2020-08-03 22:31 ` Jens Axboe 2020-08-03 22:30 ` Jens Axboe 2020-08-03 23:18 ` Jens Axboe 2020-08-03 23:31 ` Jens Axboe 2020-08-03 23:49 ` Linus Torvalds 2020-08-03 23:56 ` Jens Axboe 2020-08-04 0:11 ` Linus Torvalds 2020-08-03 23:34 ` Linus Torvalds 2020-08-03 23:43 ` Jens Axboe