From: "Stefan Bühler" <source@stbuehler.de>
To: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: io_uring memory ordering (and broken queue-is-full checks)
Date: Sun, 14 Apr 2019 18:44:19 +0200 [thread overview]
Message-ID: <608965d0-84e0-4d96-8835-afab2b6f2a1a@stbuehler.de> (raw)
Hi all,
For the record my review was done with commit 4443f8e6ac:
Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block
I'm also rather new to memory ordering problems.
I didn't look into whether the CQ and the SQ need to be synced in any
way to prevent CQ overflows; my apologies if I deemed barriers
unnecessary if they were designed for this (to be honest though: such
barriers should be documented accordingly).
All in all the io_uring memory ordering instructions look rather broken
to me.
This starts with the initial description at the top of io_uring.c:
> [...] When the application reads the CQ ring
> tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
> the kernel uses after writing the tail. Failure to do so could cause a
> delay in when the application notices that completion events available.
> This isn't a fatal condition. [...]
I disagree: a read barrier before reading tail is not useful, and a read
barrier afterwards is NOT optional, otherwise you might read old CQ
entries (this is NOT the same as getting completion events "late" - you
get broken ones). `smp_load_acquire(tail)` is the correct choice imho
(although we don't need the store ordering).
The kernel already uses the matching `smp_store_release` to write the CQ
tail; the `smp_wmb()` following it is imho not needed.
> [...] Likewise, the application must use an
> appropriate smp_wmb() both before writing the SQ tail, and after writing
> the SQ tail. The first one orders the sqe writes with the tail write, and
> the latter is paired with the smp_rmb() the kernel will issue before
> reading the SQ tail on submission.
The write barrier before is required of course, but the one after is not
needed imho; what would it order against? Userspace doesn't have
significant writes after it.
Again imho the correct choice here would be `smp_store_release` to write
the tail (although we don't need the store ordering).
Now to the actual code:
`io_get_cqring` uses a read barrier before reading the CQ head; this
looks unnecessary to me. I'd again use `smp_load_acquire(head)` here,
but I guess as the conditional depends on the load of head subsequent
stores might already be ordered.
`io_get_sqring` too uses a read barrier before reading the SQ tail,
which looks unnecessary to me. But this time the following reads *need*
to be ordered after reading the SQ tail, so either a read barrier after
reading SQ tail or `smp_load_acquire(tail)` is needed.
The `smp_wmb()` at the end of `io_get_sqring` is not needed imho; the
store to `dropped` only needs to be visible once SQ head gets written,
which uses `smp_store_release` and already syncs previous stores.
Setting `IORING_SQ_NEED_WAKEUP` in `io_sq_thread` needs a full barrier
(`smp_mb()`) afterwards though: the store to flags needs to come before
the load of the SQ tail. The existing `smp_wmb()` in `io_sq_thread` and
the `smp_rmb()` in `io_get_sqring` are NOT a full barrier (afaik).
(This needs a full barrier in userspace too to check for.)
Resetting `IORING_SQ_NEED_WAKEUP` doesn't need any barrier though: it is
best effort anyway, and an additional wakeup must not break anything.
`io_cqring_wait`: imho this can return even when `min_events` is not
fulfilled (because userspace might read from the queue just before it
returns). So the only important thing is that it doesn't block once
`min_events` events have been queued in CQ.
It uses read barriers before `io_cqring_events`: the first one isn't
needed because it isn't critical (and there is nothing to sync with),
and `prepare_to_wait` already includes a full barrier.
The `smp_rmb()` in `io_uring_poll` actually looks correct; imho
`poll_wait` should include a read barrier (lots of usages look like they
assume it does), but I couldn't find it. The other end `wq_has_sleeper`
has a full memory barrier to sync with, and the `smp_wmb()` in
`io_commit_cqring` preceding it is not needed imho.
The "userspace" liburing looks even worse. For starters,
`io_uring_get_completion` cannot safely return a pointer to the CQ entry
*AND* increment head: you need to actually read the entry before
incrementing head.
Last but not least the kernel side has two lines it checks whether a
queue is full or not and uses `tail + 1 == head`: this is only true if
there were U32_MAX entries in the queue. You should check `(tail -
head) == SIZE` instead.
cheers,
Stefan
PS: I'm not subscribed to the lists, so please keep me in CC when you
expect me to read it :)
PPS:
I came up with some pseudo-code for the basic queue handling (as I think
it should look like), maybe this helps:
queue writer (writes tail, reads head):
```
cached_head := load_acquire(p_head)
while want_write_more_entries() && (local_tail - cached_head) < SIZE:
entry[MASKED(local_tail++)] = ...
store_release(p_tail, local_tail)
// userspace: if we need to check whether kernel thread is sleeping
// and needs wakeup (IORING_SETUP_SQPOLL)
smp_barrier(); // tail store vs flags load: both read and write barrier
if load(p_flags) & NEED_WAKEUP:
wakeup_kernel_thread()
```
queue reader (writes head, reads tail):
```
cached_tail := load_acquire(tail)
handle_entries:
while want_read_more_entries() && local_head != cached_tail:
yield entry[MASKED(local_head++)]
store_release(p_head, local_head)
// kernel side: sq read thread (IORING_SETUP_SQPOLL) can go sleeping;
// userspace checks flag *after* storing tail whether a wakeup is needed
prepare_to_wait(...);
flags |= NEED_WAKEUP
smp_barrier(); // flags store vs tail load: both read and write barrier
cached_tail := load_acquire(tail)
if local_head != cached_tail:
flags &= ~NEED_WAKEUP
goto handle_entries
else:
sleep()
```
next reply other threads:[~2019-04-14 16:52 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-14 16:44 Stefan Bühler [this message]
2019-04-17 15:02 ` io_uring memory ordering (and broken queue-is-full checks) Jens Axboe
2019-04-19 9:29 ` Stefan Bühler
2019-04-19 9:57 ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
2019-04-19 9:57 ` [PATCH v1 2/3] io_uring: fix race condition when sq threads goes sleeping Stefan Bühler
2019-04-19 9:57 ` [PATCH v1 3/3] io_uring: fix poll full SQ detection Stefan Bühler
2019-04-22 17:04 ` io_uring memory ordering (and broken queue-is-full checks) Jens Axboe
2019-04-24 21:54 ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
2019-04-24 21:54 ` [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper Stefan Bühler
2019-04-24 21:54 ` [PATCH barrier cleanup v1 3/7] io_uring: remove unnecessary barrier before reading cq head Stefan Bühler
2019-04-24 21:54 ` [PATCH barrier cleanup v1 4/7] io_uring: remove unnecessary barrier after updating SQ head Stefan Bühler
2019-04-24 21:54 ` [PATCH barrier cleanup v1 5/7] io_uring: remove unnecessary barrier before reading SQ tail Stefan Bühler
2019-04-24 21:54 ` [PATCH barrier cleanup v1 6/7] io_uring: remove unnecessary barrier after incrementing dropped counter Stefan Bühler
2019-04-24 21:54 ` [PATCH barrier cleanup v1 7/7] io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP Stefan Bühler
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=608965d0-84e0-4d96-8835-afab2b6f2a1a@stbuehler.de \
--to=source@stbuehler.de \
--cc=axboe@kernel.dk \
--cc=linux-block@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).