linux-fsdevel.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: "Stefan Bühler" <source@stbuehler.de>,
	linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: io_uring memory ordering (and broken queue-is-full checks)
Date: Wed, 17 Apr 2019 09:02:52 -0600	[thread overview]
Message-ID: <03cdf146-12a7-95de-eb78-31cd7df4bb44@kernel.dk> (raw)
In-Reply-To: <608965d0-84e0-4d96-8835-afab2b6f2a1a@stbuehler.de>

On 4/14/19 10:44 AM, Stefan Bühler wrote:
> Hi all,
> 
> For the record my review was done with commit 4443f8e6ac:
>     Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block
> 
> I'm also rather new to memory ordering problems.
> 
> I didn't look into whether the CQ and the SQ need to be synced in any
> way to prevent CQ overflows; my apologies if I deemed barriers
> unnecessary if they were designed for this (to be honest though: such
> barriers should be documented accordingly).
> 
> All in all the io_uring memory ordering instructions look rather broken
> to me.
> 
> This starts with the initial description at the top of io_uring.c:
> 
>> [...] When the application reads the CQ ring
>> tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
>> the kernel uses after writing the tail. Failure to do so could cause a
>> delay in when the application notices that completion events are
>> This isn't a fatal condition. [...]
> 
> I disagree: a read barrier before reading tail is not useful, and a read
> barrier afterwards is NOT optional, otherwise you might read old CQ
> entries (this is NOT the same as getting completion events "late" - you
> get broken ones).  `smp_load_acquire(tail)` is the correct choice imho
> (although we don't need the store ordering).
> 
> The kernel already uses the matching `smp_store_release` to write the CQ
> tail; the `smp_wmb()` following it is imho not needed.
> 
>> [...] Likewise, the application must use an
>> appropriate smp_wmb() both before writing the SQ tail, and after writing
>> the SQ tail. The first one orders the sqe writes with the tail write, and
>> the latter is paired with the smp_rmb() the kernel will issue before
>> reading the SQ tail on submission.
> 
> The write barrier before is required of course, but the one after is not
> needed imho; what would it order against? Userspace doesn't have
> significant writes after it.
> 
> Again imho the correct choice here would be `smp_store_release` to write
> the tail (although we don't need the store ordering).
> 
> Now to the actual code:
> 
> `io_get_cqring` uses a read barrier before reading the CQ head; this
> looks unnecessary to me.  I'd again use `smp_load_acquire(head)` here,
> but I guess that, since the conditional depends on the load of head,
> subsequent stores might already be ordered.
> 
> `io_get_sqring` too uses a read barrier before reading the SQ tail,
> which looks unnecessary to me.  But this time the following reads *need*
> to be ordered after reading the SQ tail, so either a read barrier after
> reading SQ tail or `smp_load_acquire(tail)` is needed.
> 
> The `smp_wmb()` at the end of `io_get_sqring` is not needed imho; the
> store to `dropped` only needs to be visible once SQ head gets written,
> which uses `smp_store_release` and already syncs previous stores.
> 
> Setting `IORING_SQ_NEED_WAKEUP` in `io_sq_thread` needs a full barrier
> (`smp_mb()`) afterwards though: the store to flags needs to come before
> the load of the SQ tail.  The existing `smp_wmb()` in `io_sq_thread` and
> the `smp_rmb()` in `io_get_sqring` are NOT a full barrier (afaik).
> (Userspace needs a matching full barrier too when it checks this flag.)
> 
> Resetting `IORING_SQ_NEED_WAKEUP` doesn't need any barrier though: it is
> best effort anyway, and an additional wakeup must not break anything.
> 
> `io_cqring_wait`: imho this can return even when `min_events` is not
> fulfilled (because userspace might read from the queue just before it
> returns).  So the only important thing is that it doesn't block once
> `min_events` events have been queued in CQ.
> 
> It uses read barriers before `io_cqring_events`: the first one isn't
> needed because it isn't critical (and there is nothing to sync with),
> and `prepare_to_wait` already includes a full barrier.

Care to turn the pseudo code into a real patch? Would be easier to
digest. But thanks for taking a look at this!

> The "userspace" liburing looks even worse.  For starters,
> `io_uring_get_completion` cannot safely return a pointer to the CQ entry
> *AND* increment head: you need to actually read the entry before
> incrementing head.

That's a good point, we should actually have this split in two to avoid
the case where the kernel will then fill in a new entry for us.

> Last but not least, the kernel side has two lines where it checks
> whether a queue is full using `tail + 1 == head`: this is only true if
> there were U32_MAX entries in the queue.  You should check `(tail -
> head) == SIZE` instead.

Good catch, this is a leftover from when the rings were indeed capped by
the mask of the entries. I've fixed this one and pushed it out, and
added a liburing test case for it as well.


-- 
Jens Axboe



Thread overview: 14+ messages
2019-04-14 16:44 io_uring memory ordering (and broken queue-is-full checks) Stefan Bühler
2019-04-17 15:02 ` Jens Axboe [this message]
2019-04-19  9:29   ` Stefan Bühler
2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
2019-04-19  9:57       ` [PATCH v1 2/3] io_uring: fix race condition when sq threads goes sleeping Stefan Bühler
2019-04-19  9:57       ` [PATCH v1 3/3] io_uring: fix poll full SQ detection Stefan Bühler
2019-04-22 17:04     ` io_uring memory ordering (and broken queue-is-full checks) Jens Axboe
2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 3/7] io_uring: remove unnecessary barrier before reading cq head Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 4/7] io_uring: remove unnecessary barrier after updating SQ head Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 5/7] io_uring: remove unnecessary barrier before reading SQ tail Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 6/7] io_uring: remove unnecessary barrier after incrementing dropped counter Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 7/7] io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP Stefan Bühler
