linux-fsdevel.vger.kernel.org archive mirror
* io_uring memory ordering (and broken queue-is-full checks)
@ 2019-04-14 16:44 Stefan Bühler
  2019-04-17 15:02 ` Jens Axboe
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Bühler @ 2019-04-14 16:44 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi all,

For the record my review was done with commit 4443f8e6ac:
    Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block

I'm also rather new to memory ordering problems.

I didn't look into whether the CQ and the SQ need to be synced in any
way to prevent CQ overflows; my apologies if I deemed barriers
unnecessary if they were designed for this (to be honest though: such
barriers should be documented accordingly).

All in all the io_uring memory ordering instructions look rather broken
to me.

This starts with the initial description at the top of io_uring.c:

> [...] When the application reads the CQ ring
> tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
> the kernel uses after writing the tail. Failure to do so could cause a
> delay in when the application notices that completion events available.
> This isn't a fatal condition. [...]

I disagree: a read barrier before reading tail is not useful, and a read
barrier afterwards is NOT optional, otherwise you might read old CQ
entries (this is NOT the same as getting completion events "late" - you
get broken ones).  `smp_load_acquire(tail)` is the correct choice imho
(although we don't need the store ordering).

The kernel already uses the matching `smp_store_release` to write the CQ
tail; the `smp_wmb()` following it is imho not needed.
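
To make the CQ side concrete, here is a minimal userspace sketch of the
read side, using C11 atomics as stand-ins for smp_load_acquire /
smp_store_release; the ring-view struct and the cq_pop name are
illustrative only, not the actual liburing layout:

```
#include <stdatomic.h>
#include <stdint.h>
#include <linux/io_uring.h>

/* Illustrative view of the mmap'ed CQ ring (not the real liburing struct). */
struct cq_view {
	_Atomic uint32_t *head;		/* written by the application */
	_Atomic uint32_t *tail;		/* written by the kernel */
	uint32_t ring_mask;
	struct io_uring_cqe *cqes;
};

/* Copy out one completion; returns 0 if the ring is empty. */
static int cq_pop(struct cq_view *cq, struct io_uring_cqe *out)
{
	uint32_t head = atomic_load_explicit(cq->head, memory_order_relaxed);
	/* acquire pairs with the kernel's smp_store_release on tail: the
	 * cqe contents must not be read before the tail value */
	uint32_t tail = atomic_load_explicit(cq->tail, memory_order_acquire);

	if (head == tail)
		return 0;
	*out = cq->cqes[head & cq->ring_mask];	/* read the entry first ... */
	/* ... then publish the new head so the kernel may reuse the slot */
	atomic_store_explicit(cq->head, head + 1, memory_order_release);
	return 1;
}
```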

> [...] Likewise, the application must use an
> appropriate smp_wmb() both before writing the SQ tail, and after writing
> the SQ tail. The first one orders the sqe writes with the tail write, and
> the latter is paired with the smp_rmb() the kernel will issue before
> reading the SQ tail on submission.

The write barrier before is required of course, but the one after is not
needed imho; what would it order against? Userspace doesn't have
significant writes after it.

Again imho the correct choice here would be `smp_store_release` to write
the tail (although we don't need the store ordering).
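
The SQ side mirrors this; a hedged sketch (again with C11 atomics
standing in for smp_store_release, and assuming sq->array already maps
index i to sqe slot i; struct and function names are made up):

```
#include <stdatomic.h>
#include <stdint.h>
#include <linux/io_uring.h>

/* Illustrative view of the mmap'ed SQ ring (not the real liburing struct). */
struct sq_view {
	_Atomic uint32_t *tail;		/* written by the application */
	uint32_t ring_mask;
	struct io_uring_sqe *sqes;
	uint32_t local_tail;		/* application-private copy of tail */
};

static void sq_push(struct sq_view *sq, const struct io_uring_sqe *sqe)
{
	/* fill the entry first ... */
	sq->sqes[sq->local_tail & sq->ring_mask] = *sqe;
	sq->local_tail++;
	/* ... then publish it: release orders the sqe stores before the tail
	 * store; no barrier *after* this store is needed */
	atomic_store_explicit(sq->tail, sq->local_tail, memory_order_release);
}
```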

Now to the actual code:

`io_get_cqring` uses a read barrier before reading the CQ head; this
looks unnecessary to me.  I'd again use `smp_load_acquire(head)` here,
but I guess that, since the conditional depends on the load of head,
subsequent stores might already be ordered (control dependency).

`io_get_sqring` too uses a read barrier before reading the SQ tail,
which looks unnecessary to me.  But this time the following reads *need*
to be ordered after reading the SQ tail, so either a read barrier after
reading SQ tail or `smp_load_acquire(tail)` is needed.

The `smp_wmb()` at the end of `io_get_sqring` is not needed imho; the
store to `dropped` only needs to be visible once SQ head gets written,
which uses `smp_store_release` and already syncs previous stores.

Setting `IORING_SQ_NEED_WAKEUP` in `io_sq_thread` needs a full barrier
(`smp_mb()`) afterwards though: the store to flags needs to come before
the load of the SQ tail.  The existing `smp_wmb()` in `io_sq_thread` and
the `smp_rmb()` in `io_get_sqring` are NOT a full barrier (afaik).
(Userspace needs a full barrier too, between writing the SQ tail and
checking the flag.)

Resetting `IORING_SQ_NEED_WAKEUP` doesn't need any barrier though: it is
best effort anyway, and an additional wakeup must not break anything.
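
For IORING_SETUP_SQPOLL the userspace side might look like this (hedged
sketch; the seq_cst fence stands in for a full smp_mb(), and the
io_uring_enter(2) call is only indicated in a comment):

```
#include <stdatomic.h>
#include <stdint.h>
#include <linux/io_uring.h>	/* IORING_SQ_NEED_WAKEUP */

/* After publishing a new SQ tail, decide whether the kernel poll thread
 * went to sleep and needs io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP, NULL).
 */
static int sq_needs_wakeup(_Atomic uint32_t *sq_tail, uint32_t new_tail,
			   _Atomic uint32_t *sq_flags)
{
	atomic_store_explicit(sq_tail, new_tail, memory_order_release);
	/* full barrier: order the tail store before the flags load, pairing
	 * with the kernel's smp_mb() after it sets IORING_SQ_NEED_WAKEUP */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(sq_flags, memory_order_relaxed) &
	       IORING_SQ_NEED_WAKEUP;
}
```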

`io_cqring_wait`: imho this can return even when `min_events` is not
fulfilled (because userspace might read from the queue just before it
returns).  So the only important thing is that it doesn't block once
`min_events` events have been queued in CQ.

It uses read barriers before the `io_cqring_events` calls: the first
one isn't needed because that check isn't critical (and there is
nothing to sync with), and for the check inside the wait loop
`prepare_to_wait` already includes a full barrier.
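
For reference, a hedged sketch of the usual wait-loop shape (not the
actual io_cqring_wait code; the wq and pending() parameters are made
up); the point is that the full barrier implied by set_current_state()
inside prepare_to_wait() already orders the re-check:

```
#include <linux/errno.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

/* Sketch only: wq is the waitqueue the completion path wakes up, and
 * pending() re-reads how many CQ events are available. */
static int cq_wait_sketch(wait_queue_head_t *wq,
			  unsigned int (*pending)(void *), void *ring,
			  unsigned int min_events)
{
	DEFINE_WAIT(wait);
	int ret = 0;

	for (;;) {
		prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);
		/* set_current_state() inside prepare_to_wait() implies a full
		 * barrier, so no extra smp_rmb() is needed before re-checking */
		if (pending(ring) >= min_events)
			break;
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}
		schedule();
	}
	finish_wait(wq, &wait);

	return ret;
}
```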

The `smp_rmb()` in `io_uring_poll` actually looks correct; imho
`poll_wait` should include a read barrier (lots of usages look like they
assume it does), but I couldn't find it.  The other end `wq_has_sleeper`
has a full memory barrier to sync with, and the `smp_wmb()` in
`io_commit_cqring` preceding it is not needed imho.


The "userspace" liburing looks even worse.  For starters,
`io_uring_get_completion` cannot safely return a pointer to the CQ entry
*AND* increment head: you need to actually read the entry before
incrementing head.
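
Splitting the cq_pop() sketch from above into those two steps would
look roughly like this (hypothetical names, not the actual liburing
API; the ring-view struct is again illustrative):

```
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>
#include <linux/io_uring.h>

struct cq_view {
	_Atomic uint32_t *head, *tail;
	uint32_t ring_mask;
	struct io_uring_cqe *cqes;
};

/* Step 1: peek at the next completion (or NULL); head is NOT advanced. */
static struct io_uring_cqe *cq_peek(struct cq_view *cq)
{
	uint32_t head = atomic_load_explicit(cq->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(cq->tail, memory_order_acquire);

	return head == tail ? NULL : &cq->cqes[head & cq->ring_mask];
}

/* Step 2: only after the entry has actually been read, hand the slot back. */
static void cq_seen(struct cq_view *cq)
{
	uint32_t head = atomic_load_explicit(cq->head, memory_order_relaxed);

	/* release orders the entry reads before the head store */
	atomic_store_explicit(cq->head, head + 1, memory_order_release);
}
```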


Last but not least, the kernel side has two places where it checks
whether a queue is full using `tail + 1 == head`: this is only true if
there were U32_MAX entries in the queue.  You should check `(tail -
head) == SIZE` instead.
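
With free-running u32 counters the difference is easy to see (a small
sketch; SIZE stands for ring_entries):

```
#include <stdbool.h>
#include <stdint.h>

/* head and tail are free-running u32 counters; slots are (x & (SIZE - 1)). */
static bool queue_full_broken(uint32_t tail, uint32_t head)
{
	/* only true once the counters have drifted U32_MAX entries apart */
	return tail + 1 == head;
}

static bool queue_full(uint32_t tail, uint32_t head, uint32_t size)
{
	/* unsigned subtraction handles the wrap-around correctly */
	return tail - head == size;
}
```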

cheers,
Stefan

PS: I'm not subscribed to the lists, so please keep me in CC when you
expect me to read it :)

PPS:

I came up with some pseudo-code for the basic queue handling (as I
think it should look); maybe this helps:

queue writer (writes tail, reads head):

```
    cached_head := load_acquire(p_head)
    while want_write_more_entries() && (local_tail - cached_head) < SIZE:
        entry[MASKED(local_tail++)] = ...
    store_release(p_tail, local_tail)

    // userspace: if we need to check whether kernel thread is sleeping
    // and needs wakeup (IORING_SETUP_SQPOLL)
    smp_barrier(); // tail store vs flags load: both read and write barrier
    if load(p_flags) & NEED_WAKEUP:
        wakeup_kernel_thread()
```

queue reader (writes head, reads tail):

```
    cached_tail := load_acquire(p_tail)
handle_entries:
    while want_read_more_entries() && local_head != cached_tail:
        yield entry[MASKED(local_head++)]
    store_release(p_head, local_head)

    // kernel side: the SQ poll thread (IORING_SETUP_SQPOLL) may go to sleep;
    // userspace checks the flag *after* storing tail to see whether a wakeup is needed
    prepare_to_wait(...);
    flags |= NEED_WAKEUP
    smp_barrier(); // flags store vs tail load: both read and write barrier
    cached_tail := load_acquire(p_tail)
    if local_head != cached_tail:
        flags &= ~NEED_WAKEUP
        goto handle_entries
    else:
        sleep()
```


* Re: io_uring memory ordering (and broken queue-is-full checks)
  2019-04-14 16:44 io_uring memory ordering (and broken queue-is-full checks) Stefan Bühler
@ 2019-04-17 15:02 ` Jens Axboe
  2019-04-19  9:29   ` Stefan Bühler
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2019-04-17 15:02 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/14/19 10:44 AM, Stefan Bühler wrote:
> Hi all,
> 
> For the record my review was done with commit 4443f8e6ac:
>     Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block
> 
> I'm also rather new to memory ordering problems.
> 
> I didn't look into whether the CQ and the SQ need to be synced in any
> way to prevent CQ overflows; my apologies if I deemed barriers
> unnecessary if they were designed for this (to be honest though: such
> barriers should be documented accordingly).
> 
> All in all the io_uring memory ordering instructions look rather broken
> to me.
> 
> This starts with the initial description at the top of io_uring.c:
> 
>> [...] When the application reads the CQ ring
>> tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
>> the kernel uses after writing the tail. Failure to do so could cause a
>> delay in when the application notices that completion events available.
>> This isn't a fatal condition. [...]
> 
> I disagree: a read barrier before reading tail is not useful, and a read
> barrier afterwards is NOT optional, otherwise you might read old CQ
> entries (this is NOT the same as getting completion events "late" - you
> get broken ones).  `smp_load_acquire(tail)` is the correct choice imho
> (although we don't need the store ordering).
> 
> The kernel already uses the matching `smp_store_release` to write the CQ
> tail; the `smp_wmb()` following it is imho not needed.
> 
>> [...] Likewise, the application must use an
>> appropriate smp_wmb() both before writing the SQ tail, and after writing
>> the SQ tail. The first one orders the sqe writes with the tail write, and
>> the latter is paired with the smp_rmb() the kernel will issue before
>> reading the SQ tail on submission.
> 
> The write barrier before is required of course, but the one after is not
> needed imho; what would it order against? Userspace doesn't have
> significant writes after it.
> 
> Again imho the correct choice here would be `smp_store_release` to write
> the tail (although we don't need the store ordering).
> 
> Now to the actual code:
> 
> `io_get_cqring` uses a read barrier before reading the CQ head; this
> looks unnecessary to me.  I'd again use `smp_load_acquire(head)` here,
> but I guess as the conditional depends on the load of head subsequent
> stores might already be ordered.
> 
> `io_get_sqring` too uses a read barrier before reading the SQ tail,
> which looks unnecessary to me.  But this time the following reads *need*
> to be ordered after reading the SQ tail, so either a read barrier after
> reading SQ tail or `smp_load_acquire(tail)` is needed.
> 
> The `smp_wmb()` at the end of `io_get_sqring` is not needed imho; the
> store to `dropped` only needs to be visible once SQ head gets written,
> which uses `smp_store_release` and already syncs previous stores.
> 
> Setting `IORING_SQ_NEED_WAKEUP` in `io_sq_thread` needs a full barrier
> (`smp_mb()`) afterwards though: the store to flags needs to come before
> the load of the SQ tail.  The existing `smp_wmb()` in `io_sq_thread` and
> the `smp_rmb()` in `io_get_sqring` are NOT a full barrier (afaik).
> (This needs a full barrier in userspace too to check for.)
> 
> Resetting `IORING_SQ_NEED_WAKEUP` doesn't need any barrier though: it is
> best effort anyway, and an additional wakeup must not break anything.
> 
> `io_cqring_wait`: imho this can return even when `min_events` is not
> fulfilled (because userspace might read from the queue just before it
> returns).  So the only important thing is that it doesn't block once
> `min_events` events have been queued in CQ.
> 
> It uses read barriers before `io_cqring_events`: the first one isn't
> needed because it isn't critical (and there is nothing to sync with),
> and `prepare_to_wait` already includes a full barrier.

Care to turn the pseudo code into a real patch? Would be easier to
digest. But thanks for taking a look at this!

> The "userspace" liburing looks even worse.  For starters,
> `io_uring_get_completion` cannot safely return a pointer to the CQ entry
> *AND* increment head: you need to actually read the entry before
> incrementing head.

That's a good point, we should actually have this split in two to avoid
the case where the kernel will then fill in a new entry for us.

> Last but not least the kernel side has two lines it checks whether a
> queue is full or not and uses `tail + 1 == head`: this is only true if
> there were U32_MAX entries in the queue.  You should check `(tail -
> head) == SIZE` instead.

Good catch, this is a leftover from when the rings were indeed capped by
the mask of the entries. I've fixed this one and pushed it out, and
added a liburing test case for it as well.


-- 
Jens Axboe



* Re: io_uring memory ordering (and broken queue-is-full checks)
  2019-04-17 15:02 ` Jens Axboe
@ 2019-04-19  9:29   ` Stefan Bühler
  2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-19  9:29 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Hi,

On 17.04.19 17:02, Jens Axboe wrote:
> [...]
>
> Care to turn the pseudo code into a real patch? Would be easier to
> digest. But thanks for taking a look at this!

Fair enough; I'm working on it.

I'll start with the actual bugs, and will try to cleanup unneeded
barriers/docs later.

>> The "userspace" liburing looks even worse.  For starters,
>> `io_uring_get_completion` cannot safely return a pointer to the CQ entry
>> *AND* increment head: you need to actually read the entry before
>> incrementing head.
> 
> That's a good point, we should actually have this split in two to avoid
> the case where the kernel will then fill in a new entry for us.

I see you already did so in the liburing repo.

When going through liburing code again, I noticed io_uring_submit
assumes there might be pending submission entries in the SQ queue.

But those entries are not "reserved" in the SQE queue, so
io_uring_get_sqe might overwrite SQE data that the kernel hasn't read
through the SQ yet.

Is there any motivation behind the indirection?  I see two possible ideas:
- allocate fixed SQE entries for operations, and just add them to SQ
  when needed
- have multiple threads fill SQE in parallel, and only add them to SQ
  when done

Are those actually worth the cost?

liburing doesn't look like it is going to take advantage of it, and the
library I'm writing in Rust right now isn't going to either (actually
even using a fixed sq_ndx == sqe_ndx mapping, i.e. it'll only write
sq->array on startup).
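
For reference, the fixed mapping I mean is just a one-time identity
fill of sq->array (hedged sketch; the function name is made up), after
which submission only ever touches the sqes[] slots and the SQ tail:

```
#include <stdint.h>

/* One-time setup: SQ ring index i always refers to sqe slot i. */
static void sq_array_identity_init(uint32_t *sq_array, uint32_t ring_entries)
{
	uint32_t i;

	for (i = 0; i < ring_entries; i++)
		sq_array[i] = i;
}
```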

At least consider documenting the motivation :)

>> Last but not least the kernel side has two lines it checks whether a
>> queue is full or not and uses `tail + 1 == head`: this is only true if
>> there were U32_MAX entries in the queue.  You should check `(tail -
>> head) == SIZE` instead.
> 
> Good catch, this is a leftover from when the rings were indeed capped by
> the mask of the entries. I've fixed this one and pushed it out, and
> added a liburing test case for it as well.

I think you missed the second one in io_uring_poll (looking at your
for-linus branch), so I will send a patch for that too.

cheers,
Stefan


* [PATCH v1 1/3] io_uring: fix race condition reading SQ entries
  2019-04-19  9:29   ` Stefan Bühler
@ 2019-04-19  9:57     ` Stefan Bühler
  2019-04-19  9:57       ` [PATCH v1 2/3] io_uring: fix race condition when sq threads goes sleeping Stefan Bühler
  2019-04-19  9:57       ` [PATCH v1 3/3] io_uring: fix poll full SQ detection Stefan Bühler
  2019-04-22 17:04     ` io_uring memory ordering (and broken queue-is-full checks) Jens Axboe
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
  2 siblings, 2 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-19  9:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

A read memory barrier is required between reading SQ tail and reading
the actual data belonging to the SQ entry.

Userspace needs a matching write barrier between writing SQ entries and
updating SQ tail (using smp_store_release to update tail will do).

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f65f85d89217..96863e4780b7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1739,7 +1739,8 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	head = ctx->cached_sq_head;
 	/* See comment at the top of this file */
 	smp_rmb();
-	if (head == READ_ONCE(ring->r.tail))
+	/* make sure SQ entry isn't read before tail */
+	if (head == smp_load_acquire(&ring->r.tail))
 		return false;
 
 	head = READ_ONCE(ring->array[head & ctx->sq_mask]);
-- 
2.20.1



* [PATCH v1 2/3] io_uring: fix race condition when sq threads goes sleeping
  2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
@ 2019-04-19  9:57       ` Stefan Bühler
  2019-04-19  9:57       ` [PATCH v1 3/3] io_uring: fix poll full SQ detection Stefan Bühler
  1 sibling, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-19  9:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

Reading the SQ tail needs to come after setting IORING_SQ_NEED_WAKEUP in
flags; there is no cheap barrier for ordering a store before a load, a
full memory barrier is required.

Userspace needs a full memory barrier between updating SQ tail and
checking for the IORING_SQ_NEED_WAKEUP too.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 96863e4780b7..912b0b304d90 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1865,7 +1865,8 @@ static int io_sq_thread(void *data)
 
 			/* Tell userspace we may need a wakeup call */
 			ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
-			smp_wmb();
+			/* make sure to read SQ tail after writing flags */
+			smp_mb();
 
 			if (!io_get_sqring(ctx, &sqes[0])) {
 				if (kthread_should_stop()) {
-- 
2.20.1



* [PATCH v1 3/3] io_uring: fix poll full SQ detection
  2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
  2019-04-19  9:57       ` [PATCH v1 2/3] io_uring: fix race condition when sq threads goes sleeping Stefan Bühler
@ 2019-04-19  9:57       ` Stefan Bühler
  1 sibling, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-19  9:57 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

io_uring_poll shouldn't signal EPOLLOUT | EPOLLWRNORM if the queue is
full; the old check would always signal EPOLLOUT | EPOLLWRNORM (unless
there were U32_MAX entries in the SQ queue).

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 912b0b304d90..047a0aa7a58c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2576,7 +2576,8 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	poll_wait(file, &ctx->cq_wait, wait);
 	/* See comment at the top of this file */
 	smp_rmb();
-	if (READ_ONCE(ctx->sq_ring->r.tail) + 1 != ctx->cached_sq_head)
+	if (READ_ONCE(ctx->sq_ring->r.tail) - ctx->cached_sq_head !=
+	    ctx->sq_ring->ring_entries)
 		mask |= EPOLLOUT | EPOLLWRNORM;
 	if (READ_ONCE(ctx->cq_ring->r.head) != ctx->cached_cq_tail)
 		mask |= EPOLLIN | EPOLLRDNORM;
-- 
2.20.1



* Re: io_uring memory ordering (and broken queue-is-full checks)
  2019-04-19  9:29   ` Stefan Bühler
  2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
@ 2019-04-22 17:04     ` Jens Axboe
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
  2 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2019-04-22 17:04 UTC (permalink / raw)
  To: Stefan Bühler, linux-block, linux-fsdevel

On 4/19/19 3:29 AM, Stefan Bühler wrote:
> Hi,
> 
> On 17.04.19 17:02, Jens Axboe wrote:
>> [...]
>>
>> Care to turn the pseudo code into a real patch? Would be easier to
>> digest. But thanks for taking a look at this!
> 
> Fair enough; I'm working on it.
> 
> I'll start with the actual bugs, and will try to cleanup unneeded
> barriers/docs later.

That sounds great, thanks. The three patches look good to me, I've
queued them up.

>>> The "userspace" liburing looks even worse.  For starters,
>>> `io_uring_get_completion` cannot safely return a pointer to the CQ entry
>>> *AND* increment head: you need to actually read the entry before
>>> incrementing head.
>>
>> That's a good point, we should actually have this split in two to avoid
>> the case where the kernel will then fill in a new entry for us.
> 
> I see you already did so in the liburing repo.

Yep

> When going through liburing code again, I noticed io_uring_submit
> assumes there might be pending submission entries in the SQ queue.
> 
> But those entries are not "reserved" in the SQE queue; so
> io_uring_get_sqe might overwrite SQE data that isn't read by the kernel
> through SQ yet.
> 
> Is there any motivation behind the indirection?  I see two possible ideas:
> - allocate fixed SQE entries for operations, and just add them to SQ
>   when needed
> - have multiple threads fill SQE in parallel, and only add them to SQ
>   when done
> 
> Are those actually worth the cost?
> 
> liburing doesn't look like it is going to take advantage of it, and the
> library I'm writing in Rust right now isn't going to either (actually
> even using a fixed sq_ndx == sqe_ndx mapping, i.e. it'll only write
> sq->array on startup).
> 
> At least consider documenting the motivation :)

I'll take a look at this case and see if the complexity is warranted,
or at least make sure that it's exposed in a safe fashion.

>>> Last but not least the kernel side has two lines it checks whether a
>>> queue is full or not and uses `tail + 1 == head`: this is only true if
>>> there were U32_MAX entries in the queue.  You should check `(tail -
>>> head) == SIZE` instead.
>>
>> Good catch, this is a leftover from when the rings were indeed capped by
>> the mask of the entries. I've fixed this one and pushed it out, and
>> added a liburing test case for it as well.
> 
> I think you missed the second one in io_uring_poll (looking at your
> for-linus branch), so I will send a patch for that too.

I did indeed, my grep failed me. Thanks for catching that!

-- 
Jens Axboe



* [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers
  2019-04-19  9:29   ` Stefan Bühler
  2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
  2019-04-22 17:04     ` io_uring memory ordering (and broken queue-is-full checks) Jens Axboe
@ 2019-04-24 21:54     ` Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper Stefan Bühler
                         ` (5 more replies)
  2 siblings, 6 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

The application reading the CQ ring needs a barrier to pair with the
smp_store_release in io_commit_cqring, not the barrier after it.

Also a write barrier *after* writing something (but not *before*
writing anything interesting) doesn't order anything, so an smp_wmb()
after writing SQ tail is not needed.

Additionally consider reading SQ head and writing CQ tail in the notes.

Also add some clarifications how the various other fields in the ring
buffers are used.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0e9fb2cb1984..7025e9830abe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4,15 +4,28 @@
  * supporting fast/efficient IO.
  *
  * A note on the read/write ordering memory barriers that are matched between
- * the application and kernel side. When the application reads the CQ ring
- * tail, it must use an appropriate smp_rmb() to order with the smp_wmb()
- * the kernel uses after writing the tail. Failure to do so could cause a
- * delay in when the application notices that completion events available.
- * This isn't a fatal condition. Likewise, the application must use an
- * appropriate smp_wmb() both before writing the SQ tail, and after writing
- * the SQ tail. The first one orders the sqe writes with the tail write, and
- * the latter is paired with the smp_rmb() the kernel will issue before
- * reading the SQ tail on submission.
+ * the application and kernel side.
+ *
+ * After the application reads the CQ ring tail, it must use an
+ * appropriate smp_rmb() to pair with the smp_wmb() the kernel uses
+ * before writing the tail (using smp_load_acquire to read the tail will
+ * do). It also needs a smp_mb() before updating CQ head (ordering the
+ * entry load(s) with the head store), pairing with an implicit barrier
+ * through a control-dependency in io_get_cqring (smp_store_release to
+ * store head will do). Failure to do so could lead to reading invalid
+ * CQ entries.
+ *
+ * Likewise, the application must use an appropriate smp_wmb() before
+ * writing the SQ tail (ordering SQ entry stores with the tail store),
+ * which pairs with smp_load_acquire in io_get_sqring (smp_store_release
+ * to store the tail will do). And it needs a barrier ordering the SQ
+ * head load before writing new SQ entries (smp_load_acquire to read
+ * head will do).
+ *
+ * When using the SQ poll thread (IORING_SETUP_SQPOLL), the application
+ * needs to check the SQ flags for IORING_SQ_NEED_WAKEUP *after*
+ * updating the SQ tail; a full memory barrier smp_mb() is needed
+ * between.
  *
  * Also see the examples in the liburing library:
  *
@@ -70,20 +83,108 @@ struct io_uring {
 	u32 tail ____cacheline_aligned_in_smp;
 };
 
+/*
+ * This data is shared with the application through the mmap at offset
+ * IORING_OFF_SQ_RING.
+ *
+ * The offsets to the member fields are published through struct
+ * io_sqring_offsets when calling io_uring_setup.
+ */
 struct io_sq_ring {
+	/*
+	 * Head and tail offsets into the ring; the offsets need to be
+	 * masked to get valid indices.
+	 *
+	 * The kernel controls head and the application controls tail.
+	 */
 	struct io_uring		r;
+	/*
+	 * Bitmask to apply to head and tail offsets (constant, equals
+	 * ring_entries - 1)
+	 */
 	u32			ring_mask;
+	/* Ring size (constant, power of 2) */
 	u32			ring_entries;
+	/*
+	 * Number of invalid entries dropped by the kernel due to
+	 * invalid index stored in array
+	 *
+	 * Written by the kernel, shouldn't be modified by the
+	 * application (i.e. get number of "new events" by comparing to
+	 * cached value).
+	 *
+	 * After a new SQ head value was read by the application this
+	 * counter includes all submissions that were dropped reaching
+	 * the new SQ head (and possibly more).
+	 */
 	u32			dropped;
+	/*
+	 * Runtime flags
+	 *
+	 * Written by the kernel, shouldn't be modified by the
+	 * application.
+	 *
+	 * The application needs a full memory barrier before checking
+	 * for IORING_SQ_NEED_WAKEUP after updating the sq tail.
+	 */
 	u32			flags;
+	/*
+	 * Ring buffer of indices into array of io_uring_sqe, which is
+	 * mmapped by the application using the IORING_OFF_SQES offset.
+	 *
+	 * This indirection could e.g. be used to assign fixed
+	 * io_uring_sqe entries to operations and only submit them to
+	 * the queue when needed.
+	 *
+	 * The kernel modifies neither the indices array nor the entries
+	 * array.
+	 */
 	u32			array[];
 };
 
+/*
+ * This data is shared with the application through the mmap at offset
+ * IORING_OFF_CQ_RING.
+ *
+ * The offsets to the member fields are published through struct
+ * io_cqring_offsets when calling io_uring_setup.
+ */
 struct io_cq_ring {
+	/*
+	 * Head and tail offsets into the ring; the offsets need to be
+	 * masked to get valid indices.
+	 *
+	 * The application controls head and the kernel tail.
+	 */
 	struct io_uring		r;
+	/*
+	 * Bitmask to apply to head and tail offsets (constant, equals
+	 * ring_entries - 1)
+	 */
 	u32			ring_mask;
+	/* Ring size (constant, power of 2) */
 	u32			ring_entries;
+	/*
+	 * Number of completion events lost because the queue was full;
+	 * this should be avoided by the application by making sure
+	 * there are not more requests pending than there is space in
+	 * the completion queue.
+	 *
+	 * Written by the kernel, shouldn't be modified by the
+	 * application (i.e. get number of "new events" by comparing to
+	 * cached value).
+	 *
+	 * As completion events come in out of order this counter is not
+	 * ordered with any other data.
+	 */
 	u32			overflow;
+	/*
+	 * Ring buffer of completion events.
+	 *
+	 * The kernel writes completion events fresh every time they are
+	 * produced, so the application is allowed to modify pending
+	 * entries.
+	 */
 	struct io_uring_cqe	cqes[];
 };
 
-- 
2.20.1



* [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
@ 2019-04-24 21:54       ` Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 3/7] io_uring: remove unnecessary barrier before reading cq head Stefan Bühler
                         ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

wq_has_sleeper has a full barrier internally. The smp_rmb barrier in
io_uring_poll synchronizes with it.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7025e9830abe..1f4419f38ef1 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -418,12 +418,6 @@ static void io_commit_cqring(struct io_ring_ctx *ctx)
 		/* order cqe stores with ring update */
 		smp_store_release(&ring->r.tail, ctx->cached_cq_tail);
 
-		/*
-		 * Write sider barrier of tail update, app has read side. See
-		 * comment at the top of this file.
-		 */
-		smp_wmb();
-
 		if (wq_has_sleeper(&ctx->cq_wait)) {
 			wake_up_interruptible(&ctx->cq_wait);
 			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
@@ -2674,7 +2668,9 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
 	__poll_t mask = 0;
 
 	poll_wait(file, &ctx->cq_wait, wait);
-	/* See comment at the top of this file */
+	/* synchronizes with barrier from wq_has_sleeper call in
+	 * io_commit_cqring
+	 */
 	smp_rmb();
 	if (READ_ONCE(ctx->sq_ring->r.tail) - ctx->cached_sq_head !=
 	    ctx->sq_ring->ring_entries)
-- 
2.20.1



* [PATCH barrier cleanup v1 3/7] io_uring: remove unnecessary barrier before reading cq head
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper Stefan Bühler
@ 2019-04-24 21:54       ` Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 4/7] io_uring: remove unnecessary barrier after updating SQ head Stefan Bühler
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

The memory operations before reading cq head are unrelated and we
don't care about their order.

Document that the control dependency in combination with READ_ONCE and
WRITE_ONCE forms a barrier we need.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1f4419f38ef1..2c101230df71 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -431,8 +431,11 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
 	unsigned tail;
 
 	tail = ctx->cached_cq_tail;
-	/* See comment at the top of the file */
-	smp_rmb();
+	/*
+	 * writes to the cq entry need to come after reading head; the
+	 * control dependency is enough as we're using WRITE_ONCE to
+	 * fill the cq entry
+	 */
 	if (tail - READ_ONCE(ring->r.head) == ring->ring_entries)
 		return NULL;
 
-- 
2.20.1



* [PATCH barrier cleanup v1 4/7] io_uring: remove unnecessary barrier after updating SQ head
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 3/7] io_uring: remove unnecessary barrier before reading cq head Stefan Bühler
@ 2019-04-24 21:54       ` Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 5/7] io_uring: remove unnecessary barrier before reading SQ tail Stefan Bühler
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

There is no operation afterwards to order with.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2c101230df71..31357cc0e8e6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1795,12 +1795,6 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
 		 * write new data to them.
 		 */
 		smp_store_release(&ring->r.head, ctx->cached_sq_head);
-
-		/*
-		 * write side barrier of head update, app has read side. See
-		 * comment at the top of this file
-		 */
-		smp_wmb();
 	}
 }
 
-- 
2.20.1



* [PATCH barrier cleanup v1 5/7] io_uring: remove unnecessary barrier before reading SQ tail
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
                         ` (2 preceding siblings ...)
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 4/7] io_uring: remove unnecessary barrier after updating SQ head Stefan Bühler
@ 2019-04-24 21:54       ` Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 6/7] io_uring: remove unnecessary barrier after incrementing dropped counter Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 7/7] io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP Stefan Bühler
  5 siblings, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

There is no operation before to order with.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 31357cc0e8e6..f93a9fca8311 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1828,8 +1828,6 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	 *    though the application is the one updating it.
 	 */
 	head = ctx->cached_sq_head;
-	/* See comment at the top of this file */
-	smp_rmb();
 	/* make sure SQ entry isn't read before tail */
 	if (head == smp_load_acquire(&ring->r.tail))
 		return false;
-- 
2.20.1



* [PATCH barrier cleanup v1 6/7] io_uring: remove unnecessary barrier after incrementing dropped counter
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
                         ` (3 preceding siblings ...)
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 5/7] io_uring: remove unnecessary barrier before reading SQ tail Stefan Bühler
@ 2019-04-24 21:54       ` Stefan Bühler
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 7/7] io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP Stefan Bühler
  5 siblings, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

smp_store_release in io_commit_sqring already orders the store to
dropped before the update to SQ head.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f93a9fca8311..0287f6694e3b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1843,8 +1843,6 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	/* drop invalid entries */
 	ctx->cached_sq_head++;
 	ring->dropped++;
-	/* See comment at the top of this file */
-	smp_wmb();
 	return false;
 }
 
-- 
2.20.1



* [PATCH barrier cleanup v1 7/7] io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP
  2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
                         ` (4 preceding siblings ...)
  2019-04-24 21:54       ` [PATCH barrier cleanup v1 6/7] io_uring: remove unnecessary barrier after incrementing dropped counter Stefan Bühler
@ 2019-04-24 21:54       ` Stefan Bühler
  5 siblings, 0 replies; 14+ messages in thread
From: Stefan Bühler @ 2019-04-24 21:54 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-fsdevel

There is no operation to order with afterwards, and removing the flag is
not critical in any way.

There will always be a "race condition" where the application will
trigger IORING_ENTER_SQ_WAKEUP when it isn't actually needed.

Signed-off-by: Stefan Bühler <source@stbuehler.de>
---
 fs/io_uring.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0287f6694e3b..da084bcbd408 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1966,13 +1966,11 @@ static int io_sq_thread(void *data)
 				finish_wait(&ctx->sqo_wait, &wait);
 
 				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
-				smp_wmb();
 				continue;
 			}
 			finish_wait(&ctx->sqo_wait, &wait);
 
 			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
-			smp_wmb();
 		}
 
 		i = 0;
-- 
2.20.1



Thread overview: 14+ messages
2019-04-14 16:44 io_uring memory ordering (and broken queue-is-full checks) Stefan Bühler
2019-04-17 15:02 ` Jens Axboe
2019-04-19  9:29   ` Stefan Bühler
2019-04-19  9:57     ` [PATCH v1 1/3] io_uring: fix race condition reading SQ entries Stefan Bühler
2019-04-19  9:57       ` [PATCH v1 2/3] io_uring: fix race condition when sq threads goes sleeping Stefan Bühler
2019-04-19  9:57       ` [PATCH v1 3/3] io_uring: fix poll full SQ detection Stefan Bühler
2019-04-22 17:04     ` io_uring memory ordering (and broken queue-is-full checks) Jens Axboe
2019-04-24 21:54     ` [PATCH barrier cleanup v1 1/7] io_uring: fix notes on barriers Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 2/7] io_uring: remove unnecessary barrier before wq_has_sleeper Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 3/7] io_uring: remove unnecessary barrier before reading cq head Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 4/7] io_uring: remove unnecessary barrier after updating SQ head Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 5/7] io_uring: remove unnecessary barrier before reading SQ tail Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 6/7] io_uring: remove unnecessary barrier after incrementing dropped counter Stefan Bühler
2019-04-24 21:54       ` [PATCH barrier cleanup v1 7/7] io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP Stefan Bühler
