Linux-Block Archive on lore.kernel.org
From: Jens Axboe <axboe@kernel.dk>
To: Jann Horn <jannh@google.com>
Cc: linux-aio@kvack.org, linux-block@vger.kernel.org,
	Linux API <linux-api@vger.kernel.org>,
	hch@lst.de, jmoyer@redhat.com, Avi Kivity <avi@scylladb.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 05/19] Add io_uring IO interface
Date: Tue, 12 Feb 2019 16:46:28 -0700
Message-ID: <e70a7451-2fb4-5800-500d-6ae46fbfe5f4@kernel.dk>
In-Reply-To: <CAG48ez3PESTFsySiZ1T2w-+4xo6mqkHRBv+Hgs2BKy-4w+7yug@mail.gmail.com>

On 2/12/19 4:28 PM, Jann Horn wrote:
> On Wed, Feb 13, 2019 at 12:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/12/19 4:11 PM, Jann Horn wrote:
>>> On Wed, Feb 13, 2019 at 12:00 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 2/12/19 3:57 PM, Jann Horn wrote:
>>>>> On Tue, Feb 12, 2019 at 11:52 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 2/12/19 3:45 PM, Jens Axboe wrote:
>>>>>>> On 2/12/19 3:40 PM, Jann Horn wrote:
>>>>>>>> On Tue, Feb 12, 2019 at 11:06 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>
>>>>>>>>> On 2/12/19 3:03 PM, Jens Axboe wrote:
>>>>>>>>>> On 2/12/19 2:42 PM, Jann Horn wrote:
>>>>>>>>>>> On Sat, Feb 9, 2019 at 5:15 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>> On 2/8/19 3:12 PM, Jann Horn wrote:
>>>>>>>>>>>>> On Fri, Feb 8, 2019 at 6:34 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>>>>>> The submission queue (SQ) and completion queue (CQ) rings are shared
>>>>>>>>>>>>>> between the application and the kernel. This eliminates the need to
>>>>>>>>>>>>>> copy data back and forth to submit and complete IO.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IO submissions use the io_uring_sqe data structure, and completions
>>>>>>>>>>>>>> are generated in the form of io_uring_cqe data structures. The SQ
>>>>>>>>>>>>>> ring is an index into the io_uring_sqe array, which makes it possible
>>>>>>>>>>>>>> to submit a batch of IOs without them being contiguous in the ring.
>>>>>>>>>>>>>> The CQ ring is always contiguous, as completion events are inherently
>>>>>>>>>>>>>> unordered, and hence any io_uring_cqe entry can point back to an
>>>>>>>>>>>>>> arbitrary submission.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Two new system calls are added for this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> io_uring_setup(entries, params)
>>>>>>>>>>>>>>         Sets up an io_uring instance for doing async IO. On success,
>>>>>>>>>>>>>>         returns a file descriptor that the application can mmap to
>>>>>>>>>>>>>>         gain access to the SQ ring, CQ ring, and io_uring_sqes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
>>>>>>>>>>>>>>         Initiates IO against the rings mapped to this fd, or waits for
>>>>>>>>>>>>>>         them to complete, or both. The behavior is controlled by the
>>>>>>>>>>>>>>         parameters passed in. If 'to_submit' is non-zero, then we'll
>>>>>>>>>>>>>>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>>>>>>>>>>>>>>         kernel will wait for 'min_complete' events, if they aren't
>>>>>>>>>>>>>>         already available. It's valid to set IORING_ENTER_GETEVENTS
>>>>>>>>>>>>>>         and 'min_complete' == 0 at the same time, this allows the
>>>>>>>>>>>>>>         kernel to return already completed events without waiting
>>>>>>>>>>>>>>         for them. This is useful only for polling, as for IRQ
>>>>>>>>>>>>>>         driven IO, the application can just check the CQ ring
>>>>>>>>>>>>>>         without entering the kernel.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With this setup, it's possible to do async IO with a single system
>>>>>>>>>>>>>> call. Future developments will enable polled IO with this interface,
>>>>>>>>>>>>>> and polled submission as well. The latter will enable an application
>>>>>>>>>>>>>> to do IO without doing ANY system calls at all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For IRQ driven IO, an application only needs to enter the kernel for
>>>>>>>>>>>>>> completions if it wants to wait for them to occur.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Each io_uring is backed by a workqueue, to support buffered async IO
>>>>>>>>>>>>>> as well. We will only punt to an async context if the command would
>>>>>>>>>>>>>> need to wait for IO on the device side. Any data that can be accessed
>>>>>>>>>>>>>> directly in the page cache is done inline. This avoids the slowness
>>>>>>>>>>>>>> issue of usual threadpools, since cached data is accessed as quickly
>>>>>>>>>>>>>> as a sync interface.
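
As a rough illustration of the userspace side of the flow described above,
an application might drive the two syscalls along the lines of the sketch
below. This is a minimal sketch only: the __NR_* numbers are placeholders,
and the mapping size is derived from the sq_off/sq_entries fields of
io_uring_params as defined in this series.

    #include <linux/io_uring.h>    /* header added by this series */
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_io_uring_setup
    #define __NR_io_uring_setup 425    /* placeholder, arch dependent */
    #define __NR_io_uring_enter 426    /* placeholder, arch dependent */
    #endif

    int setup_ring(unsigned int entries)
    {
        struct io_uring_params p;
        int fd;

        memset(&p, 0, sizeof(p));
        fd = syscall(__NR_io_uring_setup, entries, &p);
        if (fd < 0)
            return -1;

        /* Map the SQ ring; the CQ ring and sqe array have their own offsets. */
        size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned int);
        void *sq = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, IORING_OFF_SQ_RING);
        if (sq == MAP_FAILED)
            return -1;

        /* Nothing submitted yet: reap any already completed events, don't wait. */
        syscall(__NR_io_uring_enter, fd, 0, 0, IORING_ENTER_GETEVENTS, NULL, 0);
        return fd;
    }
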
>>>>>>>>>>> [...]
>>>>>>>>>>>>>> +static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +       struct io_kiocb *req;
>>>>>>>>>>>>>> +       ssize_t ret;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       /* enforce forwards compatibility on users */
>>>>>>>>>>>>>> +       if (unlikely(s->sqe->flags))
>>>>>>>>>>>>>> +               return -EINVAL;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       req = io_get_req(ctx);
>>>>>>>>>>>>>> +       if (unlikely(!req))
>>>>>>>>>>>>>> +               return -EAGAIN;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       req->rw.ki_filp = NULL;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       ret = __io_submit_sqe(ctx, req, s, true);
>>>>>>>>>>>>>> +       if (ret == -EAGAIN) {
>>>>>>>>>>>>>> +               memcpy(&req->submit, s, sizeof(*s));
>>>>>>>>>>>>>> +               INIT_WORK(&req->work, io_sq_wq_submit_work);
>>>>>>>>>>>>>> +               queue_work(ctx->sqo_wq, &req->work);
>>>>>>>>>>>>>> +               ret = 0;
>>>>>>>>>>>>>> +       }
>>>>>>>>>>>>>> +       if (ret)
>>>>>>>>>>>>>> +               io_free_req(req);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       return ret;
>>>>>>>>>>>>>> +}
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +static void io_commit_sqring(struct io_ring_ctx *ctx)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +       struct io_sq_ring *ring = ctx->sq_ring;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +       if (ctx->cached_sq_head != ring->r.head) {
>>>>>>>>>>>>>> +               WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
>>>>>>>>>>>>>> +               /* write side barrier of head update, app has read side */
>>>>>>>>>>>>>> +               smp_wmb();
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you elaborate on what this memory barrier is doing? Don't you need
>>>>>>>>>>>>> some sort of memory barrier *before* the WRITE_ONCE(), to ensure that
>>>>>>>>>>>>> nobody sees the updated head before you're done reading the submission
>>>>>>>>>>>>> queue entry? Or is that barrier elsewhere?
>>>>>>>>>>>>
>>>>>>>>>>>> The matching read barrier is in the application, it must do that before
>>>>>>>>>>>> reading ->head for the SQ ring.
>>>>>>>>>>>>
>>>>>>>>>>>> For the other barrier, since the ring->r.head now has a READ_ONCE(),
>>>>>>>>>>>> that should be all we need to ensure that loads are done.
>>>>>>>>>>>
>>>>>>>>>>> READ_ONCE() / WRITE_ONCE are not hardware memory barriers that enforce
>>>>>>>>>>> ordering with regard to concurrent execution on other cores. They are
>>>>>>>>>>> only compiler barriers, influencing the order in which the compiler
>>>>>>>>>>> emits things. (Well, unless you're on alpha, where READ_ONCE() implies
>>>>>>>>>>> a memory barrier that prevents reordering of dependent reads.)
>>>>>>>>>>>
>>>>>>>>>>> As far as I can tell, between the READ_ONCE(ring->array[...]) in
>>>>>>>>>>> io_get_sqring() and the WRITE_ONCE() in io_commit_sqring(), you have
>>>>>>>>>>> no *hardware* memory barrier that prevents reordering against
>>>>>>>>>>> concurrently running userspace code. As far as I can tell, the
>>>>>>>>>>> following could happen:
>>>>>>>>>>>
>>>>>>>>>>>  - The kernel reads from ring->array in io_get_sqring(), then updates
>>>>>>>>>>> the head in io_commit_sqring(). The CPU reorders the memory accesses
>>>>>>>>>>> such that the write to the head becomes visible before the read from
>>>>>>>>>>> ring->array has completed.
>>>>>>>>>>>  - Userspace observes the write to the head and reuses the array slots
>>>>>>>>>>> the kernel has freed with the write, clobbering ring->array before the
>>>>>>>>>>> kernel reads from ring->array.
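
In userspace terms, that scenario is the classic message-passing pattern:
the side consuming a slot has to publish the new head with (at least)
release ordering, and the side reusing the slot has to read the head with
acquire ordering before touching the array. A C11-atomics sketch of that
pattern, with illustrative names that are not taken from the patch:

    #include <stdatomic.h>

    struct sq {
        _Atomic unsigned int head;
        unsigned int array[16];
    };

    /* Consumer ("kernel") side: read the entry, then publish the new head. */
    unsigned int consume(struct sq *r, unsigned int idx)
    {
        unsigned int op = r->array[idx & 15];
        /*
         * Release ordering: the load of array[] above cannot be reordered
         * past this store. In kernel terms this would be
         * smp_store_release(), or smp_mb() followed by WRITE_ONCE().
         */
        atomic_store_explicit(&r->head, idx + 1, memory_order_release);
        return op;
    }

    /* Producer ("application") side: reuse the slot only once the head has
     * moved past it (index wrap-around ignored for brevity). */
    void reuse(struct sq *r, unsigned int idx, unsigned int op)
    {
        while (atomic_load_explicit(&r->head, memory_order_acquire) <= idx)
            ;    /* slot still owned by the consumer */
        r->array[idx & 15] = op;
    }
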
>>>>>>>>>>
>>>>>>>>>> I'd say this is highly theoretical for the normal use case, as we
>>>>>>>>>> will have submitted IO in between. Hence the load must have been done.
>>>>>>>>
>>>>>>>> Sorry, I'm confused. Who is "we", and which load are you referring to?
>>>>>>>> io_sq_thread() goes directly from io_get_sqring() to
>>>>>>>> io_commit_sqring(), with only a conditional io_sqe_needs_user() in
>>>>>>>> between, if the `i == ARRAY_SIZE(sqes)` check triggers. There is no
>>>>>>>> "submitting IO" in the middle.
>>>>>>>
>>>>>>> You are right, the patch I sent IS needed for the sq thread case! It's
>>>>>>> only true for the "normal" case that we don't need the smp_mb() before
>>>>>>> writing the sq ring head, as sqes are fully consumed at that point.
>>>>>
>>>>> Hmm... does that actually matter? As long as you don't have an
>>>>> explicit barrier for this, the CPU could still reorder things, right?
>>>>> Pull the store in front of everything else?
>>>>
>>>> If the IO has been submitted, by definition the loads have completed.
>>>> At that point it should be fine to commit the ring head that the
>>>> application sees.
>>>
>>> What exactly do you mean by "the IO has been submitted"? Are you
>>> talking about interaction with hardware, or about the end of the
>>> syscall, or something else?
>>
>> I mean that the loads from the sqe, which the IO is made of, have been
>> done. That's what we care about here, right? The sqe has either been
>> turned into an io request and has been submitted, or it has been copied.
> 
> But they might not actually be done. AFAIU the CPU is allowed to do
> the WRITE_ONCE of the head before doing any of the reads from the sqe
> - loads and stores you do, as observed by a concurrently executing
> thread, can happen in an order independent of the order in which you
> write them in your code unless you use memory barriers. So the CPU
> might decide to first write the new head, then do the read for
> io_get_sqring(), and then do the __io_submit_sqe(), potentially
> reading e.g. an IORING_OP_NOP opcode that has been written by
> concurrently executing userspace after userspace has observed the
> bumped head.

For that to be possible, we'd need NO ordering in between the IO
submission and when we write the sq ring head. A single spin lock
should do it, right?

It's not that I'm set against adding an smp_mb() to io_commit_sqring(),
but I think we're going off the deep end a little bit here on
theoretical vs what can practically happen.

For the regular IO cases, we will have done at least one lock/unlock
cycle. This is true for nops as well, and poll. The only case that could
potentially NOT have one is fsync: for the case where we punt and don't
add it to existing work, there is no locking in between.

I'll add the smp_mb() for peace of mind.
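
Concretely, that would look something like the below against
io_commit_sqring() (untested sketch; an smp_store_release() of the head
would provide the same ordering guarantee here):

    static void io_commit_sqring(struct io_ring_ctx *ctx)
    {
        struct io_sq_ring *ring = ctx->sq_ring;

        if (ctx->cached_sq_head != ring->r.head) {
            /*
             * Ensure the loads from the sqes are done before the
             * application can see the new head and reuse those
             * entries.
             */
            smp_mb();
            WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
            /* write side barrier of head update, app has read side */
            smp_wmb();
        }
    }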

-- 
Jens Axboe

