From: Jens Axboe <axboe@kernel.dk>
To: Jann Horn <jannh@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	linux-aio@kvack.org, linux-block@vger.kernel.org,
	linux-man <linux-man@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>,
	hch@lst.de, jmoyer@redhat.com, Avi Kivity <avi@scylladb.com>
Subject: Re: [PATCH 05/18] Add io_uring IO interface
Date: Mon, 28 Jan 2019 17:03:13 -0700
Message-ID: <c20a1250-9f3f-a286-9baf-5b71db0c128f@kernel.dk>
In-Reply-To: <CAG48ez17NW0GJVRC6dFcHZTgQifFz5og1XCUbXkHKhr6f=j74Q@mail.gmail.com>

On 1/28/19 4:59 PM, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
>> On 1/28/19 3:32 PM, Jann Horn wrote:
>>> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>> The submission queue (SQ) and completion queue (CQ) rings are shared
>>>> between the application and the kernel. This eliminates the need to
>>>> copy data back and forth to submit and complete IO.
>>>>
>>>> IO submissions use the io_uring_sqe data structure, and completions
>>>> are generated in the form of io_uring_cqe data structures. The SQ
>>>> ring is an index into the io_uring_sqe array, which makes it possible
>>>> to submit a batch of IOs without them being contiguous in the ring.
>>>> The CQ ring is always contiguous, as completion events are inherently
>>>> unordered, and hence any io_uring_cqe entry can point back to an
>>>> arbitrary submission.
>>>>
>>>> Two new system calls are added for this:
>>>>
>>>> io_uring_setup(entries, params)
>>>>         Sets up a context for doing async IO. On success, returns a file
>>>>         descriptor that the application can mmap to gain access to the
>>>>         SQ ring, CQ ring, and io_uring_sqes.
>>>>
>>>> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
>>>>         Initiates IO against the rings mapped to this fd, or waits for
>>>>         them to complete, or both. The behavior is controlled by the
>>>>         parameters passed in. If 'to_submit' is non-zero, then we'll
>>>>         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
>>>>         kernel will wait for 'min_complete' events, if they aren't
>>>>         already available. It's valid to set IORING_ENTER_GETEVENTS
>>>>         and 'min_complete' == 0 at the same time, this allows the
>>>>         kernel to return already completed events without waiting
>>>>         for them. This is useful only for polling, as for IRQ
>>>>         driven IO, the application can just check the CQ ring
>>>>         without entering the kernel.
>>>>
>>>> With this setup, it's possible to do async IO with a single system
>>>> call. Future developments will enable polled IO with this interface,
>>>> and polled submission as well. The latter will enable an application
>>>> to do IO without doing ANY system calls at all.
>>>>
>>>> For IRQ driven IO, an application only needs to enter the kernel for
>>>> completions if it wants to wait for them to occur.
>>>>
>>>> Each io_uring is backed by a workqueue, to support buffered async IO
>>>> as well. We will only punt to an async context if the command would
>>>> need to wait for IO on the device side. Any data that can be accessed
>>>> directly in the page cache is done inline. This avoids the slowness
>>>> issue of usual threadpools, since cached data is accessed as quickly
>>>> as a sync interface.
>>>>
>>>> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
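
For illustration only, here is a minimal userspace sketch of driving the two
system calls described above. The syscall numbers, IORING_ENTER_GETEVENTS, and
struct io_uring_params are taken from this series and should be treated as
placeholders rather than a stable ABI; the mmap() of the rings is elided:

	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	struct io_uring_params p;
	int ring_fd, ret;

	memset(&p, 0, sizeof(p));
	/* ask for a ring with 4 SQ entries */
	ring_fd = syscall(__NR_io_uring_setup, 4, &p);
	if (ring_fd < 0)
		return -1;

	/* mmap() the SQ ring, CQ ring, and sqe array using the offsets
	 * returned in 'p', fill in one sqe, advance the SQ tail, then
	 * submit it and wait for its completion with a single call: */
	ret = syscall(__NR_io_uring_enter, ring_fd, 1, 1,
		      IORING_ENTER_GETEVENTS, NULL, 0);
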
>>> [...]
>>>> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>>> +                     bool force_nonblock)
>>>> +{
>>>> +       struct kiocb *kiocb = &req->rw;
>>>> +       int ret;
>>>> +
>>>> +       kiocb->ki_filp = fget(sqe->fd);
>>>> +       if (unlikely(!kiocb->ki_filp))
>>>> +               return -EBADF;
>>>> +       kiocb->ki_pos = sqe->off;
>>>> +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
>>>> +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
>>>> +       if (sqe->ioprio) {
>>>> +               ret = ioprio_check_cap(sqe->ioprio);
>>>> +               if (ret)
>>>> +                       goto out_fput;
>>>> +
>>>> +               kiocb->ki_ioprio = sqe->ioprio;
>>>> +       } else
>>>> +               kiocb->ki_ioprio = get_current_ioprio();
>>>> +
>>>> +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
>>>> +       if (unlikely(ret))
>>>> +               goto out_fput;
>>>> +       if (force_nonblock) {
>>>> +               kiocb->ki_flags |= IOCB_NOWAIT;
>>>> +               req->flags |= REQ_F_FORCE_NONBLOCK;
>>>> +       }
>>>> +       if (kiocb->ki_flags & IOCB_HIPRI) {
>>>> +               ret = -EINVAL;
>>>> +               goto out_fput;
>>>> +       }
>>>> +
>>>> +       kiocb->ki_complete = io_complete_rw;
>>>> +       return 0;
>>>> +out_fput:
>>>> +       fput(kiocb->ki_filp);
>>>> +       return ret;
>>>> +}
>>> [...]
>>>> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>>>> +                      bool force_nonblock)
>>>> +{
>>>> +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>>>> +       struct kiocb *kiocb = &req->rw;
>>>> +       struct iov_iter iter;
>>>> +       struct file *file;
>>>> +       ssize_t ret;
>>>> +
>>>> +       ret = io_prep_rw(req, sqe, force_nonblock);
>>>> +       if (ret)
>>>> +               return ret;
>>>> +       file = kiocb->ki_filp;
>>>> +
>>>> +       ret = -EBADF;
>>>> +       if (unlikely(!(file->f_mode & FMODE_READ)))
>>>> +               goto out_fput;
>>>> +       ret = -EINVAL;
>>>> +       if (unlikely(!file->f_op->read_iter))
>>>> +               goto out_fput;
>>>> +
>>>> +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
>>>> +       if (ret)
>>>> +               goto out_fput;
>>>> +
>>>> +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
>>>> +       if (!ret) {
>>>> +               ssize_t ret2;
>>>> +
>>>> +               /* Catch -EAGAIN return for forced non-blocking submission */
>>>> +               ret2 = call_read_iter(file, kiocb, &iter);
>>>> +               if (!force_nonblock || ret2 != -EAGAIN)
>>>> +                       io_rw_done(kiocb, ret2);
>>>> +               else
>>>> +                       ret = -EAGAIN;
>>>> +       }
>>>> +       kfree(iovec);
>>>> +out_fput:
>>>> +       if (unlikely(ret))
>>>> +               fput(file);
>>>> +       return ret;
>>>> +}
>>> [...]
>>>> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
>>>> +                          struct sqe_submit *s, bool force_nonblock)
>>>> +{
>>>> +       const struct io_uring_sqe *sqe = s->sqe;
>>>> +       ssize_t ret;
>>>> +
>>>> +       if (unlikely(s->index >= ctx->sq_entries))
>>>> +               return -EINVAL;
>>>> +       req->user_data = sqe->user_data;
>>>> +
>>>> +       ret = -EINVAL;
>>>> +       switch (sqe->opcode) {
>>>> +       case IORING_OP_NOP:
>>>> +               ret = io_nop(req, sqe);
>>>> +               break;
>>>> +       case IORING_OP_READV:
>>>> +               ret = io_read(req, sqe, force_nonblock);
>>>> +               break;
>>>> +       case IORING_OP_WRITEV:
>>>> +               ret = io_write(req, sqe, force_nonblock);
>>>> +               break;
>>>> +       default:
>>>> +               ret = -EINVAL;
>>>> +               break;
>>>> +       }
>>>> +
>>>> +       return ret;
>>>> +}
>>>> +
>>>> +static void io_sq_wq_submit_work(struct work_struct *work)
>>>> +{
>>>> +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
>>>> +       struct sqe_submit *s = &req->submit;
>>>> +       u64 user_data = s->sqe->user_data;
>>>> +       struct io_ring_ctx *ctx = req->ctx;
>>>> +       mm_segment_t old_fs = get_fs();
>>>> +       struct files_struct *old_files;
>>>> +       int ret;
>>>> +
>>>> +        /* Ensure we clear previously set forced non-block flag */
>>>> +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
>>>> +
>>>> +       old_files = current->files;
>>>> +       current->files = ctx->sqo_files;
>>>
>>> I think you're not supposed to twiddle with current->files without
>>> holding task_lock(current).
>>
>> 'current' is the work queue item in this case, do we need to protect
>> against anything else? I can add the locking around the assignments
>> (both places).
> 
> Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> reference to your current files_struct protected only by task_lock();
> and it doesn't use anything like READ_ONCE(), so even if the object
> lifetime is not a problem, get_files_struct() could potentially crash
> due to a double-read (reading task->files twice and assuming that the
> result will be the same). As far as I can tell, this procfs code also
> works on kernel threads.

OK, that does make sense. I've added the locking.
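
For reference, the change amounts to something like the following sketch
around the two assignments in io_sq_wq_submit_work() (a sketch of the intent,
not the exact diff):

	struct files_struct *old_files;

	task_lock(current);
	old_files = current->files;
	current->files = ctx->sqo_files;
	task_unlock(current);

	/* ... run the submission with the borrowed files_struct ... */

	task_lock(current);
	current->files = old_files;
	task_unlock(current);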

>>>> +       if (!mmget_not_zero(ctx->sqo_mm)) {
>>>> +               ret = -EFAULT;
>>>> +               goto err;
>>>> +       }
>>>> +
>>>> +       use_mm(ctx->sqo_mm);
>>>> +       set_fs(USER_DS);
>>>> +
>>>> +       ret = __io_submit_sqe(ctx, req, s, false);
>>>> +
>>>> +       set_fs(old_fs);
>>>> +       unuse_mm(ctx->sqo_mm);
>>>> +       mmput(ctx->sqo_mm);
>>>> +err:
>>>> +       if (ret) {
>>>> +               io_cqring_add_event(ctx, user_data, ret, 0);
>>>> +               io_free_req(req);
>>>> +       }
>>>> +       current->files = old_files;
>>>> +}
>>> [...]
>>>> +static int io_sq_offload_start(struct io_ring_ctx *ctx)
>>>> +{
>>>> +       int ret;
>>>> +
>>>> +       ctx->sqo_mm = current->mm;
>>>
>>> What keeps this thing alive?
>>
>> I think we're dealing with the same thing as the files below; I'll
>> defer to that.
>>
>>>> +       /*
>>>> +        * This is safe since 'current' has the fd installed, and if that gets
>>>> +        * closed on exit, then fops->release() is invoked which waits for the
>>>> +        * async contexts to flush and exit before exiting.
>>>> +        */
>>>> +       ret = -EBADF;
>>>> +       ctx->sqo_files = current->files;
>>>> +       if (!ctx->sqo_files)
>>>> +               goto err;
>>>
>>> That's gnarly. Adding Al Viro to the thread.
>>>
>>> I think you misunderstand the semantics of f_op->release. The ->flush
>>> handler is invoked whenever a file descriptor is closed through
>>> filp_close() (via deletion of the files_struct, sys_close(),
>>> sys_dup2(), ...), so if you had used that one, _maybe_ this would
>>> work. But the ->release handler only runs when the _last_ reference to
>>> a struct file has been dropped - so you can, for example, fork() a
>>> child, then exit() in the parent, and the ->release handler isn't
>>> invoked. So I don't see how this can work.
>>
>> The anonfd is CLOEXEC. The idea is exactly that it only runs when the
>> last reference to the file has been dropped. Not sure why you think I
>> need ->flush() here?
> 
> Can't I just use fcntl(fd, F_SETFD, 0) to clear the CLOEXEC flag?
> Or send the fd via SCM_RIGHTS?

That would obviously be a problem...
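
For illustration, clearing the close-on-exec flag from userspace is a single
fcntl() call ('ring_fd' below is just a placeholder for the io_uring
descriptor):

	#include <fcntl.h>

	/* drop FD_CLOEXEC so the descriptor survives execve() */
	int flags = fcntl(ring_fd, F_GETFD);

	if (flags >= 0)
		fcntl(ring_fd, F_SETFD, flags & ~FD_CLOEXEC);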

>>> But even if you had abused ->flush for this instead: close_files()
>>> currently has a comment in it that claims that "this is the last
>>> reference to the files structure"; this change would make that claim
>>> untrue.
>>
>> Let me see if I can explain my intent better than that comment... We
>> know the parent who set up the io_uring instance will be around for as
>> long as the io_uring instance persists.
> 
> That's the part that I think is wrong: As far as I can tell, the
> parent can go away and you won't notice.

If that's the case, then the mm/files referencing needs to be looked
over for sure. It's currently relying on the fact that the parent stays
alive. If it can go away without ->release() being called, then we have
issues.

> Also, note that "the parent" is different things for ->files and ->mm.
> You can have a multithreaded process whose threads don't have the same
> ->files, or multiple processes that share ->files without sharing ->mm,
> ...

Of course, I do realize that.
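
To illustrate that decoupling ('worker', 'stack_top' and 'stack_top2' are
hypothetical; glibc clone() wrapper):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>

	/* a thread that shares ->mm with the caller but gets its own
	 * copy of the file table (no CLONE_FILES) */
	int tid = clone(worker, stack_top,
			CLONE_VM | CLONE_SIGHAND | CLONE_THREAD, NULL);

	/* a child that shares ->files with the caller but has its own ->mm */
	int pid = clone(worker, stack_top2, CLONE_FILES | SIGCHLD, NULL);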

>> When we are tearing down the
>> io_uring, we wait for any async contexts (like the one above) to
>> exit. Once they have exited, it's safe to proceed with the exit and
>> tear down ->files[].
> 
> But you only do that teardown on ->release, right? And ->release
> doesn't have much to do with the process lifetime.

Yes, only on ->release().
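
For reference, the ->flush vs ->release distinction raised above, in
file_operations terms (illustrative sketch with made-up names, not code from
this patch):

	static int ring_flush(struct file *file, fl_owner_t id)
	{
		/* runs on every filp_close(): close(), dup2() over the fd,
		 * teardown of any files_struct holding the descriptor */
		return 0;
	}

	static int ring_release(struct inode *inode, struct file *file)
	{
		/* runs once, when the last reference to the struct file
		 * (fork, fget(), SCM_RIGHTS, ...) has been dropped */
		return 0;
	}

	static const struct file_operations ring_fops = {
		.flush		= ring_flush,
		.release	= ring_release,
	};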

-- 
Jens Axboe



Thread overview: 81+ messages
2019-01-28 21:35 [PATCHSET v8] io_uring IO interface Jens Axboe
2019-01-28 21:35 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
2019-01-28 21:35 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
2019-01-28 21:35 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
2019-01-28 21:35 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
2019-01-28 21:35 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
2019-01-28 21:53   ` Jeff Moyer
2019-01-28 21:56     ` Jens Axboe
2019-01-28 22:32   ` Jann Horn
2019-01-28 23:46     ` Jens Axboe
2019-01-28 23:59       ` Jann Horn
2019-01-29  0:03         ` Jens Axboe [this message]
2019-01-29  0:31           ` Jens Axboe
2019-01-29  0:34             ` Jann Horn
2019-01-29  0:55               ` Jens Axboe
2019-01-29  0:58                 ` Jann Horn
2019-01-29  1:01                   ` Jens Axboe
2019-02-01 16:57         ` Matt Mullins
2019-02-01 17:04           ` Jann Horn
2019-02-01 17:23             ` Jann Horn
2019-02-01 18:05               ` Al Viro
2019-01-29  1:07   ` Jann Horn
2019-01-29  2:21     ` Jann Horn
2019-01-29  2:54       ` Jens Axboe
2019-01-29  3:46       ` Jens Axboe
2019-01-29 15:56         ` Jann Horn
2019-01-29 16:06           ` Jens Axboe
2019-01-29  2:21     ` Jens Axboe
2019-01-29  1:29   ` Jann Horn
2019-01-29  1:31     ` Jens Axboe
2019-01-29  1:32       ` Jann Horn
2019-01-29  2:23         ` Jens Axboe
2019-01-29  7:12   ` Bert Wesarg
2019-01-29 12:12   ` Florian Weimer
2019-01-29 13:35     ` Jens Axboe
2019-01-28 21:35 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
2019-01-28 21:35 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
2019-01-29 17:24   ` Christoph Hellwig
2019-01-29 18:31     ` Jens Axboe
2019-01-29 19:10       ` Jens Axboe
2019-01-29 20:35         ` Jeff Moyer
2019-01-29 20:37           ` Jens Axboe
2019-01-28 21:35 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
2019-01-28 21:35 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
2019-01-28 21:56   ` Jann Horn
2019-01-28 22:03     ` Jens Axboe
2019-01-28 21:35 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
2019-01-29 17:26   ` Christoph Hellwig
2019-01-29 18:14     ` Jens Axboe
2019-01-28 21:35 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-01-28 21:35 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-28 23:35   ` Jann Horn
2019-01-28 23:50     ` Jens Axboe
2019-01-29  0:36       ` Jann Horn
2019-01-29  1:25         ` Jens Axboe
2019-01-28 21:35 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-28 21:35 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
2019-01-28 21:35 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
2019-01-29 17:26   ` Christoph Hellwig
2019-01-28 21:35 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-01-28 21:35 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
2019-01-28 21:35 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
     [not found] <20190123153536.7081-1-axboe@kernel.dk>
     [not found] ` <20190123153536.7081-6-axboe@kernel.dk>
2019-01-28 14:57   ` [PATCH 05/18] Add io_uring IO interface Christoph Hellwig
2019-01-28 16:26     ` Jens Axboe
2019-01-28 16:34       ` Christoph Hellwig
2019-01-28 19:32         ` Jens Axboe
2019-01-28 18:25     ` Jens Axboe
2019-01-29  6:30       ` Christoph Hellwig
2019-01-29 11:58         ` Arnd Bergmann
2019-01-29 15:20           ` Jens Axboe
2019-01-29 16:18             ` Arnd Bergmann
2019-01-29 16:19               ` Jens Axboe
2019-01-29 16:26                 ` Arnd Bergmann
2019-01-29 16:28                   ` Jens Axboe
2019-01-29 16:46                     ` Arnd Bergmann
2019-01-29  0:47     ` Andy Lutomirski
2019-01-29  1:20       ` Jens Axboe
2019-01-29  6:45         ` Christoph Hellwig
2019-01-29 12:05           ` Arnd Bergmann
2019-01-31  5:11         ` Andy Lutomirski
2019-01-31 16:37           ` Jens Axboe
