From: Matt Mullins <mmullins@fb.com>
To: "jannh@google.com" <jannh@google.com>,
"axboe@kernel.dk" <axboe@kernel.dk>
Cc: "linux-aio@kvack.org" <linux-aio@kvack.org>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"jmoyer@redhat.com" <jmoyer@redhat.com>,
"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
"hch@lst.de" <hch@lst.de>,
"viro@zeniv.linux.org.uk" <viro@zeniv.linux.org.uk>,
"linux-man@vger.kernel.org" <linux-man@vger.kernel.org>,
"avi@scylladb.com" <avi@scylladb.com>
Subject: Re: [PATCH 05/18] Add io_uring IO interface
Date: Fri, 1 Feb 2019 16:57:00 +0000 [thread overview]
Message-ID: <05cb18f7a97a6151c305cdb7240c4abc995aed59.camel@fb.com> (raw)
In-Reply-To: <CAG48ez17NW0GJVRC6dFcHZTgQifFz5og1XCUbXkHKhr6f=j74Q@mail.gmail.com>
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 13683 bytes --]
On Tue, 2019-01-29 at 00:59 +0100, Jann Horn wrote:
> On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@kernel.dk> wrote:
> > On 1/28/19 3:32 PM, Jann Horn wrote:
> > > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@kernel.dk> wrote:
> > > > The submission queue (SQ) and completion queue (CQ) rings are shared
> > > > between the application and the kernel. This eliminates the need to
> > > > copy data back and forth to submit and complete IO.
> > > >
> > > > IO submissions use the io_uring_sqe data structure, and completions
> > > > are generated in the form of io_uring_sqe data structures. The SQ
> > > > ring is an index into the io_uring_sqe array, which makes it possible
> > > > to submit a batch of IOs without them being contiguous in the ring.
> > > > The CQ ring is always contiguous, as completion events are inherently
> > > > unordered, and hence any io_uring_cqe entry can point back to an
> > > > arbitrary submission.
> > > >
> > > > Two new system calls are added for this:
> > > >
> > > > io_uring_setup(entries, params)
> > > > Sets up a context for doing async IO. On success, returns a file
> > > > descriptor that the application can mmap to gain access to the
> > > > SQ ring, CQ ring, and io_uring_sqes.
> > > >
> > > > io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
> > > > Initiates IO against the rings mapped to this fd, or waits for
> > > > them to complete, or both. The behavior is controlled by the
> > > > parameters passed in. If 'to_submit' is non-zero, then we'll
> > > > try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> > > > kernel will wait for 'min_complete' events, if they aren't
> > > > already available. It's valid to set IORING_ENTER_GETEVENTS
> > > > and 'min_complete' == 0 at the same time, this allows the
> > > > kernel to return already completed events without waiting
> > > > for them. This is useful only for polling, as for IRQ
> > > > driven IO, the application can just check the CQ ring
> > > > without entering the kernel.
> > > >
> > > > With this setup, it's possible to do async IO with a single system
> > > > call. Future developments will enable polled IO with this interface,
> > > > and polled submission as well. The latter will enable an application
> > > > to do IO without doing ANY system calls at all.
> > > >
> > > > For IRQ driven IO, an application only needs to enter the kernel for
> > > > completions if it wants to wait for them to occur.
> > > >
> > > > Each io_uring is backed by a workqueue, to support buffered async IO
> > > > as well. We will only punt to an async context if the command would
> > > > need to wait for IO on the device side. Any data that can be accessed
> > > > directly in the page cache is done inline. This avoids the slowness
> > > > issue of usual threadpools, since cached data is accessed as quickly
> > > > as a sync interface.
> > > >
> > > > Sample application: https://urldefense.proofpoint.com/v2/url?u=http-3A__git.kernel.dk_cgit_fio_plain_t_io-5Furing.c&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pqM-eO4A2hNFhIFiX-7eGg&m=MGr14pOzNbC7Z-8_dV4GMiH3AbkkH0RSQoQ894Tu0yc&s=mgbcubzOMiCpFpnwW-HA3ey0YDYPkgMIZ7Bmy4w6Chc&e=
> > >
> > > [...]
> > > > +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > + bool force_nonblock)
> > > > +{
> > > > + struct kiocb *kiocb = &req->rw;
> > > > + int ret;
> > > > +
> > > > + kiocb->ki_filp = fget(sqe->fd);
> > > > + if (unlikely(!kiocb->ki_filp))
> > > > + return -EBADF;
> > > > + kiocb->ki_pos = sqe->off;
> > > > + kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> > > > + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> > > > + if (sqe->ioprio) {
> > > > + ret = ioprio_check_cap(sqe->ioprio);
> > > > + if (ret)
> > > > + goto out_fput;
> > > > +
> > > > + kiocb->ki_ioprio = sqe->ioprio;
> > > > + } else
> > > > + kiocb->ki_ioprio = get_current_ioprio();
> > > > +
> > > > + ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> > > > + if (unlikely(ret))
> > > > + goto out_fput;
> > > > + if (force_nonblock) {
> > > > + kiocb->ki_flags |= IOCB_NOWAIT;
> > > > + req->flags |= REQ_F_FORCE_NONBLOCK;
> > > > + }
> > > > + if (kiocb->ki_flags & IOCB_HIPRI) {
> > > > + ret = -EINVAL;
> > > > + goto out_fput;
> > > > + }
> > > > +
> > > > + kiocb->ki_complete = io_complete_rw;
> > > > + return 0;
> > > > +out_fput:
> > > > + fput(kiocb->ki_filp);
> > > > + return ret;
> > > > +}
> > >
> > > [...]
> > > > +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > + bool force_nonblock)
> > > > +{
> > > > + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> > > > + struct kiocb *kiocb = &req->rw;
> > > > + struct iov_iter iter;
> > > > + struct file *file;
> > > > + ssize_t ret;
> > > > +
> > > > + ret = io_prep_rw(req, sqe, force_nonblock);
> > > > + if (ret)
> > > > + return ret;
> > > > + file = kiocb->ki_filp;
> > > > +
> > > > + ret = -EBADF;
> > > > + if (unlikely(!(file->f_mode & FMODE_READ)))
> > > > + goto out_fput;
> > > > + ret = -EINVAL;
> > > > + if (unlikely(!file->f_op->read_iter))
> > > > + goto out_fput;
> > > > +
> > > > + ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> > > > + if (ret)
> > > > + goto out_fput;
> > > > +
> > > > + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> > > > + if (!ret) {
> > > > + ssize_t ret2;
> > > > +
> > > > + /* Catch -EAGAIN return for forced non-blocking submission */
> > > > + ret2 = call_read_iter(file, kiocb, &iter);
> > > > + if (!force_nonblock || ret2 != -EAGAIN)
> > > > + io_rw_done(kiocb, ret2);
> > > > + else
> > > > + ret = -EAGAIN;
> > > > + }
> > > > + kfree(iovec);
> > > > +out_fput:
> > > > + if (unlikely(ret))
> > > > + fput(file);
> > > > + return ret;
> > > > +}
> > >
> > > [...]
> > > > +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > > + struct sqe_submit *s, bool force_nonblock)
> > > > +{
> > > > + const struct io_uring_sqe *sqe = s->sqe;
> > > > + ssize_t ret;
> > > > +
> > > > + if (unlikely(s->index >= ctx->sq_entries))
> > > > + return -EINVAL;
> > > > + req->user_data = sqe->user_data;
> > > > +
> > > > + ret = -EINVAL;
> > > > + switch (sqe->opcode) {
> > > > + case IORING_OP_NOP:
> > > > + ret = io_nop(req, sqe);
> > > > + break;
> > > > + case IORING_OP_READV:
> > > > + ret = io_read(req, sqe, force_nonblock);
> > > > + break;
> > > > + case IORING_OP_WRITEV:
> > > > + ret = io_write(req, sqe, force_nonblock);
> > > > + break;
> > > > + default:
> > > > + ret = -EINVAL;
> > > > + break;
> > > > + }
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static void io_sq_wq_submit_work(struct work_struct *work)
> > > > +{
> > > > + struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> > > > + struct sqe_submit *s = &req->submit;
> > > > + u64 user_data = s->sqe->user_data;
> > > > + struct io_ring_ctx *ctx = req->ctx;
> > > > + mm_segment_t old_fs = get_fs();
> > > > + struct files_struct *old_files;
> > > > + int ret;
> > > > +
> > > > + /* Ensure we clear previously set forced non-block flag */
> > > > + req->flags &= ~REQ_F_FORCE_NONBLOCK;
> > > > +
> > > > + old_files = current->files;
> > > > + current->files = ctx->sqo_files;
> > >
> > > I think you're not supposed to twiddle with current->files without
> > > holding task_lock(current).
> >
> > 'current' is the work queue item in this case, do we need to protect
> > against anything else? I can add the locking around the assignments
> > (both places).
>
> Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> reference to your current files_struct protected only by task_lock();
> and it doesn't use anything like READ_ONCE(), so even if the object
> lifetime is not a problem, get_files_struct() could potentially crash
> due to a double-read (reading task->files twice and assuming that the
> result will be the same). As far as I can tell, this procfs code also
> works on kernel threads.
>
> > > > + if (!mmget_not_zero(ctx->sqo_mm)) {
> > > > + ret = -EFAULT;
> > > > + goto err;
> > > > + }
> > > > +
> > > > + use_mm(ctx->sqo_mm);
> > > > + set_fs(USER_DS);
> > > > +
> > > > + ret = __io_submit_sqe(ctx, req, s, false);
> > > > +
> > > > + set_fs(old_fs);
> > > > + unuse_mm(ctx->sqo_mm);
> > > > + mmput(ctx->sqo_mm);
> > > > +err:
> > > > + if (ret) {
> > > > + io_cqring_add_event(ctx, user_data, ret, 0);
> > > > + io_free_req(req);
> > > > + }
> > > > + current->files = old_files;
> > > > +}
> > >
> > > [...]
> > > > +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + ctx->sqo_mm = current->mm;
> > >
> > > What keeps this thing alive?
> >
> > I think we're deadling with the same thing as the files below, I'll
> > defer to that.
> >
> > > > + /*
> > > > + * This is safe since 'current' has the fd installed, and if that gets
> > > > + * closed on exit, then fops->release() is invoked which waits for the
> > > > + * async contexts to flush and exit before exiting.
> > > > + */
> > > > + ret = -EBADF;
> > > > + ctx->sqo_files = current->files;
> > > > + if (!ctx->sqo_files)
> > > > + goto err;
> > >
> > > That's gnarly. Adding Al Viro to the thread.
> > >
> > > I think you misunderstand the semantics of f_op->release. The ->flush
> > > handler is invoked whenever a file descriptor is closed through
> > > filp_close() (via deletion of the files_struct, sys_close(),
> > > sys_dup2(), ...), so if you had used that one, _maybe_ this would
> > > work. But the ->release handler only runs when the _last_ reference to
> > > a struct file has been dropped - so you can, for example, fork() a
> > > child, then exit() in the parent, and the ->release handler isn't
> > > invoked. So I don't see how this can work.
> >
> > The anonfd is CLOEXEC. The idea is exactly that it only runs when the
> > last reference to the file has been dropped. Not sure why you think I
> > need ->flush() here?
>
> Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag?
> Or send the fd via SCM_RIGHTS?
>
> > > But even if you had abused ->flush for this instead: close_files()
> > > currently has a comment in it that claims that "this is the last
> > > reference to the files structure"; this change would make that claim
> > > untrue.
> >
> > Let me see if I can explain my intent better than that comment... We
> > know the parent who set up the io_uring instance will be around for as
> > long as io_uring instance persists.
>
> That's the part that I think is wrong: As far as I can tell, the
> parent can go away and you won't notice.
>
> Also, note that "the parent" is different things for ->files and ->mm.
> You can have a multithreaded process whose threads don't have the same
> ->files, or multiple process that share ->files without sharing ->mm,
> ...
This had actually been get_files_struct() in early versions, and I had
reported to Jens that it allows something like
int main() {
struct io_uring_params uring_params = {
.flags = IORING_SETUP_SQPOLL,
};
int uring_fd = syscall(425 /* io_uring_setup */, 16, &uring_params);
}
to leak both the files_struct and the kthread, as the files_struct and
the uring context form a circular reference. I haven't really come up
with a good way to reconcile the requirements here; perhaps we need an
exit_uring() akin to exit_aio()?
> > When we are tearing down the
> > io_uring, then we wait for any async contexts (like the one above) to
> > exit. Once they are exited, it's safe to proceed with the exit and
> > teardown ->files[].
>
> But you only do that teardown on ->release, right? And ->release
> doesn't have much to do with the process lifetime.
>
> > That's the idea... Not trying to be clever, some of this dates back to
> > the aio weirdness where it was impossible to have cross references like
> > this, since it would lead to teardown deadlocks with how exit_aio() is
> > used. I can probably grab a struct files reference above, but currently
> > I don't see why it's needed.
N§²æìr¸zǧu©²Æ {\béì¹»\x1c®&Þ)îŨ¨{ayº\x1dÊÚ&j:+v¨öà\x16梷¢ú(¸§»\x10\b:ÇÛiÿü0ÂKÚrJ+ö¢£ðèצj)Z·
next prev parent reply other threads:[~2019-02-01 16:57 UTC|newest]
Thread overview: 81+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-01-28 21:35 [PATCHSET v8] io_uring IO interface Jens Axboe
2019-01-28 21:35 ` [PATCH 01/18] fs: add an iopoll method to struct file_operations Jens Axboe
2019-01-28 21:35 ` [PATCH 02/18] block: wire up block device iopoll method Jens Axboe
2019-01-28 21:35 ` [PATCH 03/18] block: add bio_set_polled() helper Jens Axboe
2019-01-28 21:35 ` [PATCH 04/18] iomap: wire up the iopoll method Jens Axboe
2019-01-28 21:35 ` [PATCH 05/18] Add io_uring IO interface Jens Axboe
2019-01-28 21:53 ` Jeff Moyer
2019-01-28 21:56 ` Jens Axboe
2019-01-28 22:32 ` Jann Horn
2019-01-28 23:46 ` Jens Axboe
2019-01-28 23:59 ` Jann Horn
2019-01-29 0:03 ` Jens Axboe
2019-01-29 0:31 ` Jens Axboe
2019-01-29 0:34 ` Jann Horn
2019-01-29 0:55 ` Jens Axboe
2019-01-29 0:58 ` Jann Horn
2019-01-29 1:01 ` Jens Axboe
2019-02-01 16:57 ` Matt Mullins [this message]
2019-02-01 17:04 ` Jann Horn
2019-02-01 17:23 ` Jann Horn
2019-02-01 18:05 ` Al Viro
2019-01-29 1:07 ` Jann Horn
2019-01-29 2:21 ` Jann Horn
2019-01-29 2:54 ` Jens Axboe
2019-01-29 3:46 ` Jens Axboe
2019-01-29 15:56 ` Jann Horn
2019-01-29 16:06 ` Jens Axboe
2019-01-29 2:21 ` Jens Axboe
2019-01-29 1:29 ` Jann Horn
2019-01-29 1:31 ` Jens Axboe
2019-01-29 1:32 ` Jann Horn
2019-01-29 2:23 ` Jens Axboe
2019-01-29 7:12 ` Bert Wesarg
2019-01-29 12:12 ` Florian Weimer
2019-01-29 13:35 ` Jens Axboe
2019-01-28 21:35 ` [PATCH 06/18] io_uring: add fsync support Jens Axboe
2019-01-28 21:35 ` [PATCH 07/18] io_uring: support for IO polling Jens Axboe
2019-01-29 17:24 ` Christoph Hellwig
2019-01-29 18:31 ` Jens Axboe
2019-01-29 19:10 ` Jens Axboe
2019-01-29 20:35 ` Jeff Moyer
2019-01-29 20:37 ` Jens Axboe
2019-01-28 21:35 ` [PATCH 08/18] fs: add fget_many() and fput_many() Jens Axboe
2019-01-28 21:35 ` [PATCH 09/18] io_uring: use fget/fput_many() for file references Jens Axboe
2019-01-28 21:56 ` Jann Horn
2019-01-28 22:03 ` Jens Axboe
2019-01-28 21:35 ` [PATCH 10/18] io_uring: batch io_kiocb allocation Jens Axboe
2019-01-29 17:26 ` Christoph Hellwig
2019-01-29 18:14 ` Jens Axboe
2019-01-28 21:35 ` [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio Jens Axboe
2019-01-28 21:35 ` [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers Jens Axboe
2019-01-28 23:35 ` Jann Horn
2019-01-28 23:50 ` Jens Axboe
2019-01-29 0:36 ` Jann Horn
2019-01-29 1:25 ` Jens Axboe
2019-01-28 21:35 ` [PATCH 13/18] io_uring: add file set registration Jens Axboe
2019-01-28 21:35 ` [PATCH 14/18] io_uring: add submission polling Jens Axboe
2019-01-28 21:35 ` [PATCH 15/18] io_uring: add io_kiocb ref count Jens Axboe
2019-01-29 17:26 ` Christoph Hellwig
2019-01-28 21:35 ` [PATCH 16/18] io_uring: add support for IORING_OP_POLL Jens Axboe
2019-01-28 21:35 ` [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests Jens Axboe
2019-01-28 21:35 ` [PATCH 18/18] io_uring: add io_uring_event cache hit information Jens Axboe
[not found] <20190123153536.7081-1-axboe@kernel.dk>
[not found] ` <20190123153536.7081-6-axboe@kernel.dk>
2019-01-28 14:57 ` [PATCH 05/18] Add io_uring IO interface Christoph Hellwig
2019-01-28 16:26 ` Jens Axboe
2019-01-28 16:34 ` Christoph Hellwig
2019-01-28 19:32 ` Jens Axboe
2019-01-28 18:25 ` Jens Axboe
2019-01-29 6:30 ` Christoph Hellwig
2019-01-29 11:58 ` Arnd Bergmann
2019-01-29 15:20 ` Jens Axboe
2019-01-29 16:18 ` Arnd Bergmann
2019-01-29 16:19 ` Jens Axboe
2019-01-29 16:26 ` Arnd Bergmann
2019-01-29 16:28 ` Jens Axboe
2019-01-29 16:46 ` Arnd Bergmann
2019-01-29 0:47 ` Andy Lutomirski
2019-01-29 1:20 ` Jens Axboe
2019-01-29 6:45 ` Christoph Hellwig
2019-01-29 12:05 ` Arnd Bergmann
2019-01-31 5:11 ` Andy Lutomirski
2019-01-31 16:37 ` Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=05cb18f7a97a6151c305cdb7240c4abc995aed59.camel@fb.com \
--to=mmullins@fb.com \
--cc=avi@scylladb.com \
--cc=axboe@kernel.dk \
--cc=hch@lst.de \
--cc=jannh@google.com \
--cc=jmoyer@redhat.com \
--cc=linux-aio@kvack.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-block@vger.kernel.org \
--cc=linux-man@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).