From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jann Horn
Subject: Re: [PATCH 05/18] Add io_uring IO interface
Date: Fri, 1 Feb 2019 18:23:27 +0100
Message-ID:
References: <20190128213538.13486-1-axboe@kernel.dk> <20190128213538.13486-6-axboe@kernel.dk> <05cb18f7a97a6151c305cdb7240c4abc995aed59.camel@fb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Return-path:
In-Reply-To:
Sender: owner-linux-aio@kvack.org
To: Matt Mullins, "viro@zeniv.linux.org.uk", "axboe@kernel.dk", linux-fsdevel@vger.kernel.org
Cc: "linux-aio@kvack.org", "linux-block@vger.kernel.org", "jmoyer@redhat.com", "linux-api@vger.kernel.org", "hch@lst.de", "linux-man@vger.kernel.org", "avi@scylladb.com"
List-Id: linux-man@vger.kernel.org

On Fri, Feb 1, 2019 at 6:04 PM Jann Horn wrote:
>
> On Fri, Feb 1, 2019 at 5:57 PM Matt Mullins wrote:
> > On Tue, 2019-01-29 at 00:59 +0100, Jann Horn wrote:
> > > On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe wrote:
> > > > On 1/28/19 3:32 PM, Jann Horn wrote:
> > > > > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe wrote:
> > > > > > The submission queue (SQ) and completion queue (CQ) rings are shared
> > > > > > between the application and the kernel. This eliminates the need to
> > > > > > copy data back and forth to submit and complete IO.
> > > > > >
> > > > > > IO submissions use the io_uring_sqe data structure, and completions
> > > > > > are generated in the form of io_uring_cqe data structures. The SQ
> > > > > > ring is an index into the io_uring_sqe array, which makes it possible
> > > > > > to submit a batch of IOs without them being contiguous in the ring.
> > > > > > The CQ ring is always contiguous, as completion events are inherently
> > > > > > unordered, and hence any io_uring_cqe entry can point back to an
> > > > > > arbitrary submission.
> > > > > >
> > > > > > Two new system calls are added for this:
> > > > > >
> > > > > > io_uring_setup(entries, params)
> > > > > >         Sets up a context for doing async IO. On success, returns a file
> > > > > >         descriptor that the application can mmap to gain access to the
> > > > > >         SQ ring, CQ ring, and io_uring_sqes.
> > > > > >
> > > > > > io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
> > > > > >         Initiates IO against the rings mapped to this fd, or waits for
> > > > > >         them to complete, or both. The behavior is controlled by the
> > > > > >         parameters passed in. If 'to_submit' is non-zero, then we'll
> > > > > >         try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> > > > > >         kernel will wait for 'min_complete' events, if they aren't
> > > > > >         already available. It's valid to set IORING_ENTER_GETEVENTS
> > > > > >         and 'min_complete' == 0 at the same time, this allows the
> > > > > >         kernel to return already completed events without waiting
> > > > > >         for them. This is useful only for polling, as for IRQ
> > > > > >         driven IO, the application can just check the CQ ring
> > > > > >         without entering the kernel.
> > > > > >
> > > > > > With this setup, it's possible to do async IO with a single system
> > > > > > call. Future developments will enable polled IO with this interface,
> > > > > > and polled submission as well. The latter will enable an application
> > > > > > to do IO without doing ANY system calls at all.
> > > > > >
> > > > > > For IRQ driven IO, an application only needs to enter the kernel for
> > > > > > completions if it wants to wait for them to occur.
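
For readers following along, the userspace flow for the two syscalls
described above looks roughly like the untested sketch below. It assumes
the uapi definitions added by this patch set (struct io_uring_params,
IORING_ENTER_GETEVENTS); syscall number 425 for io_uring_setup is the one
quoted later in this thread, 426 for io_uring_enter is an assumption, and
the mmap of the rings plus filling in an sqe is elided.

/* Sketch only: set up a ring and do one submit-and-wait round trip. */
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>     /* uapi header added by this patch set */

int setup_and_wait(void)
{
        struct io_uring_params p;
        int fd, ret;

        memset(&p, 0, sizeof(p));
        fd = syscall(425 /* io_uring_setup */, 16, &p);
        if (fd < 0)
                return -1;

        /* ... mmap the SQ/CQ rings and sqe array via fd, queue one sqe ... */

        /* submit one sqe and wait for at least one completion in one call */
        ret = syscall(426 /* io_uring_enter */, fd, 1, 1,
                      IORING_ENTER_GETEVENTS, NULL, 0);
        return ret;
}
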
> > > > > >
> > > > > > Each io_uring is backed by a workqueue, to support buffered async IO
> > > > > > as well. We will only punt to an async context if the command would
> > > > > > need to wait for IO on the device side. Any data that can be accessed
> > > > > > directly in the page cache is done inline. This avoids the slowness
> > > > > > issue of usual threadpools, since cached data is accessed as quickly
> > > > > > as a sync interface.
> > > > > >
> > > > > > Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c
> > > > >
> > > > > [...]
> > > > > > +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > > > +                     bool force_nonblock)
> > > > > > +{
> > > > > > +       struct kiocb *kiocb = &req->rw;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       kiocb->ki_filp = fget(sqe->fd);
> > > > > > +       if (unlikely(!kiocb->ki_filp))
> > > > > > +               return -EBADF;
> > > > > > +       kiocb->ki_pos = sqe->off;
> > > > > > +       kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
> > > > > > +       kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
> > > > > > +       if (sqe->ioprio) {
> > > > > > +               ret = ioprio_check_cap(sqe->ioprio);
> > > > > > +               if (ret)
> > > > > > +                       goto out_fput;
> > > > > > +
> > > > > > +               kiocb->ki_ioprio = sqe->ioprio;
> > > > > > +       } else
> > > > > > +               kiocb->ki_ioprio = get_current_ioprio();
> > > > > > +
> > > > > > +       ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
> > > > > > +       if (unlikely(ret))
> > > > > > +               goto out_fput;
> > > > > > +       if (force_nonblock) {
> > > > > > +               kiocb->ki_flags |= IOCB_NOWAIT;
> > > > > > +               req->flags |= REQ_F_FORCE_NONBLOCK;
> > > > > > +       }
> > > > > > +       if (kiocb->ki_flags & IOCB_HIPRI) {
> > > > > > +               ret = -EINVAL;
> > > > > > +               goto out_fput;
> > > > > > +       }
> > > > > > +
> > > > > > +       kiocb->ki_complete = io_complete_rw;
> > > > > > +       return 0;
> > > > > > +out_fput:
> > > > > > +       fput(kiocb->ki_filp);
> > > > > > +       return ret;
> > > > > > +}
> > > > > [...]
> > > > > > +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> > > > > > +                      bool force_nonblock)
> > > > > > +{
> > > > > > +       struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
> > > > > > +       struct kiocb *kiocb = &req->rw;
> > > > > > +       struct iov_iter iter;
> > > > > > +       struct file *file;
> > > > > > +       ssize_t ret;
> > > > > > +
> > > > > > +       ret = io_prep_rw(req, sqe, force_nonblock);
> > > > > > +       if (ret)
> > > > > > +               return ret;
> > > > > > +       file = kiocb->ki_filp;
> > > > > > +
> > > > > > +       ret = -EBADF;
> > > > > > +       if (unlikely(!(file->f_mode & FMODE_READ)))
> > > > > > +               goto out_fput;
> > > > > > +       ret = -EINVAL;
> > > > > > +       if (unlikely(!file->f_op->read_iter))
> > > > > > +               goto out_fput;
> > > > > > +
> > > > > > +       ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
> > > > > > +       if (ret)
> > > > > > +               goto out_fput;
> > > > > > +
> > > > > > +       ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
> > > > > > +       if (!ret) {
> > > > > > +               ssize_t ret2;
> > > > > > +
> > > > > > +               /* Catch -EAGAIN return for forced non-blocking submission */
> > > > > > +               ret2 = call_read_iter(file, kiocb, &iter);
> > > > > > +               if (!force_nonblock || ret2 != -EAGAIN)
> > > > > > +                       io_rw_done(kiocb, ret2);
> > > > > > +               else
> > > > > > +                       ret = -EAGAIN;
> > > > > > +       }
> > > > > > +       kfree(iovec);
> > > > > > +out_fput:
> > > > > > +       if (unlikely(ret))
> > > > > > +               fput(file);
> > > > > > +       return ret;
> > > > > > +}
> > > > >
> > > > > [...]
> > > > > > +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > > > > +                          struct sqe_submit *s, bool force_nonblock)
> > > > > > +{
> > > > > > +       const struct io_uring_sqe *sqe = s->sqe;
> > > > > > +       ssize_t ret;
> > > > > > +
> > > > > > +       if (unlikely(s->index >= ctx->sq_entries))
> > > > > > +               return -EINVAL;
> > > > > > +       req->user_data = sqe->user_data;
> > > > > > +
> > > > > > +       ret = -EINVAL;
> > > > > > +       switch (sqe->opcode) {
> > > > > > +       case IORING_OP_NOP:
> > > > > > +               ret = io_nop(req, sqe);
> > > > > > +               break;
> > > > > > +       case IORING_OP_READV:
> > > > > > +               ret = io_read(req, sqe, force_nonblock);
> > > > > > +               break;
> > > > > > +       case IORING_OP_WRITEV:
> > > > > > +               ret = io_write(req, sqe, force_nonblock);
> > > > > > +               break;
> > > > > > +       default:
> > > > > > +               ret = -EINVAL;
> > > > > > +               break;
> > > > > > +       }
> > > > > > +
> > > > > > +       return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static void io_sq_wq_submit_work(struct work_struct *work)
> > > > > > +{
> > > > > > +       struct io_kiocb *req = container_of(work, struct io_kiocb, work);
> > > > > > +       struct sqe_submit *s = &req->submit;
> > > > > > +       u64 user_data = s->sqe->user_data;
> > > > > > +       struct io_ring_ctx *ctx = req->ctx;
> > > > > > +       mm_segment_t old_fs = get_fs();
> > > > > > +       struct files_struct *old_files;
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       /* Ensure we clear previously set forced non-block flag */
> > > > > > +       req->flags &= ~REQ_F_FORCE_NONBLOCK;
> > > > > > +
> > > > > > +       old_files = current->files;
> > > > > > +       current->files = ctx->sqo_files;
> > > > >
> > > > > I think you're not supposed to twiddle with current->files without
> > > > > holding task_lock(current).
> > > >
> > > > 'current' is the work queue item in this case, do we need to protect
> > > > against anything else? I can add the locking around the assignments
> > > > (both places).
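
For reference, the locking being discussed here would presumably look
something like the sketch below. io_swap_files() is a hypothetical helper
name, not something from the posted patch; task_lock()/task_unlock() (from
<linux/sched/task.h>) take the same lock that get_files_struct() holds
while it reads task->files.

/*
 * Untested sketch: switch current->files under task_lock(), as discussed
 * above.  The posted patch does the two assignments directly in
 * io_sq_wq_submit_work(); the helper is only for illustration.
 */
static struct files_struct *io_swap_files(struct files_struct *new_files)
{
        struct files_struct *old_files;

        task_lock(current);
        old_files = current->files;
        current->files = new_files;
        task_unlock(current);

        return old_files;
}

io_sq_wq_submit_work() would then do old_files = io_swap_files(ctx->sqo_files)
on entry and io_swap_files(old_files) on the way out.
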
> > > Stuff like proc_fd_link() uses get_files_struct(), which grabs a
> > > reference to your current files_struct protected only by task_lock();
> > > and it doesn't use anything like READ_ONCE(), so even if the object
> > > lifetime is not a problem, get_files_struct() could potentially crash
> > > due to a double-read (reading task->files twice and assuming that the
> > > result will be the same). As far as I can tell, this procfs code also
> > > works on kernel threads.
> > >
> > > > > > +       if (!mmget_not_zero(ctx->sqo_mm)) {
> > > > > > +               ret = -EFAULT;
> > > > > > +               goto err;
> > > > > > +       }
> > > > > > +
> > > > > > +       use_mm(ctx->sqo_mm);
> > > > > > +       set_fs(USER_DS);
> > > > > > +
> > > > > > +       ret = __io_submit_sqe(ctx, req, s, false);
> > > > > > +
> > > > > > +       set_fs(old_fs);
> > > > > > +       unuse_mm(ctx->sqo_mm);
> > > > > > +       mmput(ctx->sqo_mm);
> > > > > > +err:
> > > > > > +       if (ret) {
> > > > > > +               io_cqring_add_event(ctx, user_data, ret, 0);
> > > > > > +               io_free_req(req);
> > > > > > +       }
> > > > > > +       current->files = old_files;
> > > > > > +}
> > > > >
> > > > > [...]
> > > > > > +static int io_sq_offload_start(struct io_ring_ctx *ctx)
> > > > > > +{
> > > > > > +       int ret;
> > > > > > +
> > > > > > +       ctx->sqo_mm = current->mm;
> > > > >
> > > > > What keeps this thing alive?
> > > >
> > > > I think we're dealing with the same thing as the files below, I'll
> > > > defer to that.
> > > >
> > > > > > +       /*
> > > > > > +        * This is safe since 'current' has the fd installed, and if that gets
> > > > > > +        * closed on exit, then fops->release() is invoked which waits for the
> > > > > > +        * async contexts to flush and exit before exiting.
> > > > > > +        */
> > > > > > +       ret = -EBADF;
> > > > > > +       ctx->sqo_files = current->files;
> > > > > > +       if (!ctx->sqo_files)
> > > > > > +               goto err;
> > > > >
> > > > > That's gnarly. Adding Al Viro to the thread.
> > > > >
> > > > > I think you misunderstand the semantics of f_op->release. The ->flush
> > > > > handler is invoked whenever a file descriptor is closed through
> > > > > filp_close() (via deletion of the files_struct, sys_close(),
> > > > > sys_dup2(), ...), so if you had used that one, _maybe_ this would
> > > > > work. But the ->release handler only runs when the _last_ reference to
> > > > > a struct file has been dropped - so you can, for example, fork() a
> > > > > child, then exit() in the parent, and the ->release handler isn't
> > > > > invoked. So I don't see how this can work.
> > > >
> > > > The anonfd is CLOEXEC. The idea is exactly that it only runs when the
> > > > last reference to the file has been dropped. Not sure why you think I
> > > > need ->flush() here?
> > >
> > > Can't I just use fcntl(fd, F_SETFD, 0) to clear the CLOEXEC flag?
> > > Or send the fd via SCM_RIGHTS?
> > >
> > > > > But even if you had abused ->flush for this instead: close_files()
> > > > > currently has a comment in it that claims that "this is the last
> > > > > reference to the files structure"; this change would make that claim
> > > > > untrue.
> > > >
> > > > Let me see if I can explain my intent better than that comment... We
> > > > know the parent who set up the io_uring instance will be around for as
> > > > long as the io_uring instance persists.
> > >
> > > That's the part that I think is wrong: As far as I can tell, the
> > > parent can go away and you won't notice.
> > >
> > > Also, note that "the parent" is different things for ->files and ->mm.
> > > You can have a multithreaded process whose threads don't have the same
> > > ->files, or multiple processes that share ->files without sharing ->mm,
> > > ...
> >
> > This had actually been get_files_struct() in early versions, and I had
> > reported to Jens that it allows something like
> >
> > int main() {
> >         struct io_uring_params uring_params = {
> >                 .flags = IORING_SETUP_SQPOLL,
> >         };
> >         int uring_fd = syscall(425 /* io_uring_setup */, 16, &uring_params);
> > }
> >
> > to leak both the files_struct and the kthread, as the files_struct and
> > the uring context form a circular reference. I haven't really come up
> > with a good way to reconcile the requirements here; perhaps we need an
> > exit_uring() akin to exit_aio()?
>
> Oh, yuck. Uuuh... can we make "struct files_struct" doubly-refcounted,
> like "struct mm_struct"? One reference type to keep the contents
> intact (the reference type you normally use, and the type used by
> uring when the thread is running), and one reference type to just keep
> the struct itself existing, but without preserving its contents
> (reference held consistently by the uring thread)?

Something like this (completely untested); and then instead of the
current get_files_struct(), you'd do get_files_struct_weak(), and while
the thread is running, it protects the files_struct from dying with
tryget_weak_files_struct() / put_files_struct().

Al, do you have opinions on this?

===============
diff --git a/fs/file.c b/fs/file.c
index 3209ee271c41..fbf02ef2753d 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -281,6 +281,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 	if (!newf)
 		goto out;
 
+	kref_init(&newf->weak_refs);
 	atomic_set(&newf->count, 1);
 
 	spin_lock_init(&newf->file_lock);
@@ -410,6 +411,26 @@ struct files_struct *get_files_struct(struct task_struct *task)
 	return files;
 }
 
+static void free_files_struct(struct kref *ref) {
+	struct files_struct *files =
+		container_of(ref, struct files_struct, weak_refs);
+	kmem_cache_free(files_cachep, files);
+}
+
+void put_files_struct_weak(struct files_struct *files) {
+	kref_put(&files->weak_refs, free_files_struct);
+}
+
+struct files_struct *get_files_struct_weak(struct task_struct *task)
+{
+	struct files_struct *files = get_files_struct(task);
+	if (files) {
+		kref_get(&files->weak_refs);
+		put_files_struct(files);
+	}
+	return files;
+}
+
 void put_files_struct(struct files_struct *files)
 {
 	if (atomic_dec_and_test(&files->count)) {
@@ -418,10 +439,17 @@ void put_files_struct(struct files_struct *files)
 		/* free the arrays if they are not embedded */
 		if (fdt != &files->fdtab)
 			__free_fdtable(fdt);
-		kmem_cache_free(files_cachep, files);
+		put_files_struct_weak(files);
 	}
 }
 
+struct files_struct *tryget_weak_files_struct(struct files_struct *fs) {
+	if (atomic_inc_not_zero(&fs->count)) {
+		return fs;
+	}
+	return NULL;
+}
+
 void reset_files_struct(struct files_struct *files)
 {
 	struct task_struct *tsk = current;
@@ -448,6 +476,7 @@ void exit_files(struct task_struct *tsk)
 
 struct files_struct init_files = {
 	.count		= ATOMIC_INIT(1),
+	.weak_refs	= KREF_INIT(1),
 	.fdt		= &init_files.fdtab,
 	.fdtab		= {
 		.max_fds	= NR_OPEN_DEFAULT,
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..6ad95a95cc0b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -14,6 +14,7 @@
 #include <linux/types.h>
 #include <linux/init.h>
 #include <linux/fs.h>
+#include <linux/kref.h>
 
 #include <linux/atomic.h>
 
@@ -50,6 +51,7 @@ struct files_struct {
   * read mostly part
   */
 	atomic_t count;
+	struct kref weak_refs;
 	bool resize_in_progress;
 	wait_queue_head_t resize_wait;
@@ -107,6 +109,9 @@ struct task_struct;
 
 struct files_struct *get_files_struct(struct task_struct *);
 void put_files_struct(struct files_struct *fs);
+void put_files_struct_weak(struct files_struct *files);
+struct files_struct *get_files_struct_weak(struct task_struct *);
+struct files_struct *tryget_weak_files_struct(struct files_struct *);
 void reset_files_struct(struct files_struct *);
 int unshare_files(struct files_struct **);
 struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
===============
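
The io_uring side would then use the API above roughly as follows; this is
equally untested, just a sketch of how it would plug into the patch's
io_sq_offload_start() and io_sq_wq_submit_work(), with error handling and
the put_files_struct_weak() on ring teardown elided.

        /* io_sq_offload_start(): take only a weak reference at setup time,
         * so the ring keeps the files_struct allocation alive without
         * pinning its contents (avoiding the fd <-> files_struct cycle
         * Matt describes). */
        ctx->sqo_files = get_files_struct_weak(current);
        if (!ctx->sqo_files)
                goto err;

        /* io_sq_wq_submit_work(): upgrade to a full reference only while
         * the work item runs; if the submitter's files_struct is already
         * gone, fail the request instead of touching freed contents. */
        if (!tryget_weak_files_struct(ctx->sqo_files)) {
                ret = -EBADF;
                goto err;
        }
        task_lock(current);
        old_files = current->files;
        current->files = ctx->sqo_files;
        task_unlock(current);

        /* ... __io_submit_sqe() etc. ... */

        task_lock(current);
        current->files = old_files;
        task_unlock(current);
        put_files_struct(ctx->sqo_files);       /* drop the temporary full ref */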