Linux-Block Archive on lore.kernel.org
From: Jann Horn <jannh@google.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: linux-aio@kvack.org, linux-block@vger.kernel.org,
	Linux API <linux-api@vger.kernel.org>,
	hch@lst.de, jmoyer@redhat.com, Avi Kivity <avi@scylladb.com>,
	Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 14/19] io_uring: add file set registration
Date: Tue, 19 Feb 2019 17:12:31 +0100
Message-ID: <CAG48ez25XYh1-SRw7HG-RXOz9xVZ0EXCumT_Csw0ER62K4Q1qw@mail.gmail.com>
In-Reply-To: <20190211190049.7888-16-axboe@kernel.dk>

On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
> We normally have to fget/fput for each IO we do on a file. Even with
> the batching we do, the cost of the atomic inc/dec of the file usage
> count adds up.
>
> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
> for the io_uring_register(2) system call. The arguments passed in must
> be an array of __s32 holding file descriptors, and nr_args should hold
> the number of file descriptors the application wishes to pin for the
> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
> called).
>
> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
> to the index in the array passed in to IORING_REGISTER_FILES.
>
> Files are automatically unregistered when the io_uring instance is torn
> down. An application need only unregister if it wishes to register a new
> set of fds.
>
> Reviewed-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
[...]
> @@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>         return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>  }
>
> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +#if defined(CONFIG_UNIX)
> +       if (ctx->ring_sock) {
> +               struct sock *sock = ctx->ring_sock->sk;
> +               struct sk_buff *skb;
> +
> +               while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
> +                       kfree_skb(skb);
> +       }
> +#else
> +       int i;
> +
> +       for (i = 0; i < ctx->nr_user_files; i++)
> +               fput(ctx->user_files[i]);
> +#endif
> +}
> +
> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
> +{
> +       if (!ctx->user_files)
> +               return -ENXIO;
> +
> +       __io_sqe_files_unregister(ctx);
> +       kfree(ctx->user_files);
> +       ctx->user_files = NULL;
> +       return 0;
> +}
> +
> +#if defined(CONFIG_UNIX)
> +/*
> + * Ensure the UNIX gc is aware of our file set, so we are certain that
> + * the io_uring can be safely unregistered on process exit, even if we have
> + * loops in the file referencing.
> + */

I still don't get how this is supposed to work. Quoting from an
earlier version of the patch:

|> I think the overall concept here is still broken: You're giving the
|> user_files to the GC, and I think the GC can drop their refcounts, but
|> I don't see you actually getting feedback from the GC anywhere that
|> would let the GC break your references? E.g. in io_prep_rw() you grab
|> file pointers from ctx->user_files after simply checking
|> ctx->nr_user_files, and there is no path from the GC that touches
|> those fields. As far as I can tell, the GC is just going to go through
|> unix_destruct_scm() and drop references on your files, causing
|> use-after-free.
|>
|> But the unix GC is complicated, and maybe I'm just missing something...
|
| Only when the skb is released, which is either done when the io_uring
| is torn down (and then definitely safe), or if the socket is released,
| which is again also at a safe time.

I'll try to add inline comments on my understanding of the code, maybe
you can point out where exactly we're understanding it differently...

> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
> +{
> +       struct sock *sk = ctx->ring_sock->sk;
> +       struct scm_fp_list *fpl;
> +       struct sk_buff *skb;
> +       int i;
> +
> +       fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
> +       if (!fpl)
> +               return -ENOMEM;
> +
            // here we allocate a new `skb` with ->users==1
> +       skb = alloc_skb(0, GFP_KERNEL);
> +       if (!skb) {
> +               kfree(fpl);
> +               return -ENOMEM;
> +       }
> +
> +       skb->sk = sk;
            // set the skb's destructor, invoked when ->users drops to 0;
            // destructor drops file refcounts
> +       skb->destructor = unix_destruct_scm;
> +
> +       fpl->user = get_uid(ctx->user);
> +       for (i = 0; i < nr; i++) {
                    // grab a reference to each file for the skb
> +               fpl->fp[i] = get_file(ctx->user_files[i + offset]);
> +               unix_inflight(fpl->user, fpl->fp[i]);
> +       }
> +
> +       fpl->max = fpl->count = nr;
> +       UNIXCB(skb).fp = fpl;
> +       refcount_add(skb->truesize, &sk->sk_wmem_alloc);
            // put the skb in the sk_receive_queue, still with a refcount of 1.
> +       skb_queue_head(&sk->sk_receive_queue, skb);
> +
            // drop a reference from each file; after this, only the
            // skb owns references to files;
            // the ctx->user_files entries borrow their lifetime from the skb
> +       for (i = 0; i < nr; i++)
> +               fput(fpl->fp[i]);
> +
> +       return 0;
> +}

So let's say you have a cyclic dependency where an io_uring points to
a unix domain socket, and the unix domain socket points back at the
uring. The last reference from outside the loop goes away when the
user closes the uring's fd, but the uring's busypolling kernel thread
is still running and busypolling for new submission queue entries.

The GC can then come along and run scan_inflight(), detect that
ctx->ring_sock->sk->sk_receive_queue contains a reference to a unix
domain socket, and steal the skb (unlinking it from the ring_sock and
linking it into the hitlist):

__skb_unlink(skb, &x->sk_receive_queue);
__skb_queue_tail(hitlist, skb);

And then the hitlist will be processed by __skb_queue_purge(),
dropping the refcount of the skb from 1 to 0. At that point, the unix
domain socket can be freed, and you still have a pointer to it in
ctx->user_files.
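
To make the lifetime problem concrete, here is a minimal userspace sketch of
the same ownership pattern. It is not kernel code: toy_file, toy_skb,
toy_files_scm() and toy_gc_purge() are hypothetical stand-ins for struct file,
the skb carrying the scm_fp_list, __io_sqe_files_scm() and the GC purging the
hitlist. Release is recorded in a flag instead of calling free(), so the
dangling pointer can be observed safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX 4

/* Toy stand-in for struct file: a refcount plus a "released" marker. */
struct toy_file {
	int refcount;
	bool released;	/* set instead of freeing, so callers can inspect it */
};

static void toy_get(struct toy_file *f)
{
	f->refcount++;
}

static void toy_put(struct toy_file *f)
{
	if (--f->refcount == 0)
		f->released = true;
}

/* Toy stand-in for the skb that carries the file list. */
struct toy_skb {
	struct toy_file *fp[TOY_MAX];
	size_t count;
};

/*
 * Same shape as __io_sqe_files_scm(): the skb takes a reference on each
 * file, then the table's own reference is dropped. Afterwards the table
 * only borrows the files from the skb.
 */
static void toy_files_scm(struct toy_skb *skb, struct toy_file **table,
			  size_t nr)
{
	size_t i;

	assert(nr <= TOY_MAX);
	skb->count = nr;
	for (i = 0; i < nr; i++) {
		skb->fp[i] = table[i];
		toy_get(skb->fp[i]);
	}
	for (i = 0; i < nr; i++)
		toy_put(table[i]);
}

/* Models the GC purging the skb: its destructor drops the skb's references. */
static void toy_gc_purge(struct toy_skb *skb)
{
	size_t i;

	for (i = 0; i < skb->count; i++)
		toy_put(skb->fp[i]);
	skb->count = 0;
}
```

In this model, once toy_gc_purge() has run, the file is marked released while
the table entry still points at it. Nothing tells the table that its borrowed
reference is gone, which is the feedback path I can't find in the patch.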

> +
> +/*
> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
> + * causes regular reference counting to break down. We rely on the UNIX
> + * garbage collection to take care of this problem for us.
> + */
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +       unsigned left, total;
> +       int ret = 0;
> +
> +       total = 0;
> +       left = ctx->nr_user_files;
> +       while (left) {
> +               unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
> +               int ret;
> +
> +               ret = __io_sqe_files_scm(ctx, this_files, total);
> +               if (ret)
> +                       break;

If we bail out in the middle of translating the ->user_files here, we
have to make sure that we both destroy the already-created SKBs and
drop our references on the files we haven't dealt with yet.
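
A rough userspace sketch of the unwinding I have in mind, tracking only who
owns each file. Everything here is hypothetical: TOY_BATCH stands in for
SCM_MAX_FD, and fail_at_batch is a test knob that simulates an allocation
failure in a given batch. It is only meant to show the two separate cleanup
ranges, not how the kernel code should be structured.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH 2	/* stand-in for SCM_MAX_FD */

enum toy_owner { TOY_TABLE, TOY_SKB, TOY_NONE };

/*
 * Hand files over to "skbs" in batches. On failure, every file must end
 * up with no owner: nothing leaked and nothing dropped twice.
 */
static int toy_files_scm_all(enum toy_owner *owner, size_t n,
			     size_t fail_at_batch)
{
	size_t done = 0, batch = 0, i;

	while (done < n) {
		size_t this_n = (n - done < TOY_BATCH) ? n - done : TOY_BATCH;

		if (batch == fail_at_batch)
			goto unwind;
		for (i = 0; i < this_n; i++) {
			assert(owner[done + i] == TOY_TABLE);
			owner[done + i] = TOY_SKB;
		}
		done += this_n;
		batch++;
	}
	return 0;

unwind:
	/* Files [0, done) belong to skbs already queued: destroy those skbs. */
	for (i = 0; i < done; i++) {
		assert(owner[i] == TOY_SKB);
		owner[i] = TOY_NONE;
	}
	/* Files [done, n) were never handed over: drop the table's references. */
	for (i = done; i < n; i++) {
		assert(owner[i] == TOY_TABLE);
		owner[i] = TOY_NONE;
	}
	return -1;
}
```

The point is that the two ranges need different cleanup, so a single
"dequeue everything from the socket" or "fput everything in the array" pass
cannot cover both.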

> +               left -= this_files;
> +               total += this_files;
> +       }
> +
> +       return ret;
> +}
> +#else
> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
> +{
> +       return 0;
> +}
> +#endif
> +
> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
> +                                unsigned nr_args)
> +{
> +       __s32 __user *fds = (__s32 __user *) arg;
> +       int fd, ret = 0;
> +       unsigned i;
> +
> +       if (ctx->user_files)
> +               return -EBUSY;
> +       if (!nr_args)
> +               return -EINVAL;
> +       if (nr_args > IORING_MAX_FIXED_FILES)
> +               return -EMFILE;
> +
> +       ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
> +       if (!ctx->user_files)
> +               return -ENOMEM;
> +
> +       for (i = 0; i < nr_args; i++) {
> +               ret = -EFAULT;
> +               if (copy_from_user(&fd, &fds[i], sizeof(fd)))
> +                       break;
> +
> +               ctx->user_files[i] = fget(fd);
> +
> +               ret = -EBADF;
> +               if (!ctx->user_files[i])
> +                       break;

Let's say we hit this error condition after N successful loop
iterations, on a kernel with CONFIG_UNIX. At that point, we've filled
N file pointers into ctx->user_files[], and we've incremented
ctx->nr_user_files up to N. Now we jump to the `if (ret)` branch,
which goes into io_sqe_files_unregister(); but that's going to attempt
to dequeue inflight files from ctx->ring_sock, so that's not going to
work.

> +               /*
> +                * Don't allow io_uring instances to be registered. If UNIX
> +                * isn't enabled, then this causes a reference cycle and this
> +                * instance can never get freed. If UNIX is enabled we'll
> +                * handle it just fine, but there's still no point in allowing
> +                * a ring fd as it doesn't support regular read/write anyway.
> +                */
> +               if (ctx->user_files[i]->f_op == &io_uring_fops) {
> +                       fput(ctx->user_files[i]);
> +                       break;
> +               }
> +               ctx->nr_user_files++;

I don't see anything that can set ctx->nr_user_files back down to
zero; as far as I can tell, if you repeatedly register and unregister
a set of files, ctx->nr_user_files will just grow, and since it's used
as an upper bound for array accesses, that's bad.
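
As a userspace illustration of what I would expect, assuming the count is
meant to describe the currently registered array: toy_ctx,
toy_register_files() and toy_unregister_files() below are hypothetical
stand-ins, with plain ints in place of struct file pointers. The only line
that matters is the reset in the unregister path.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_ctx {
	int *user_files;		/* stand-in for struct file **user_files */
	unsigned nr_user_files;
};

static int toy_register_files(struct toy_ctx *ctx, const int *fds, unsigned nr)
{
	unsigned i;

	if (ctx->user_files)
		return -1;	/* already registered (-EBUSY in the patch) */
	if (!nr)
		return -1;	/* -EINVAL in the patch */

	/* A fresh registration has to start from an empty count. */
	assert(ctx->nr_user_files == 0);

	ctx->user_files = calloc(nr, sizeof(*ctx->user_files));
	if (!ctx->user_files)
		return -1;

	for (i = 0; i < nr; i++) {
		ctx->user_files[i] = fds[i];
		ctx->nr_user_files++;	/* same per-entry increment as the patch */
	}
	return 0;
}

static int toy_unregister_files(struct toy_ctx *ctx)
{
	if (!ctx->user_files)
		return -1;	/* -ENXIO in the patch */

	free(ctx->user_files);
	ctx->user_files = NULL;
	ctx->nr_user_files = 0;	/* the reset I can't find in the quoted patch */
	return 0;
}
```

Without that reset, registering three files, unregistering, and then
registering two would leave the count at five for a two-entry array in this
model.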

> +               ret = 0;
> +       }
> +
> +       if (!ret)
> +               ret = io_sqe_files_scm(ctx);
> +       if (ret)
> +               io_sqe_files_unregister(ctx);
> +
> +       return ret;
> +}

