From: Jens Axboe <axboe@kernel.dk>
To: Jann Horn <jannh@google.com>
Cc: linux-aio@kvack.org, linux-block@vger.kernel.org,
Linux API <linux-api@vger.kernel.org>,
hch@lst.de, jmoyer@redhat.com, Avi Kivity <avi@scylladb.com>,
Al Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH 14/19] io_uring: add file set registration
Date: Fri, 22 Feb 2019 15:29:17 -0700
Message-ID: <ebdd373d-41df-a8c4-9943-37f036d4504f@kernel.dk>
In-Reply-To: <CAG48ez25XYh1-SRw7HG-RXOz9xVZ0EXCumT_Csw0ER62K4Q1qw@mail.gmail.com>
On 2/19/19 9:12 AM, Jann Horn wrote:
> On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@kernel.dk> wrote:
>> We normally have to fget/fput for each IO we do on a file. Even with
>> the batching we do, the cost of the atomic inc/dec of the file usage
>> count adds up.
>>
>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>> for the io_uring_register(2) system call. The arguments passed in must
>> be an array of __s32 holding file descriptors, and nr_args should hold
>> the number of file descriptors the application wishes to pin for the
>> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
>> called).
>>
>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>> to the index in the array passed in to IORING_REGISTER_FILES.
>>
>> Files are automatically unregistered when the io_uring instance is torn
>> down. An application need only unregister if it wishes to register a new
>> set of fds.
>>
>> Reviewed-by: Hannes Reinecke <hare@suse.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
> [...]
>> @@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>> return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>> }
>>
>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +#if defined(CONFIG_UNIX)
>> + if (ctx->ring_sock) {
>> + struct sock *sock = ctx->ring_sock->sk;
>> + struct sk_buff *skb;
>> +
>> + while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
>> + kfree_skb(skb);
>> + }
>> +#else
>> + int i;
>> +
>> + for (i = 0; i < ctx->nr_user_files; i++)
>> + fput(ctx->user_files[i]);
>> +#endif
>> +}
>> +
>> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> + if (!ctx->user_files)
>> + return -ENXIO;
>> +
>> + __io_sqe_files_unregister(ctx);
>> + kfree(ctx->user_files);
>> + ctx->user_files = NULL;
>> + return 0;
>> +}
>> +
>> +#if defined(CONFIG_UNIX)
>> +/*
>> + * Ensure the UNIX gc is aware of our file set, so we are certain that
>> + * the io_uring can be safely unregistered on process exit, even if we have
>> + * loops in the file referencing.
>> + */
>
> I still don't get how this is supposed to work. Quoting from an
> earlier version of the patch:
>
> |> I think the overall concept here is still broken: You're giving the
> |> user_files to the GC, and I think the GC can drop their refcounts, but
> |> I don't see you actually getting feedback from the GC anywhere that
> |> would let the GC break your references? E.g. in io_prep_rw() you grab
> |> file pointers from ctx->user_files after simply checking
> |> ctx->nr_user_files, and there is no path from the GC that touches
> |> those fields. As far as I can tell, the GC is just going to go through
> |> unix_destruct_scm() and drop references on your files, causing
> |> use-after-free.
> |>
> |> But the unix GC is complicated, and maybe I'm just missing something...
> |
> | Only when the skb is released, which is either done when the io_uring
> | is torn down (and then definitely safe), or if the socket is released,
> | which is again also at a safe time.
>
> I'll try to add inline comments on my understanding of the code, maybe
> you can point out where exactly we're understanding it differently...
>
>> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
>> +{
>> + struct sock *sk = ctx->ring_sock->sk;
>> + struct scm_fp_list *fpl;
>> + struct sk_buff *skb;
>> + int i;
>> +
>> + fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
>> + if (!fpl)
>> + return -ENOMEM;
>> +
> // here we allocate a new `skb` with ->users==1
>> + skb = alloc_skb(0, GFP_KERNEL);
>> + if (!skb) {
>> + kfree(fpl);
>> + return -ENOMEM;
>> + }
>> +
>> + skb->sk = sk;
> // set the skb's destructor, invoked when ->users drops to 0;
> // destructor drops file refcounts
>> + skb->destructor = unix_destruct_scm;
>> +
>> + fpl->user = get_uid(ctx->user);
>> + for (i = 0; i < nr; i++) {
> // grab a reference to each file for the skb
>> + fpl->fp[i] = get_file(ctx->user_files[i + offset]);
>> + unix_inflight(fpl->user, fpl->fp[i]);
>> + }
>> +
>> + fpl->max = fpl->count = nr;
>> + UNIXCB(skb).fp = fpl;
>> + refcount_add(skb->truesize, &sk->sk_wmem_alloc);
> // put the skb in the sk_receive_queue, still with a refcount of 1.
>> + skb_queue_head(&sk->sk_receive_queue, skb);
>> +
> // drop a reference from each file; after this, only the
> // skb owns references to files;
> // the ctx->user_files entries borrow their lifetime from the skb
>> + for (i = 0; i < nr; i++)
>> + fput(fpl->fp[i]);
>> +
>> + return 0;
>> +}
>
> So let's say you have a cyclic dependency where an io_uring points to
> a unix domain socket, and the unix domain socket points back at the
> uring. The last reference from outside the loop goes away when the
> user closes the uring's fd, but the uring's busypolling kernel thread
> is still running and busypolling for new submission queue entries.
>
> The GC can then come along and run scan_inflight(), detect that
> ctx->ring_sock->sk->sk_receive_queue contains a reference to a unix
> domain socket, and steal the skb (unlinking it from the ring_sock and
> linking it into the hitlist):
>
> __skb_unlink(skb, &x->sk_receive_queue);
> __skb_queue_tail(hitlist, skb);
>
> And then the hitlist will be processed by __skb_queue_purge(),
> dropping the refcount of the skb from 1 to 0. At that point, the unix
> domain socket can be freed, and you still have a pointer to it in
> ctx->user_files.
I've fixed this for the sq offload case.
>> +
>> +/*
>> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
>> + * causes regular reference counting to break down. We rely on the UNIX
>> + * garbage collection to take care of this problem for us.
>> + */
>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>> +{
>> + unsigned left, total;
>> + int ret = 0;
>> +
>> + total = 0;
>> + left = ctx->nr_user_files;
>> + while (left) {
>> + unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
>> + int ret;
>> +
>> + ret = __io_sqe_files_scm(ctx, this_files, total);
>> + if (ret)
>> + break;
>
> If we bail out in the middle of translating the ->user_files here, we
> have to make sure that we both destroy the already-created SKBs and
> drop our references on the files we haven't dealt with yet.
Good catch, fixed.
>> + left -= this_files;
>> + total += this_files;
>> + }
>> +
>> + return ret;
>> +}
>> +#else
>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>> +{
>> + return 0;
>> +}
>> +#endif
>> +
>> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>> + unsigned nr_args)
>> +{
>> + __s32 __user *fds = (__s32 __user *) arg;
>> + int fd, ret = 0;
>> + unsigned i;
>> +
>> + if (ctx->user_files)
>> + return -EBUSY;
>> + if (!nr_args)
>> + return -EINVAL;
>> + if (nr_args > IORING_MAX_FIXED_FILES)
>> + return -EMFILE;
>> +
>> + ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
>> + if (!ctx->user_files)
>> + return -ENOMEM;
>> +
>> + for (i = 0; i < nr_args; i++) {
>> + ret = -EFAULT;
>> + if (copy_from_user(&fd, &fds[i], sizeof(fd)))
>> + break;
>> +
>> + ctx->user_files[i] = fget(fd);
>> +
>> + ret = -EBADF;
>> + if (!ctx->user_files[i])
>> + break;
>
> Let's say we hit this error condition after N successful loop
> iterations, on a kernel with CONFIG_UNIX. At that point, we've filled
> N file pointers into ctx->user_files[], and we've incremented
> ctx->nr_user_files up to N. Now we jump to the `if (ret)` branch,
> which goes into io_sqe_files_unregister(); but that's going to attempt
> to dequeue inflight files from ctx->ring_sock, so that's not going to
> work.
Fixed, thanks.
>> + /*
>> + * Don't allow io_uring instances to be registered. If UNIX
>> + * isn't enabled, then this causes a reference cycle and this
>> + * instance can never get freed. If UNIX is enabled we'll
>> + * handle it just fine, but there's still no point in allowing
>> + * a ring fd as it doesn't support regular read/write anyway.
>> + */
>> + if (ctx->user_files[i]->f_op == &io_uring_fops) {
>> + fput(ctx->user_files[i]);
>> + break;
>> + }
>> + ctx->nr_user_files++;
>
> I don't see anything that can set ctx->nr_user_files back down to
> zero; as far as I can tell, if you repeatedly register and unregister
> a set of files, ctx->nr_user_files will just grow, and since it's used
> as an upper bound for array accesses, that's bad.
Fixed that one earlier.
--
Jens Axboe