From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_PASS,URIBL_BLOCKED,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A9295C169C4 for ; Fri, 8 Feb 2019 20:26:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 5781920844 for ; Fri, 8 Feb 2019 20:26:29 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="ApK/nDdo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727106AbfBHU02 (ORCPT ); Fri, 8 Feb 2019 15:26:28 -0500 Received: from mail-ot1-f66.google.com ([209.85.210.66]:41176 "EHLO mail-ot1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726211AbfBHU02 (ORCPT ); Fri, 8 Feb 2019 15:26:28 -0500 Received: by mail-ot1-f66.google.com with SMTP id u16so7989024otk.8 for ; Fri, 08 Feb 2019 12:26:27 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=WQ1Uiw7yG7bX1y/wPZuih6fVwfW+HixcxpxeyUyTJak=; b=ApK/nDdo4fJvBZQeRAJy3NYJ+9gMCprJgp3pGfTai1K2WeVBFHpePKCZoLAjm/pUeq n0s++vpQPM7OK9/zlLAeh70yKGlI/wFqBb0Gj7PD4nchKpdO3gYXwX0vsKm+QO4Ei5Re kMW/m0LleSCfdEpVKdKEdJMa3SeqeT9dK0EJZmadakq337nbmoHaB3wV5Jg1il4kl5BS Saa7AkVhAqYKmdKHlHC9BRSlE7mGuLc6oRCLVsnc/7Oa9w+FYICgvQbC2irNx4tFuVvI v/hURkq147/SSAkL4fHw6NFWMzJZ3S0E41Z+RWaGxCMlEvTe+WbgbIt5uiMtW7PR6/01 5myA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=WQ1Uiw7yG7bX1y/wPZuih6fVwfW+HixcxpxeyUyTJak=; b=JpLCxtO9urqX83qQCz8YqPbHPhZhqa947Gix/JNC6bbfsSMinzeX5pCZqomxjhBCgU PBLHRPsbBQQx6VYKWxBMx8XJFsFzjya3rpnAI5AyqsNDGJE+j2ep0nowis5lRJN8U3EK 5Ks3encmw1IZ6cCV91GAuvSc2uRkL7jaINA45++EJdAFeEcTwDpty9CpRP81apv0Op5M 4rDSFldIalEeGdfXY4wL6uxo05Jccwx53l3RGH6xwa0LgdRqRKnWkxZwgvPhD6v0TDFb VaH0+OAbSvDV2l7Mi86921g3///X8C5r+dvM/7VraaFI4vEoK2/kQL8o8C17TlMEzlNB zgvQ== X-Gm-Message-State: AHQUAuZ3pP9i4/+HIW+OFmvybm6B/Z2xFA+PcBY6Iwzdqam34HHBsGHJ deYKrUoAhbNmur1ffnOlklJeSPfQX2ZZ94HZz+4oYADu4ZIKyg== X-Google-Smtp-Source: AHgI3IZA2d0EMgmZ1T2+ZyDk+0AJt6Je5t5aHSTt50Wbkus0erVY6Y5yd7wscP/LAnoixp1vZeU/9s+8fD72hnhTchc= X-Received: by 2002:a9d:638f:: with SMTP id w15mr14250777otk.230.1549657587054; Fri, 08 Feb 2019 12:26:27 -0800 (PST) MIME-Version: 1.0 References: <20190208173423.27014-1-axboe@kernel.dk> <20190208173423.27014-15-axboe@kernel.dk> In-Reply-To: <20190208173423.27014-15-axboe@kernel.dk> From: Jann Horn Date: Fri, 8 Feb 2019 21:26:01 +0100 Message-ID: Subject: Re: [PATCH 14/19] io_uring: add file set registration To: Jens Axboe Cc: linux-aio@kvack.org, linux-block@vger.kernel.org, Linux API , hch@lst.de, jmoyer@redhat.com, Avi Kivity , Al Viro Content-Type: text/plain; charset="UTF-8" Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Fri, Feb 8, 2019 at 6:35 PM Jens Axboe wrote: > We normally have to fget/fput for each IO we do on a file. Even with > the batching we do, the cost of the atomic inc/dec of the file usage > count adds up. > > This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes > for the io_uring_register(2) system call. The arguments passed in must > be an array of __s32 holding file descriptors, and nr_args should hold > the number of file descriptors the application wishes to pin for the > duration of the io_uring instance (or until IORING_UNREGISTER_FILES is > called). > > When used, the application must set IOSQE_FIXED_FILE in the sqe->flags > member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd > to the index in the array passed in to IORING_REGISTER_FILES. > > Files are automatically unregistered when the io_uring instance is torn > down. An application need only unregister if it wishes to register a new > set of fds. I think the overall concept here is still broken: You're giving the user_files to the GC, and I think the GC can drop their refcounts, but I don't see you actually getting feedback from the GC anywhere that would let the GC break your references? E.g. in io_prep_rw() you grab file pointers from ctx->user_files after simply checking ctx->nr_user_files, and there is no path from the GC that touches those fields. As far as I can tell, the GC is just going to go through unix_destruct_scm() and drop references on your files, causing use-after-free. But the unix GC is complicated, and maybe I'm just missing something... > +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx) > +{ > +#if defined(CONFIG_UNIX) > + if (ctx->ring_sock) { > + struct sock *sock = ctx->ring_sock->sk; > + struct sk_buff *skb; > + > + while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL) > + kfree_skb(skb); > + } > +#else > + int i; > + > + for (i = 0; i < ctx->nr_user_files; i++) > + fput(ctx->user_files[i]); > +#endif > +} > + > +static int io_sqe_files_unregister(struct io_ring_ctx *ctx) > +{ > + if (!ctx->user_files) > + return -ENXIO; > + > + __io_sqe_files_unregister(ctx); > + kfree(ctx->user_files); > + ctx->user_files = NULL; > + return 0; > +} > + > +#if defined(CONFIG_UNIX) > +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset) > +{ > + struct scm_fp_list *fpl; > + struct sk_buff *skb; > + int i; > + > + fpl = kzalloc(sizeof(*fpl), GFP_KERNEL); > + if (!fpl) > + return -ENOMEM; > + > + skb = alloc_skb(0, GFP_KERNEL); > + if (!skb) { > + kfree(fpl); > + return -ENOMEM; > + } > + > + skb->sk = ctx->ring_sock->sk; > + skb->destructor = unix_destruct_scm; > + > + fpl->user = get_uid(ctx->user); > + for (i = 0; i < nr; i++) { > + fpl->fp[i] = get_file(ctx->user_files[i + offset]); > + unix_inflight(fpl->user, fpl->fp[i]); > + fput(fpl->fp[i]); This pattern is almost always superfluous. You increment the file's refcount, maybe insert the file into a list (essentially), and drop the file's refcount back down. You're already holding a stable reference, and you're not temporarily lending that to anyone else. > + } > + > + fpl->max = fpl->count = nr; > + UNIXCB(skb).fp = fpl; > + skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb); > + return 0; > +} > + > +/* > + * If UNIX sockets are enabled, fd passing can cause a reference cycle which > + * causes regular reference counting to break down. We rely on the UNIX > + * garbage collection to take care of this problem for us. > + */ > +static int io_sqe_files_scm(struct io_ring_ctx *ctx) > +{ > + unsigned left, total; > + int ret = 0; > + > + total = 0; > + left = ctx->nr_user_files; > + while (left) { > + unsigned this_files = min_t(unsigned, left, SCM_MAX_FD); > + int ret; > + > + ret = __io_sqe_files_scm(ctx, this_files, total); > + if (ret) > + break; > + left -= this_files; > + total += this_files; > + } > + > + return ret; > +} > +#else > +static int io_sqe_files_scm(struct io_ring_ctx *ctx) > +{ > + return 0; > +} > +#endif > + > +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, > + unsigned nr_args) > +{ > + __s32 __user *fds = (__s32 __user *) arg; > + int fd, ret = 0; > + unsigned i; > + > + if (ctx->user_files) > + return -EBUSY; > + if (!nr_args) > + return -EINVAL; > + if (nr_args > IORING_MAX_FIXED_FILES) > + return -EMFILE; > + > + ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL); > + if (!ctx->user_files) > + return -ENOMEM; > + > + for (i = 0; i < nr_args; i++) { > + ret = -EFAULT; > + if (copy_from_user(&fd, &fds[i], sizeof(fd))) > + break; > + > + ctx->user_files[i] = fget(fd); > + > + ret = -EBADF; > + if (!ctx->user_files[i]) > + break; > + /* > + * Don't allow io_uring instances to be registered. If UNIX > + * isn't enabled, then this causes a reference cycle and this > + * instance can never get freed. If UNIX is enabled we'll > + * handle it just fine, but there's still no point in allowing > + * a ring fd as it doesn't suppor regular read/write anyway. nit: support > + */ > + if (ctx->user_files[i]->f_op == &io_uring_fops) { > + fput(ctx->user_files[i]); > + break; > + } > + ctx->nr_user_files++; > + ret = 0; > + } > + > + if (!ret) > + ret = io_sqe_files_scm(ctx); > + if (ret) > + io_sqe_files_unregister(ctx); > + > + return ret; > +} > + > static int io_sq_offload_start(struct io_ring_ctx *ctx) > { > int ret; > @@ -1521,14 +1708,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) > destroy_workqueue(ctx->sqo_wq); > if (ctx->sqo_mm) > mmdrop(ctx->sqo_mm); > + > + io_iopoll_reap_events(ctx); > + io_sqe_buffer_unregister(ctx); > + io_sqe_files_unregister(ctx); > + > #if defined(CONFIG_UNIX) > if (ctx->ring_sock) > sock_release(ctx->ring_sock); > #endif > > - io_iopoll_reap_events(ctx); > - io_sqe_buffer_unregister(ctx);