From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.4 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 52045CA9EAB for ; Fri, 18 Oct 2019 18:06:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 30894222C5 for ; Fri, 18 Oct 2019 18:06:55 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="l6O+OilV" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727054AbfJRSGx (ORCPT ); Fri, 18 Oct 2019 14:06:53 -0400 Received: from mail-oi1-f193.google.com ([209.85.167.193]:43724 "EHLO mail-oi1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2505658AbfJRSGx (ORCPT ); Fri, 18 Oct 2019 14:06:53 -0400 Received: by mail-oi1-f193.google.com with SMTP id t84so5932874oih.10 for ; Fri, 18 Oct 2019 11:06:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=N/e/Moa0ihvefz9yqg1t4ezlgVZNAXSin7nNZXCiNtE=; b=l6O+OilV7pYfr0d3OLsLunxuHvoVfIh4g28Q7bMQ3CIfn71OG2kjCu/B6xUe2eXJIm cWq/iVwwNDETgBwt81Z3JIt56pXupCy3t4HWAUFG9M40/ukJ7km001AeeDq0q9SDQeem jsadBSr48T6h8+5x9DZ00BrEyjDlPWHmUxdFm4dqjcTzxnpRSYpI99rgLZ+ciiKr+zFk BMwPtRz9ErwDygSDxV/+AXFqDTBCUfB7PBRWgQ1jcG4IzXb+i7qG7uCHXdc0sdjcVB78 /PDWuGJzMdVCurTF9PuvqwWybOZb2ZPK7fVJci8FEpyBaaOmsrx6EB5xF7c9LroXEoGt 38yg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=N/e/Moa0ihvefz9yqg1t4ezlgVZNAXSin7nNZXCiNtE=; b=FpxP/TbmIMpzqcuCirYdGUi2ulz9cbXbnnsc7SzKa8/eSBhcnzjnZyscN/FtMaT0V6 dSVs0iFvhU5U4swbM+715x9tvijVJrGwlqrO5boW2WkGRn+RBNOjr1C+OCULiMBjffoY RXiAgaqXfP+6rVCx+dSQV/Ye+C6ez9hHqpPaTImw1OzW2sr0obXPYqESelp/WdT+bDyY 83iQKEnNHkiEUYAYgXy+pvCLPh0qZvuaWiS5PX/PcQzmKBhnuj3lTvdPL+Nhlg23ODCv gPAODbdVHGrFFU/cJkeX/L7D1lZCoKMcJdSqoF8iNrtDVL8Zk/NA+H8O8WUm7GBnOUsj mQdw== X-Gm-Message-State: APjAAAVAMLWzJE6SzAuth9kv7hxYccin3x9auyf/sEeyf1aZv8hg00aH S8evcCLeXdFWTCASG2pSXIm+Dmc2RQL3+G6aJBAoiA== X-Google-Smtp-Source: APXvYqz/NarsZ1RFuKrAUxdKpaKRPsMJeKFch6Pfq9no/XPUXq2I29eblMPlKWnPxn4jB/4zZiOk3ZFUxjvkNJ//drE= X-Received: by 2002:aca:da41:: with SMTP id r62mr8722272oig.47.1571422011755; Fri, 18 Oct 2019 11:06:51 -0700 (PDT) MIME-Version: 1.0 References: <20191017212858.13230-1-axboe@kernel.dk> <20191017212858.13230-2-axboe@kernel.dk> <0fb9d9a0-6251-c4bd-71b0-6e34c6a1aab8@kernel.dk> <572f40fb-201c-99ce-b3f5-05ff9369b895@kernel.dk> <20b44cc0-87b1-7bf8-d20e-f6131da9d130@kernel.dk> <2d208fc8-7c24-bca5-3d4a-796a5a8267eb@kernel.dk> In-Reply-To: <2d208fc8-7c24-bca5-3d4a-796a5a8267eb@kernel.dk> From: Jann Horn Date: Fri, 18 Oct 2019 20:06:25 +0200 Message-ID: Subject: Re: [PATCH 1/3] io_uring: add support for async work inheriting files table To: Jens Axboe Cc: linux-block@vger.kernel.org, "David S. Miller" , Network Development Content-Type: text/plain; charset="UTF-8" Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org On Fri, Oct 18, 2019 at 7:05 PM Jens Axboe wrote: > On 10/18/19 10:36 AM, Jens Axboe wrote: > >> Ignoring the locking elision, basically the logic is now this: > >> > >> static void io_sq_wq_submit_work(struct work_struct *work) > >> { > >> struct io_kiocb *req = container_of(work, struct io_kiocb, work); > >> struct files_struct *cur_files = NULL, *old_files; > >> [...] > >> old_files = current->files; > >> [...] > >> do { > >> struct sqe_submit *s = &req->submit; > >> [...] > >> if (cur_files) > >> /* drop cur_files reference; borrow lifetime must > >> * end before here */ > >> put_files_struct(cur_files); > >> /* move reference ownership to cur_files */ > >> cur_files = s->files; > >> if (cur_files) { > >> task_lock(current); > >> /* current->files borrows reference from cur_files; > >> * existing borrow from previous loop ends here */ > >> current->files = cur_files; > >> task_unlock(current); > >> } > >> > >> [call __io_submit_sqe()] > >> [...] > >> } while (req); > >> [...] > >> /* existing borrow ends here */ > >> task_lock(current); > >> current->files = old_files; > >> task_unlock(current); > >> if (cur_files) > >> /* drop cur_files reference; borrow lifetime must > >> * end before here */ > >> put_files_struct(cur_files); > >> } > >> > >> If you run two iterations of this loop, with a first element that has > >> a ->files pointer and a second element that doesn't, then in the > >> second run through the loop, the reference to the files_struct will be > >> dropped while current->files still points to it; current->files is > >> only reset after the loop has ended. If someone accesses > >> current->files through procfs directly after that, AFAICS you'd get a > >> use-after-free. > > > > Amazing how this is still broken. You are right, and it's especially > > annoying since that's exactly the case I originally talked about (not > > flipping current->files if we don't have to). I just did it wrong, so > > we'll leave a dangling pointer in ->files. > > > > The by far most common case is if one sqe has a files it needs to > > attach, then others that also have files will be the same set. So I want > > to optimize for the case where we only flip current->files once when we > > see the files, and once when we're done with the loop. > > > > Let me see if I can get this right... > > I _think_ the simplest way to do it is simply to have both cur_files and > current->files hold a reference to the file table. That won't really add > any extra cost as the double increments / decrements are following each > other. Something like this incremental, totally untested. > > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 2fed0badad38..b3cf3f3d7911 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -2293,9 +2293,14 @@ static void io_sq_wq_submit_work(struct work_struct *work) > put_files_struct(cur_files); > cur_files = s->files; > if (cur_files && cur_files != current->files) { > + struct files_struct *old; > + > + atomic_inc(&cur_files->count); > task_lock(current); > + old = current->files; > current->files = cur_files; > task_unlock(current); > + put_files_struct(old); > } > > if (!ret) { > @@ -2390,9 +2395,13 @@ static void io_sq_wq_submit_work(struct work_struct *work) > mmput(cur_mm); > } > if (old_files != current->files) { > + struct files_struct *old; > + > task_lock(current); > + old = current->files; > current->files = old_files; > task_unlock(current); > + put_files_struct(old); > } > if (cur_files) > put_files_struct(cur_files); The only part I still feel a bit twitchy about is this part at the end: if (old_files != current->files) { struct files_struct *old; task_lock(current); old = current->files; current->files = old_files; task_unlock(current); put_files_struct(old); } If it was possible for the initial ->files to be the same as the ->files of a submission, and we got two submissions with first a different files_struct and then our old one, then this branch would not be executed even though it should, which would leave the refcount of the files_struct one too high. But that probably can't happen? Since kernel workers should be running with &init_files (I think?) and that thing is never used for userspace tasks. But still, I'd feel better if you could change it like this: diff --git a/fs/io_uring.c b/fs/io_uring.c index f9f5c70564f0..7673035d6bfe 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -2265,6 +2265,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct files_struct *cur_files = NULL, *old_files; + bool restore_current_files = false; struct io_ring_ctx *ctx = req->ctx; struct mm_struct *cur_mm = NULL; struct async_list *async_list; @@ -2313,6 +2314,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) current->files = cur_files; task_unlock(current); put_files_struct(old); + restore_current_files = true; } if (!ret) { @@ -2406,7 +2408,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) unuse_mm(cur_mm); mmput(cur_mm); } - if (old_files != current->files) { + if (restore_current_files) { struct files_struct *old; task_lock(current); But actually, by the way: Is this whole files_struct thing creating a reference loop? The files_struct has a reference to the uring file, and the uring file has ACCEPT work that has a reference to the files_struct. If the task gets killed and the accept work blocks, the entire files_struct will stay alive, right?