Date: Mon, 14 Jun 2021 19:00:32 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: Jann Horn <jannh@google.com>
Shutemov" , John Hubbard , Jan Kara , stable@vger.kernel.org, Michal Hocko Subject: Re: [PATCH v2] mm/gup: fix try_grab_compound_head() race with split_huge_page() Message-Id: <20210614190032.09d8b7ac530c8b14ace44b82@linux-foundation.org> In-Reply-To: <20210615012014.1100672-1-jannh@google.com> References: <20210615012014.1100672-1-jannh@google.com> X-Mailer: Sylpheed 3.5.1 (GTK+ 2.24.31; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=linux-foundation.org header.s=korg header.b="jGz/ZdF0"; dmarc=none; spf=pass (imf21.hostedemail.com: domain of akpm@linux-foundation.org designates 198.145.29.99 as permitted sender) smtp.mailfrom=akpm@linux-foundation.org X-Rspamd-Server: rspam02 X-Stat-Signature: p39s5m3j7nzdwosgskzt15b5yrx8dn1s X-Rspamd-Queue-Id: 5FC0DE00027F X-HE-Tag: 1623722422-301447 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, 15 Jun 2021 03:20:14 +0200 Jann Horn wrote: > try_grab_compound_head() is used to grab a reference to a page from > get_user_pages_fast(), which is only protected against concurrent > freeing of page tables (via local_irq_save()), but not against > concurrent TLB flushes, freeing of data pages, or splitting of compound > pages. > > Because no reference is held to the page when try_grab_compound_head() > is called, the page may have been freed and reallocated by the time its > refcount has been elevated; therefore, once we're holding a stable > reference to the page, the caller re-checks whether the PTE still points > to the same page (with the same access rights). > > The problem is that try_grab_compound_head() has to grab a reference on > the head page; but between the time we look up what the head page is and > the time we actually grab a reference on the head page, the compound > page may have been split up (either explicitly through split_huge_page() > or by freeing the compound page to the buddy allocator and then > allocating its individual order-0 pages). > If that happens, get_user_pages_fast() may end up returning the right > page but lifting the refcount on a now-unrelated page, leading to > use-after-free of pages. > > To fix it: > Re-check whether the pages still belong together after lifting the > refcount on the head page. > Move anything else that checks compound_head(page) below the refcount > increment. > > This can't actually happen on bare-metal x86 (because there, disabling > IRQs locks out remote TLB flushes), but it can happen on virtualized x86 > (e.g. under KVM) and probably also on arm64. The race window is pretty > narrow, and constantly allocating and shattering hugepages isn't exactly > fast; for now I've only managed to reproduce this in an x86 KVM guest with > an artificially widened timing window (by adding a loop that repeatedly > calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, > so that PV TLB flushes are used instead of IPIs). > > As requested on the list, also replace the existing VM_BUG_ON_PAGE() > with a warning and bailout. Since the existing code only performed the > BUG_ON check on DEBUG_VM kernels, ensure that the new code also only > performs the check under that configuration - I don't want to mix two > logically separate changes together too much. 
> The macro VM_WARN_ON_ONCE_PAGE() doesn't return a value on !DEBUG_VM,
> so wrap the whole check in an #ifdef block.
> An alternative would be to change the VM_WARN_ON_ONCE_PAGE() definition
> for !DEBUG_VM such that it always returns false, but since that would
> differ from the behavior of the normal WARN macros, it might be too
> confusing for readers.
>
> ...
>
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -43,8 +43,25 @@ static void hpage_pincount_sub(struct page *page, int refs)
>
> 	atomic_sub(refs, compound_pincount_ptr(page));
>  }
>
> +/* Equivalent to calling put_page() @refs times. */
> +static void put_page_refs(struct page *page, int refs)
> +{
> +#ifdef CONFIG_DEBUG_VM
> +	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
> +		return;
> +#endif

Well dang those ifdefs. With CONFIG_DEBUG_VM=n, this expands to

	if (((void)(sizeof((__force long)(page_ref_count(page) < refs)))))
		return;

which will fail with "void value not ignored as it ought to be". Because
VM_WARN_ON_ONCE_PAGE() is an rval with CONFIG_DEBUG_VM=y and is not an
rval with CONFIG_DEBUG_VM=n. So the ifdefs are needed.

I know we've been around this loop before, but it still sucks! Someone
please remind me of the reasoning?

Can we do

	#define VM_WARN_ON_ONCE_PAGE(cond, page) { BUILD_BUG_ON_INVALID(cond); cond; }

?
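
For illustration only (this is not the current mmdebug.h definition): a
plain { ... } block is a statement, not an expression, in C, so a variant
along the lines suggested above would need a GNU statement expression to
stay usable as an rval, roughly:

	#define VM_WARN_ON_ONCE_PAGE(cond, page) ({			\
		/* keep the build-time check, as on !DEBUG_VM today */	\
		BUILD_BUG_ON_INVALID(cond);				\
		cond;	/* value of the statement expression */		\
	})

With something like that, the put_page_refs() hunk quoted above could drop
its #ifdef/#endif pair, at the cost of evaluating the condition (and taking
the early return) on !DEBUG_VM kernels as well, which differs from the
"only check under DEBUG_VM" behavior the commit message aims to preserve.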
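
Separately, as context for readers skimming the thread: the race and fix
described in the quoted commit message come down to a re-check pattern
roughly like the sketch below. This is a simplified illustration, not the
actual mm/gup.c change; grab_head_refs() is a hypothetical stand-in for
whatever elevates the head page's refcount:

	static struct page *grab_compound_head_sketch(struct page *page, int refs)
	{
		struct page *head = compound_head(page);

		/* Speculatively take @refs references on what we think is the head. */
		if (!grab_head_refs(head, refs))	/* hypothetical helper */
			return NULL;

		/*
		 * The compound page may have been split (or freed and reallocated
		 * as order-0 pages) between the compound_head() lookup above and
		 * the refcount bump, in which case @head no longer belongs to
		 * @page. Drop the references we just took and let the caller
		 * fall back to the slow path.
		 */
		if (unlikely(compound_head(page) != head)) {
			put_page_refs(head, refs);	/* helper added by this patch */
			return NULL;
		}

		return head;
	}

The key point is that the compound_head() result is only trusted after it
has been re-validated against the page, with the refcount already held.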