From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song <songmuchun@bytedance.com>
Date: Thu, 24 Jun 2021 11:38:08 +0800
Subject: Re: [External] [PATCH 2/2] hugetlb: address ref count racing in prep_compound_gigantic_page
To: Mike Kravetz
Cc: Naoya Horiguchi, Linux Memory Management List, LKML, Jann Horn,
 Youquan Song, Andrea Arcangeli, Jan Kara, John Hubbard,
 "Kirill A . Shutemov", Matthew Wilcox, Michal Hocko, Andrew Morton
In-Reply-To: <39674952-44fc-8386-39b7-9e0862aaa991@oracle.com>
References: <20210622021423.154662-1-mike.kravetz@oracle.com>
 <20210622021423.154662-3-mike.kravetz@oracle.com>
 <39674952-44fc-8386-39b7-9e0862aaa991@oracle.com>

On Thu, Jun 24, 2021 at 8:26 AM Mike Kravetz wrote:
>
> Cc: Naoya
>
> On 6/23/21 1:00 AM, Muchun Song wrote:
> > On Tue, Jun 22, 2021 at 10:15 AM Mike Kravetz wrote:
> >>
> >> In [1], Jann Horn points out a possible race between
> >> prep_compound_gigantic_page and __page_cache_add_speculative. The
> >> root cause of the possible race is prep_compound_gigantic_page
> >> unconditionally setting the ref count of pages to zero. It does this
> >> because prep_compound_gigantic_page is handed a 'group' of pages from an
> >> allocator and needs to convert that group of pages to a compound page.
> >> The ref count of each page in this 'group' is one as set by the
> >> allocator. However, the ref count of compound page tail pages must be
> >> zero.
> >>
> >> The potential race comes about when ref counted pages are returned from
> >> the allocator. When this happens, other mm code could also take a
> >> reference on the page. __page_cache_add_speculative is one such
> >> example. Therefore, prep_compound_gigantic_page can not just set the
> >> ref count of pages to zero as it does today. Doing so would lose the
> >> reference taken by any other code. This would lead to BUGs in code
> >> checking ref counts and could possibly even lead to memory corruption.
> >
> > Hi Mike,
> >
> > Well, it took me some time to get the race. It also makes me think more
> > about this. See the code snippet below in gather_surplus_pages():
> >
> >         zeroed = put_page_testzero(page);
> >         VM_BUG_ON_PAGE(!zeroed, page);
> >         enqueue_huge_page(h, page);
> >
> > The VM_BUG_ON_PAGE() can be triggered because of a similar
> > race, right? IIUC, we should also fix this.
>
> Thanks for taking a look at this Muchun.
>
> I believe you are correct. Page allocators (even buddy) will hand back
> a ref counted head page. Any other code 'could' take a reference on the
> head page before the pages are made into a hugetlb page. Once the pages
> become a hugetlb page (PageHuge() true), then only hugetlb specific
> code should be modifying the ref count.
> So, it seems the 'race window' is from the time the pages are returned
> from a low level allocator until the time the pages become a hugetlb
> page. Does that sound correct?

I have a question about this: why must the ref count of a hugetlb page
be managed only by hugetlb specific code? What if we broke this rule?
If so, we could fix this issue easily.

CPU0:                                   CPU1:

                                        page = xas_load()
// the page is freed to the buddy
page = alloc_surplus_huge_page()
if (put_page_testzero(page))
        enqueue_huge_page(h, page)
                                        page_cache_get_speculative(page)
                                        if (unlikely(page != xas_reload(&xas)))
                                                // this can help us free the
                                                // hugetlb page to the pool
                                                put_page(page)

If someone gets the page successfully before we put it in
gather_surplus_pages(), then that user will add the hugetlb page to the
pool for us when it calls put_page().

>
> If we want to check for and handle such a race, we would need to do so
> in prep_new_huge_page. After setting the destructor we would need to
> check for an increased ref count (> 1). Not sure if we would need a
> memory barrier or some other type of synchronization for this? This of
> course means that prep_new_huge_page could return an error, and we
> would need to deal with that in all callers.

As described above, IIUC, we do not need to change the behavior of
prep_new_huge_page. We just need to change gather_surplus_pages() as
below (a standalone sketch of this reference hand-off also follows at
the end of this mail). These are just my thoughts; maybe I am wrong.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3b2a8a494d6..8a1a75534543 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2160,17 +2160,11 @@ static int gather_surplus_pages(struct hstate *h, long delta)

 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		int zeroed;
-
 		if ((--needed) < 0)
 			break;
-		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
-		 */
-		zeroed = put_page_testzero(page);
-		VM_BUG_ON_PAGE(!zeroed, page);
-		enqueue_huge_page(h, page);
+
+		if (put_page_testzero(page))
+			enqueue_huge_page(h, page);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);

>
> I went back and looked at those lines in gather_surplus_pages:
>
>       zeroed = put_page_testzero(page);
>       VM_BUG_ON_PAGE(!zeroed, page);
>       enqueue_huge_page(h, page);
>
> They were first added as part of alloc_buddy_huge_page with commit
> 2668db9111bb - hugetlb: correct page count for surplus huge pages.
> It appears the reason for the VM_BUG_ON is that prior hugetlb code
> forgot to account for the ref count provided by the buddy allocator.
> The VM_BUG_ON may have been added mostly as a sanity check for hugetlb
> ref count management.
>
> I wonder if we have ever hit that VM_BUG_ON in the 13 years it has been
> in the code? I know you recently spotted the potential race with memory
> error handling and Naoya fixed up the memory error code.
>
> I'm OK with modifying prep_new_huge_page, but it is going to be a bit
> messy (like this patch). I wonder if there are other less intrusive
> ways to address this potential issue?
> --
> Mike Kravetz
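P.S. To make the proposed hand-off easier to see outside of kernel
context, here is a minimal userspace sketch using C11 atomics and
pthreads. The names fake_page, speculative_get() and
put_and_maybe_enqueue() are invented for illustration and are not
kernel interfaces; they only model the "whoever drops the last
reference enqueues the page" behaviour that the diff above relies on.

/*
 * Minimal userspace model (C11 atomics + pthreads), NOT kernel code.
 * fake_page, speculative_get() and put_and_maybe_enqueue() stand in
 * for struct page, page_cache_get_speculative() and the
 * put_page_testzero()/enqueue_huge_page() pair discussed above.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_page {
	atomic_int  refcount;	/* allocator hands the page out with refcount == 1 */
	atomic_bool enqueued;	/* set once the "hugetlb pool" owns the page */
};

/* Model of a speculative get: only succeed if the count is non-zero. */
static bool speculative_get(struct fake_page *p)
{
	int c = atomic_load(&p->refcount);

	while (c != 0) {
		if (atomic_compare_exchange_weak(&p->refcount, &c, c + 1))
			return true;	/* took a reference */
	}
	return false;			/* page already hit zero */
}

/* Model of "if (put_page_testzero(page)) enqueue_huge_page(h, page)":
 * whichever context drops the last reference enqueues the page. */
static void put_and_maybe_enqueue(struct fake_page *p)
{
	if (atomic_fetch_sub(&p->refcount, 1) == 1)
		atomic_store(&p->enqueued, true);
}

/* The "CPU1" side of the diagram: a speculative lookup that loses the
 * race and simply drops its reference again. */
static void *speculative_thread(void *arg)
{
	struct fake_page *p = arg;

	if (speculative_get(p))
		put_and_maybe_enqueue(p);	/* acts like put_page() */
	return NULL;
}

int main(void)
{
	struct fake_page page = { .refcount = 1, .enqueued = false };
	pthread_t t;

	pthread_create(&t, NULL, speculative_thread, &page);

	/* The "CPU0" / gather_surplus_pages() side: drop the allocator's
	 * reference and enqueue only if we really were the last user. */
	put_and_maybe_enqueue(&page);

	pthread_join(&t, NULL);
	printf("refcount=%d enqueued=%d\n",
	       atomic_load(&page.refcount), (int)atomic_load(&page.enqueued));
	return 0;
}

In every interleaving the page ends up enqueued exactly once and the
final refcount is zero, which is why the conditional
put_page_testzero() above no longer needs the VM_BUG_ON_PAGE().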