Date: Wed, 28 Apr 2021 16:46:54 +0900
From: Naoya Horiguchi
To: Mike Kravetz, Michal Hocko, Muchun Song, "akpm@linux-foundation.org", "osalvador@suse.de", "linux-mm@kvack.org"
Cc: "linux-kernel@vger.kernel.org", Naoya Horiguchi
Subject: [PATCH] mm,hwpoison: fix race with compound page allocation
Message-ID: <20210428074654.GA2093897@u2004>
In-Reply-To: <20210423080153.GA78658@hori.linux.bs1.fc.nec.co.jp>

On Fri, Apr 23, 2021 at 08:01:54AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Apr 22, 2021 at 08:27:46AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Apr 21, 2021 at 11:03:24AM -0700, Mike Kravetz wrote:
> > > On 4/21/21 1:33 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > On Wed, Apr 21, 2021 at 10:03:34AM +0200, Michal Hocko wrote:
> > > >> [Cc Naoya]
> > > >>
> > > >> On Wed 21-04-21 14:02:59, Muchun Song wrote:
> > > >>> The possible bad scenario:
> > > >>>
> > > >>> CPU0:                           CPU1:
> > > >>>
> > > >>>                                 gather_surplus_pages()
> > > >>>                                   page = alloc_surplus_huge_page()
> > > >>> memory_failure_hugetlb()
> > > >>>   get_hwpoison_page(page)
> > > >>>     __get_hwpoison_page(page)
> > > >>>       get_page_unless_zero(page)
> > > >>>                                   zero = put_page_testzero(page)
> > > >>>                                   VM_BUG_ON_PAGE(!zero, page)
> > > >>>                                   enqueue_huge_page(h, page)
> > > >>>   put_page(page)
> > > >>>
> > > >>> The refcount can possibly be increased by memory-failure or soft_offline
> > > >>> handlers, we can trigger VM_BUG_ON_PAGE and wrongly add the page to the
> > > >>> hugetlb pool list.
> > > >>
> > > >> The hwpoison side of this looks really suspicious to me. It shouldn't
> > > >> really touch the reference count of hugetlb pages without being very
> > > >> careful (and having hugetlb_lock held).
> > > >
> > > > I have the same feeling, there is a window where a hugepage is refcounted
> > > > during converting from buddy free pages into free hugepage, so refcount
> > > > alone is not enough to prevent the race.  hugetlb_lock is retaken after
> > > > alloc_surplus_huge_page returns, so simply holding hugetlb_lock in
> > > > get_hwpoison_page() seems not work.  Is there any status bit to show that a
> > > > hugepage is just being initialized (not in free hugepage pool or in use)?
> > > >
> > >
> > > It seems we can also race with the code that makes a compound page a
> > > hugetlb page.  The memory failure code could be called after allocating
> > > pages from buddy and before setting compound page DTOR.  So, the memory
> > > handling code will process it as a compound page.
> >
> > Yes, so get_hwpoison_page() has to call get_page_unless_zero()
> > only when memory_failure() can surely handle the error.
> >
> > >
> > > Just thinking that this may not be limited to the hugetlb specific memory
> > > failure handling?
> >
> > Currently hugetlb page is the only type of compound page supported by memory
> > failure.  But I agree with you that other types of compound pages have the
> > same race window, and judging only with get_page_unless_zero() is dangerous.
> > So I think that __get_hwpoison_page() should have the following structure:
> >
> >   if (PageCompound) {
> >       if (PageHuge) {
> >           if (PageHugeFreed || PageHugeActive) {
> >               if (get_page_unless_zero)
> >                   return 0;   // path for in-use hugetlb page
> >               else
> >                   return 1;   // path for free hugetlb page
> >           } else {
> >               return -EBUSY;  // any transient hugetlb page
> >           }
> >       } else {
> >           ... // any other compound page (like thp, slab, ...)
> >       }
> >   } else {
> >       ...   // any non-compound page
> >   }
>
> The above pseudo code was wrong, so let me update my thought.
> I'm now trying to solve the reported issue by changing __get_hwpoison_page()
> like below:
>
>   static int __get_hwpoison_page(struct page *page)
>   {
>           struct page *head = compound_head(page);
>
>           if (PageCompound(page)) {
>                   if (PageSlab(page)) {
>                           return get_page_unless_zero(page);
>                   } else if (PageHuge(head)) {
>                           if (HPageFreed(head) || HPageMigratable(head))
>                                   return get_page_unless_zero(head);
>                   } else if (PageTransHuge(head)) {
>                           /*
>                            * Non anonymous thp exists only in allocation/free time. We
>                            * can't handle such a case correctly, so let's give it up.
>                            * This should be better than triggering BUG_ON when kernel
>                            * tries to touch the "partially handled" page.
>                            */
>                           if (!PageAnon(head)) {
>                                   pr_err("Memory failure: %#lx: non anonymous thp\n",
>                                          page_to_pfn(page));
>                                   return 0;
>                           }
>                           if (get_page_unless_zero(head)) {
>                                   if (head == compound_head(page))
>                                           return 1;
>                                   pr_info("Memory failure: %#lx cannot catch tail\n",
>                                           page_to_pfn(page));
>                                   put_page(head);
>                           }
>                   }
>                   return 0;
>           }
>
>           return get_page_unless_zero(page);
>   }
>
> Some notes:
>
>   - in hugetlb path, new HPage* checks should avoid the reported race,
>     but I still need more testing to confirm it,
>   - PageSlab check is added because otherwise I found that "non anonymous thp"
>     path is chosen, that's obviously wrong,
>   - thp's branch has a known issue unrelated to the current issue, which
>     will/should be improved later.
>
> I'll send a patch next week.

I confirmed that the patch fixes the reported problem (in the testcase
triggering VM_BUG_ON_PAGE() without this patch).  So let me suggest this
as a fix on hwpoison side.

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi
Date: Wed, 28 Apr 2021 15:55:47 +0900
Subject: [PATCH] mm,hwpoison: fix race with compound page allocation

When a hugetlb page fault (under an overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

  CPU0:                           CPU1:

                                  gather_surplus_pages()
                                    page = alloc_surplus_huge_page()
  memory_failure_hugetlb()
    get_hwpoison_page(page)
      __get_hwpoison_page(page)
        get_page_unless_zero(page)
                                    zero = put_page_testzero(page)
                                    VM_BUG_ON_PAGE(!zero, page)
                                    enqueue_huge_page(h, page)
    put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there are
time windows where compound pages have a non-zero refcount during
initialization.

So make __get_hwpoison_page() check more page status for a few types
of compound pages.  The PageSlab() check is added because otherwise the
"non anonymous thp" path is wrongly chosen for slab pages.

Signed-off-by: Naoya Horiguchi
Reported-by: Muchun Song
---
 mm/memory-failure.c | 48 +++++++++++++++++++++++++--------------------
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a3659619d293..61988e332712 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1095,30 +1095,36 @@ static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
+	if (PageCompound(page)) {
+		if (PageSlab(page)) {
+			return get_page_unless_zero(page);
+		} else if (PageHuge(head)) {
+			if (HPageFreed(head) || HPageMigratable(head))
+				return get_page_unless_zero(head);
+		} else if (PageTransHuge(head)) {
+			/*
+			 * Non anonymous thp exists only in allocation/free time. We
+			 * can't handle such a case correctly, so let's give it up.
+			 * This should be better than triggering BUG_ON when kernel
+			 * tries to touch the "partially handled" page.
+			 */
+			if (!PageAnon(head)) {
+				pr_err("Memory failure: %#lx: non anonymous thp\n",
+				       page_to_pfn(page));
+				return 0;
+			}
+			if (get_page_unless_zero(head)) {
+				if (head == compound_head(page))
+					return 1;
+				pr_info("Memory failure: %#lx cannot catch tail\n",
+					page_to_pfn(page));
+				put_page(head);
+			}
 		}
+		return 0;
 	}
 
-	if (get_page_unless_zero(head)) {
-		if (head == compound_head(page))
-			return 1;
-
-		pr_info("Memory failure: %#lx cannot catch tail\n",
-			page_to_pfn(page));
-		put_page(head);
-	}
-
-	return 0;
+	return get_page_unless_zero(page);
 }
 
 /*
-- 
2.25.1