From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 85392C34049 for ; Tue, 18 Feb 2020 20:37:22 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 5575D2173E for ; Tue, 18 Feb 2020 20:37:22 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 5575D2173E Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id DF01F6B0005; Tue, 18 Feb 2020 15:37:21 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id DA0936B0006; Tue, 18 Feb 2020 15:37:21 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C69F06B0007; Tue, 18 Feb 2020 15:37:21 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0162.hostedemail.com [216.40.44.162]) by kanga.kvack.org (Postfix) with ESMTP id AC3B56B0005 for ; Tue, 18 Feb 2020 15:37:21 -0500 (EST) Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 740DD2C9D for ; Tue, 18 Feb 2020 20:37:21 +0000 (UTC) X-FDA: 76504407882.10.roof16_2305610ffc83e X-HE-Tag: roof16_2305610ffc83e X-Filterd-Recvd-Size: 7867 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by imf25.hostedemail.com (Postfix) with ESMTP for ; Tue, 18 Feb 2020 20:37:20 +0000 (UTC) X-Amp-Result: UNKNOWN X-Amp-Original-Verdict: FILE UNKNOWN X-Amp-File-Uploaded: False Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga102.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 18 Feb 2020 12:37:18 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.70,457,1574150400"; d="scan'208";a="228865923" Received: from sjchrist-coffee.jf.intel.com (HELO linux.intel.com) ([10.54.74.202]) by orsmga008.jf.intel.com with ESMTP; 18 Feb 2020 12:37:17 -0800 Date: Tue, 18 Feb 2020 12:37:17 -0800 From: Sean Christopherson To: "Longpeng(Mike)" Cc: mike.kravetz@oracle.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, arei.gonglei@huawei.com, weidong.huang@huawei.com, weifuqiang@huawei.com, kvm@vger.kernel.org Subject: Re: [PATCH] mm/hugetlb: avoid get wrong ptep caused by race Message-ID: <20200218203717.GE28156@linux.intel.com> References: <1582027825-112728-1-git-send-email-longpeng2@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1582027825-112728-1-git-send-email-longpeng2@huawei.com> User-Agent: Mutt/1.5.24 (2015-08-30) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, Feb 18, 2020 at 08:10:25PM +0800, Longpeng(Mike) wrote: > Our machine encountered a panic after run for a long time and > the calltrace is: What's the actual panic? Is it a BUG() in hugetlb_fault(), a bad pointer dereference, etc...? > RIP: 0010:[] [] hugetlb_fault+0x307/0xbe0 > RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286 > RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48 > RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48 > RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080 > R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8 > R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074 > FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > [] ? unlock_page+0x2b/0x30 > [] ? hugetlb_fault+0x222/0xbe0 > [] follow_hugetlb_page+0x175/0x540 > [] ? cpumask_next_and+0x35/0x50 > [] __get_user_pages+0x2a0/0x7e0 > [] __get_user_pages_unlocked+0x15d/0x210 > [] __gfn_to_pfn_memslot+0x3c5/0x460 [kvm] > [] try_async_pf+0x6e/0x2a0 [kvm] > [] tdp_page_fault+0x151/0x2d0 [kvm] > [] ? vmx_vcpu_run+0x2ec/0xc80 [kvm_intel] > [] ? vmx_vcpu_run+0x2f8/0xc80 [kvm_intel] > [] kvm_mmu_page_fault+0x31/0x140 [kvm] > [] handle_ept_violation+0x9e/0x170 [kvm_intel] > [] vmx_handle_exit+0x2bc/0xc70 [kvm_intel] > [] ? __vmx_complete_interrupts.part.73+0x80/0xd0 [kvm_intel] > [] ? vmx_vcpu_run+0x490/0xc80 [kvm_intel] > [] vcpu_enter_guest+0x7be/0x13a0 [kvm] > [] ? kvm_check_async_pf_completion+0x8e/0xb0 [kvm] > [] kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm] > [] kvm_vcpu_ioctl+0x309/0x6d0 [kvm] > [] ? dequeue_signal+0x32/0x180 > [] ? do_sigtimedwait+0xcd/0x230 > [] do_vfs_ioctl+0x3f0/0x540 > [] SyS_ioctl+0xa1/0xc0 > [] system_call_fastpath+0x22/0x27 > > ( The kernel we used is older, but we think the latest kernel also has this > bug after dig into this problem. ) > > For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it > may return a wrong 'pmdp' if there is a race. Please look at the following > code snippet: > ... > pud = pud_offset(p4d, addr); > if (sz != PUD_SIZE && pud_none(*pud)) > return NULL; > /* hugepage or swap? */ > if (pud_huge(*pud) || !pud_present(*pud)) > return (pte_t *)pud; > > pmd = pmd_offset(pud, addr); > if (sz != PMD_SIZE && pmd_none(*pmd)) > return NULL; > /* hugepage or swap? */ > if (pmd_huge(*pmd) || !pmd_present(*pmd)) > return (pte_t *)pmd; > ... > > The following sequence would trigger this bug: > 1. CPU0: sz = PUD_SIZE and *pud = 0 , continue > 1. CPU0: "pud_huge(*pud)" is false > 2. CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT) > 3. CPU0: "!pud_present(*pud)" is false, continue > 4. CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp > However, we want CPU0 to return NULL or pudp. > > We can avoid this race by read the pud only once. Are there any other options for avoiding the panic you hit? I ask because there are a variety of flows that use a very similar code pattern, e.g. lookup_address_in_pgd(), and using READ_ONCE() in huge_pte_offset() but not other flows could be confusing (or in my case, anxiety inducing[*]). At the least, adding a comment in huge_pte_offset() to explain the need for READ_ONCE() would be helpful. [*] In kernel 5.6, KVM is moving to using lookup_address_in_pgd() (via lookup_address_in_mm()) to identify large page mappings. The function itself is susceptible to such a race, but KVM only does the lookup after it has done gup() and also ensures any zapping of ptes will cause KVM to restart the faulting (guest) instruction or that the zap will be blocked until after KVM does the lookup, i.e. racing with a transition from !PRESENT -> PRESENT should be impossible (in theory). > Signed-off-by: Longpeng(Mike) > --- > mm/hugetlb.c | 34 ++++++++++++++++++---------------- > 1 file changed, 18 insertions(+), 16 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index dd8737a..3bde229 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4908,31 +4908,33 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, > pte_t *huge_pte_offset(struct mm_struct *mm, > unsigned long addr, unsigned long sz) > { > - pgd_t *pgd; > - p4d_t *p4d; > - pud_t *pud; > - pmd_t *pmd; > + pgd_t *pgdp; > + p4d_t *p4dp; > + pud_t *pudp, pud; > + pmd_t *pmdp, pmd; > > - pgd = pgd_offset(mm, addr); > - if (!pgd_present(*pgd)) > + pgdp = pgd_offset(mm, addr); > + if (!pgd_present(*pgdp)) > return NULL; > - p4d = p4d_offset(pgd, addr); > - if (!p4d_present(*p4d)) > + p4dp = p4d_offset(pgdp, addr); > + if (!p4d_present(*p4dp)) > return NULL; > > - pud = pud_offset(p4d, addr); > - if (sz != PUD_SIZE && pud_none(*pud)) > + pudp = pud_offset(p4dp, addr); > + pud = READ_ONCE(*pudp); > + if (sz != PUD_SIZE && pud_none(pud)) > return NULL; > /* hugepage or swap? */ > - if (pud_huge(*pud) || !pud_present(*pud)) > - return (pte_t *)pud; > + if (pud_huge(pud) || !pud_present(pud)) > + return (pte_t *)pudp; > > - pmd = pmd_offset(pud, addr); > - if (sz != PMD_SIZE && pmd_none(*pmd)) > + pmdp = pmd_offset(pudp, addr); > + pmd = READ_ONCE(*pmdp); > + if (sz != PMD_SIZE && pmd_none(pmd)) > return NULL; > /* hugepage or swap? */ > - if (pmd_huge(*pmd) || !pmd_present(*pmd)) > - return (pte_t *)pmd; > + if (pmd_huge(pmd) || !pmd_present(pmd)) > + return (pte_t *)pmdp; > > return NULL; > } > -- > 1.8.3.1 > >