Subject: Re: [PATCH] mm/hugetlb: avoid get wrong ptep caused by race
To: Mike Kravetz, Sean Christopherson
CC: "Kirill A. Shutemov", Matthew Wilcox
From: "Longpeng (Mike)"
Date: Sat, 22 Feb 2020 10:15:47 +0800

On 2020/2/21 8:22, Mike Kravetz wrote:
> On 2/19/20 6:30 PM, Longpeng (Mike) wrote:
>> On 2020/2/20 3:33, Mike Kravetz wrote:
>>> + Kirill
>>> On 2/18/20 5:58 PM, Sean Christopherson wrote:
>>>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote:
>
>>>> The race and the fix make sense.  I assumed dereferencing garbage from the
>>>> huge page was the issue, but I wasn't 100% sure that was the case, which is
>>>> why I asked about alternative fixes.
>>>>
>>>>> We change the code from
>>>>>         if (pud_huge(*pud) || !pud_present(*pud))
>>>>> to
>>>>>         if (pud_huge(*pud))
>>>>>                 return (pte_t *)pud;
>>>>>         /* busy loop for 500ms */
>>>>>         if (!pud_present(*pud))
>>>>>                 return (pte_t *)pud;
>>>>> and the panic will be hit quickly.
>>>>>
>>>>> ARM64 already uses READ_ONCE/WRITE_ONCE to access the page tables; look at
>>>>> commit 20a004e7 ("arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page
>>>>> tables").
>>>>>
>>>>> The root cause is that 'if (pud_huge(*pud) || !pud_present(*pud))' reads the
>>>>> entry from the pud twice, and *pud may change in a race, so we should read
>>>>> the pud only once.  I use READ_ONCE here just to be safe, to prevent
>>>>> compiler mischief if possible.
>>>>
>>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g.
>>>> convert everything as a follow-up patch (or patches).  I'm fairly confident
>>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly
>>>> bet my life on it.  I'd much rather the failing scenario be that KVM uses
>>>> a sub-optimal page size as opposed to exploding on a bad pointer.
>>>
>>> Longpeng(Mike) asked in another e-mail specifically about making similar
>>> changes to lookup_address_in_mm().  Replying here as there is more context.
>>>
>>> I 'think' lookup_address_in_mm is safe from this issue.  Why?  IIUC, the
>>> problem with the huge_pte_offset routine is that the pud changes from
>>> pud_none() to pud_huge() in the middle of
>>> 'if (pud_huge(*pud) || !pud_present(*pud))'.  In the case of
>>> lookup_address_in_mm, we know the pud was not pud_none() as it was previously
>>> checked.  I am not aware of any other state transitions which could cause
>>> us trouble.  However, I am no expert in this area.
>
> Bad copy/paste by me.  Longpeng(Mike) was asking about lookup_address_in_pgd.
>
>> So... I just need to fix huge_pte_offset in mm/hugetlb.c, right?
>
> Let's start with just a fix for huge_pte_offset() as you can easily reproduce
> that issue by adding a delay.

All right, I'll send V2 to fix it, thanks :)

>> Is it possible the pud changes from pud_huge() to pud_none() while another CPU
>> is walking the pagetable?
>
> I believe it is possible.  If we hole punch a hugetlbfs file, we will clear
> the corresponding puds.  Hence, we can go from pud_huge() to pud_none().
> Unless I am missing something, that does imply we could have issues in places
> such as lookup_address_in_pgd:
>
>         pud = pud_offset(p4d, address);
>         if (pud_none(*pud))
>                 return NULL;
>
>         *level = PG_LEVEL_1G;
>         if (pud_large(*pud) || !pud_present(*pud))
>                 return (pte_t *)pud;
>
> I hope I am wrong, but it seems like pud_none(*pud) could become true after
> the initial check, and before the pud_large() check.  If so, there could be
> a problem (addressing exception) when the code continues and looks up the pmd.
>
>         pmd = pmd_offset(pud, address);
>         if (pmd_none(*pmd))
>                 return NULL;
>
> It has been mentioned before that there are many page table walks like this.
> What am I missing that prevents races like this?  Or, have we just been lucky?

That's what I worry about.  Maybe there is no use case that hits it.

-- 
Regards,
Longpeng(Mike)
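[Editorial aside: the window described above can be demonstrated deterministically with a toy model that scripts what successive dereferences of the pud entry observe, mimicking another CPU modifying the entry between the two checks of `if (pud_huge(*pud) || !pud_present(*pud))`. The script array, flag bits, and helper names are all hypothetical illustration, not kernel code.]

```c
#include <stdint.h>

#define PUD_PRESENT 0x1ull
#define PUD_HUGE    0x2ull

/* Each call returns the next scripted value, simulating what a walker
 * observes on successive dereferences of *pud while another CPU races. */
static const uint64_t *script;
static int script_pos;

static uint64_t read_pud(void)
{
	return script[script_pos++];
}

/* Mirrors 'if (pud_huge(*pud) || !pud_present(*pud))': the entry is
 * dereferenced twice.  Returns 1 if the walker descends to the pmd. */
static int racy_walk_descends(const uint64_t *vals)
{
	script = vals;
	script_pos = 0;

	if (read_pud() & PUD_HUGE)		/* 1st read of the entry */
		return 0;			/* returned as a huge pte */
	if (!(read_pud() & PUD_PRESENT))	/* 2nd read of the entry */
		return 0;			/* returned as not present */
	return 1;				/* descends to the pmd level */
}

/* Single-snapshot version: both checks test the same value, so no
 * interleaving of reads with a racing writer is possible. */
static int safe_walk_descends(uint64_t entry)
{
	if ((entry & PUD_HUGE) || !(entry & PUD_PRESENT))
		return 0;
	return 1;
}
```

With the script { empty entry, then huge present entry }, the double-read walker falls through both checks and descends, i.e. it would interpret a huge-page frame as a pmd table; the single-snapshot walker stops for either value.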