From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.1 required=3.0 tests=DKIM_INVALID,DKIM_SIGNED, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS, USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5A562C11D00 for ; Fri, 21 Feb 2020 00:22:49 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E511424654 for ; Fri, 21 Feb 2020 00:22:48 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=oracle.com header.i=@oracle.com header.b="cVgkuu1q" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E511424654 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=oracle.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 379056B0005; Thu, 20 Feb 2020 19:22:48 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 327BB6B0006; Thu, 20 Feb 2020 19:22:48 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1C8F36B0007; Thu, 20 Feb 2020 19:22:48 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0096.hostedemail.com [216.40.44.96]) by kanga.kvack.org (Postfix) with ESMTP id 015876B0005 for ; Thu, 20 Feb 2020 19:22:47 -0500 (EST) Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id B2A80247A for ; Fri, 21 Feb 2020 00:22:47 +0000 (UTC) X-FDA: 76512233574.12.noise46_528bebf7d2d15 X-HE-Tag: noise46_528bebf7d2d15 X-Filterd-Recvd-Size: 7475 Received: from userp2130.oracle.com (userp2130.oracle.com [156.151.31.86]) by imf07.hostedemail.com (Postfix) with ESMTP for ; Fri, 21 Feb 2020 00:22:47 +0000 (UTC) Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 01L0IuBc097560; Fri, 21 Feb 2020 00:22:29 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=subject : to : cc : references : from : message-id : date : mime-version : in-reply-to : content-type : content-transfer-encoding; s=corp-2020-01-29; bh=j3orcV8N7f6tsO1hzt4d6bYa8i40eAtvmiZrDeUsowc=; b=cVgkuu1q0ENV4pHWiXR021YoDnJt/upxXvl+V7TptYMU6reyQ9uuYXRwAz5ius5oIyTO YmwNJWi92N/4cDdoxdkAni9VqOlV+FR9E4F7GFK/r7pXVJ3E0XEpuXnjvvEmOqXngjow CtPXx7EndrPPiiNOevB6oh5vwEoM5H6upA9kEQ3giDn9xX1UKjeigv4UkdsV8UvBMTHk Pty3uFwFC1DoD8CcbRLZf8hAE/WGbNVKx6LGYYj+TOz6ppUHBu/5y+TzbjpsBWBpOe4J 5X4q0LuRs/zDbbtQIJQ3skmEPVXUs+DF0+Qy1pOZKvAdgDr/l71Kaf5Szqh5LbKj/bis dg== Received: from aserp3020.oracle.com (aserp3020.oracle.com [141.146.126.70]) by userp2130.oracle.com with ESMTP id 2y8udddayy-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 21 Feb 2020 00:22:29 +0000 Received: from pps.filterd (aserp3020.oracle.com [127.0.0.1]) by aserp3020.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 01L0HssC082442; Fri, 21 Feb 2020 00:22:28 GMT Received: from userv0122.oracle.com (userv0122.oracle.com [156.151.31.75]) by aserp3020.oracle.com with ESMTP id 2y8udgkksv-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 21 Feb 2020 00:22:28 +0000 Received: from abhmp0005.oracle.com (abhmp0005.oracle.com [141.146.116.11]) by userv0122.oracle.com (8.14.4/8.14.4) with ESMTP id 01L0MP4K016579; Fri, 21 Feb 2020 00:22:26 GMT Received: from [192.168.1.206] (/71.63.128.209) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Thu, 20 Feb 2020 16:22:25 -0800 Subject: Re: [PATCH] mm/hugetlb: avoid get wrong ptep caused by race To: "Longpeng (Mike)" , Sean Christopherson Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, arei.gonglei@huawei.com, weidong.huang@huawei.com, weifuqiang@huawei.com, kvm@vger.kernel.org, "Kirill A. Shutemov" , Matthew Wilcox References: <1582027825-112728-1-git-send-email-longpeng2@huawei.com> <20200218203717.GE28156@linux.intel.com> <20200219015836.GM28156@linux.intel.com> <098a5dd6-e1da-f161-97d7-cfe735d14fd8@oracle.com> <502b5e52-060b-6864-d1b7-eab2dc951aed@huawei.com> From: Mike Kravetz Message-ID: Date: Thu, 20 Feb 2020 16:22:24 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.4.1 MIME-Version: 1.0 In-Reply-To: <502b5e52-060b-6864-d1b7-eab2dc951aed@huawei.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9537 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 spamscore=0 mlxlogscore=999 phishscore=0 suspectscore=0 mlxscore=0 malwarescore=0 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2001150001 definitions=main-2002210000 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9537 signatures=668685 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 phishscore=0 impostorscore=0 mlxlogscore=999 malwarescore=0 mlxscore=0 suspectscore=0 priorityscore=1501 bulkscore=0 adultscore=0 spamscore=0 lowpriorityscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2001150001 definitions=main-2002210000 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2/19/20 6:30 PM, Longpeng (Mike) wrote: > =E5=9C=A8 2020/2/20 3:33, Mike Kravetz =E5=86=99=E9=81=93: >> + Kirill >> On 2/18/20 5:58 PM, Sean Christopherson wrote: >>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote: >>> The race and the fix make sense. I assumed dereferencing garbage fro= m the >>> huge page was the issue, but I wasn't 100% that was the case, which i= s why >>> I asked about alternative fixes. >>> >>>> We change the code from >>>> if (pud_huge(*pud) || !pud_present(*pud)) >>>> to >>>> if (pud_huge(*pud) >>>> return (pte_t *)pud; >>>> busy loop for 500ms >>>> if (!pud_present(*pud)) >>>> return (pte_t *)pud; >>>> and the panic will be hit quickly. >>>> >>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look = at this >>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing = page tables). >>>> >>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read = entry from >>>> pud twice and the *pud maybe change in a race, so if we only read th= e pud once. >>>> I use READ_ONCE here is just for safe, to prevents the complier misc= hief if >>>> possible. >>> >>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e= .g. >>> convert everything as a follow-up patch (or patches). I'm fairly con= fident >>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't ex= actly >>> bet my life on it. I'd much rather the failing scenario be that KVM = uses >>> a sub-optimal page size as opposed to exploding on a bad pointer. >> >> Longpeng(Mike) asked in another e-mail specifically about making simil= ar >> changes to lookup_address_in_mm(). Replying here as there is more con= text. >> >> I 'think' lookup_address_in_mm is safe from this issue. Why? IIUC, t= he >> problem with the huge_pte_offset routine is that the pud changes from >> pud_none() to pud_huge() in the middle of >> 'if (pud_huge(*pud) || !pud_present(*pud))'. In the case of >> lookup_address_in_mm, we know pud was not pud_none() as it was previou= sly >> checked. I am not aware of any other state transitions which could ca= use >> us trouble. However, I am no expert in this area. Bad copy/paste by me. Longpeng(Mike) was asking about lookup_address_in_= pgd. > So... I need just fix huge_pte_offset in mm/hugetlb.c, right? Let's start with just a fix for huge_pte_offset() as you can easily repro= duce that issue by adding a delay. > Is it possible the pud changes from pud_huge() to pud_none() while anot= her CPU > is walking the pagetable ? I believe it is possible. If we hole punch a hugetlbfs file, we will cle= ar the corresponding pud's. Hence, we can go from pud_huge() to pud_none(). Unless I am missing something, that does imply we could have issues in pl= aces such as lookup_address_in_pgd: pud =3D pud_offset(p4d, address); if (pud_none(*pud)) return NULL; *level =3D PG_LEVEL_1G; if (pud_large(*pud) || !pud_present(*pud)) return (pte_t *)pud; I hope I am wrong, but it seems like pud_none(*pud) could become true aft= er the initial check, and before the (pud_large) check. If so, there could = be a problem (addressing exception) when the code continues and looks up the= pmd. pmd =3D pmd_offset(pud, address); if (pmd_none(*pmd)) return NULL; It has been mentioned before that there are many page table walks like th= is. What am I missing that prevents races like this? Or, have we just been l= ucky? --=20 Mike Kravetz