Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
To: Vinayak Menon, Peter Zijlstra, Linus Torvalds
Cc: Andy Lutomirski, Peter Xu, Nadav Amit, Yu Zhao, Andrea Arcangeli, linux-mm, lkml, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, stable, Minchan Kim, Will Deacon, surenb@google.com
References: <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com> <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com> <20201221223041.GL6640@xz-x1> <20210105153727.GK3040@hirez.programming.kicks-ass.net>
From: Laurent Dufour
Message-ID: <0201238b-e716-2a3c-e9ea-d5294ff77525@linux.vnet.ibm.com>
Date: Tue, 12 Jan 2021 16:47:17 +0100

On 12/01/2021 at 12:43, Vinayak Menon wrote:
> On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
>> On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
>>
>>> So I think the basic rule is that "if you hold mmap_sem for writing,
>>> you're always safe". And that really should be considered the
>>> "default" locking.
>>>
>>> ANY time you make a modification to the VM layer, you should basically
>>> always treat it as a write operation, and get the mmap_sem for
>>> writing.
>>>
>>> Yeah, yeah, that's a bit simplified, and it ignores various special
>>> cases (and the hardware page table walkers that obviously take no
>>> locks at all), but if you hold the mmap_sem for writing you won't
>>> really race with anything else - not page faults, and not other
>>> "modify this VM".
>>> To a first approximation, everybody that changes the VM should take
>>> the mmap_sem for writing, and the readers should just be just about
>>> page fault handling (and I count GUP as "page fault handling" too -
>>> it's kind of the same "look up page" rather than "modify vm" kind of
>>> operation).
>>>
>>> And there are just a _lot_ more page faults than there are things that
>>> modify the page tables and the vma's.
>>>
>>> So having that mental model of "lookup of pages in a VM take mmap_sem
>>> for reading, any modification of the VM uses it for writing" makes
>>> sense both from a performance angle and a logical standpoint. It's the
>>> correct model.
>>> And it's worth noting that COW is still "lookup of pages", even though
>>> it might modify the page tables in the process. The same way lookup
>>> can modify the page tables to mark things accessed or dirty.
>>>
>>> So COW is still a lookup operation, in ways that "change the
>>> writability of this range" very much is not. COW is "lookup for
>>> write", and the magic we do to copy to make that write valid is still
>>> all about the lookup of the page.
>> (your other email clarified this point; the COW needs to copy while
>> holding the PTL and we need TLBI under PTL if we're to change this)
>>
>>> Which brings up another mental mistake I saw earlier in this thread:
>>> you should not think "mmap_sem is for vma, and the page table lock is
>>> for the page table changes".
>>>
>>> mmap_sem is the primary lock for any modifications to the VM layout,
>>> whether it be in the vma's or in the page tables.
>>>
>>> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
>>> it is partly because
>>>
>>>   (a) we have things that historically walked the page tables _without_
>>> walking the vma's (notably the virtual memory scanning)
>>>
>>>   (b) we do allow concurrent page faults, so we then need a lower-level
>>> lock to serialize the parallelism we _do_ have.
>> And I'm thinking the speculative page fault series steps right into all
>> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>>
>> Which opens it up to exactly these races explored here.
>>
>> The range lock approach does not suffer this, but I'm still worried
>> about the actual performance of that thing.
>
>
> Some thoughts on why there may not be an issue with speculative page fault.
> Adding Laurent as well.
>
> Possibility of race against other PTE modifiers
>
> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> 2) mprotect - change_protection in mprotect which does the deferred flush is
> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults on those
> VMAs.
> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> But SPF does not take UFFD faults.
> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> (2) above.
> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> I could not see a case where speculative path cannot see a PTE update done via
> a fault on another CPU.
>

Thanks Vinayak,

You explained it well. Indeed, SPF handles deferred TLB invalidation by
marking the VMA through vm_write_begin/end(), as in the fork case you
mentioned. Once the PTL is held and the VMA's seqcount has been checked, the
PTE values read are valid.

Cheers,
Laurent.
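
For readers following the thread, the check described above (mark the VMA
around the modification, then validate the sequence count once the PTL is
held) can be illustrated with a small userspace sketch. This is not the SPF
implementation: every toy_* name is hypothetical, the spinlock only stands in
for the PTL, and the counter only stands in for the VMA seqcount that
vm_write_begin/end() are described as updating.

```c
#include <assert.h>     /* only for callers that assert on the results */
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA; NOT the kernel's vm_area_struct. */
struct toy_vma {
    atomic_uint seq;    /* odd while a writer is modifying the range */
    atomic_flag ptl;    /* toy spinlock standing in for the PTL */
    int pte;            /* a single value standing in for a PTE */
};

static void toy_vma_init(struct toy_vma *v, int pte)
{
    atomic_init(&v->seq, 0);
    atomic_flag_clear(&v->ptl);
    v->pte = pte;
}

static void toy_ptl_lock(struct toy_vma *v)
{
    while (atomic_flag_test_and_set_explicit(&v->ptl, memory_order_acquire))
        ;   /* spin until the lock is free */
}

static void toy_ptl_unlock(struct toy_vma *v)
{
    atomic_flag_clear_explicit(&v->ptl, memory_order_release);
}

/* Writer side: bracket the modification so speculative readers can see it. */
static void toy_write_begin(struct toy_vma *v) { atomic_fetch_add(&v->seq, 1); }
static void toy_write_end(struct toy_vma *v)   { atomic_fetch_add(&v->seq, 1); }

/* Assumed shape of a protection change: PTE update inside the bracket. */
static void toy_change_protection(struct toy_vma *v, int new_pte)
{
    toy_write_begin(v);
    toy_ptl_lock(v);
    v->pte = new_pte;
    toy_ptl_unlock(v);
    /* a deferred TLB flush would happen here, still inside the bracket */
    toy_write_end(v);
}

/* Reader side: snapshot the count; bail out if a writer is in progress. */
static bool toy_spf_begin(struct toy_vma *v, unsigned *snap)
{
    *snap = atomic_load(&v->seq);
    return (*snap & 1u) == 0;
}

/* Take the toy PTL, recheck the count, and only then trust the PTE. */
static bool toy_spf_read_pte(struct toy_vma *v, unsigned snap, int *out)
{
    bool ok;

    toy_ptl_lock(v);
    ok = atomic_load(&v->seq) == snap;
    if (ok)
        *out = v->pte;
    toy_ptl_unlock(v);
    return ok;
}

/* Returns the PTE seen after a retry, or -1 if the sketch misbehaves. */
int toy_demo(void)
{
    struct toy_vma v;
    unsigned snap;
    int pte = 0;

    toy_vma_init(&v, 7);
    if (!toy_spf_begin(&v, &snap))
        return -1;
    toy_change_protection(&v, 9);           /* writer runs in between */
    if (toy_spf_read_pte(&v, snap, &pte))   /* stale snapshot must fail */
        return -1;
    if (!toy_spf_begin(&v, &snap) || !toy_spf_read_pte(&v, snap, &pte))
        return -1;
    return pte;
}
```

In this sketch a reader whose snapshot predates a change simply fails
validation and has to retry or fall back, which mirrors "SPF bails out on
faults on those VMAs" from point (2) above.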