From: Andy Lutomirski
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
Date: Wed, 23 Dec 2020 17:21:43 -0800
To: Yu Zhao
Cc: Andrea Arcangeli, Andy Lutomirski, Linus Torvalds, Peter Xu,
 Nadav Amit, linux-mm, lkml, Pavel Emelyanov, Mike Kravetz,
 Mike Rapoport, stable, Minchan Kim, Will Deacon, Peter Zijlstra

> On Dec 23, 2020, at 2:29 PM, Yu Zhao wrote:
>
> I was hesitant to suggest the following because it isn't that
> straightforward. But since you seem to be less concerned with the
> complexity, I'll just bring it to the table -- it would take care of
> both ufd and clear_refs_write, wouldn't it?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  		goto unlock;
>  	}
>  	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
> +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +				flush_tlb_page(vmf->vma, vmf->address);
>  			return do_wp_page(vmf);
> +		}

I don't love this as a long-term fix. AFAICT we can have
mm_tlb_flush_pending set for quite a while -- mprotect seems like it
can wait on IO while splitting a huge page, for example. That gives us
a window in which every write fault turns into a TLB flush.

I'm not immediately sure how to do all that much better, though. We
could potentially keep a record of pending ranges that need flushing,
per mm or per PTL, protected by the PTL, and arrange to do the flush
if we notice that flushes are pending when we want to do_wp_page().
At least this would limit us to one extra flush, at least until the
concurrent mprotect (or whatever) makes further progress. The
bookkeeping might be nasty, though.

But x86 already sort of does some of this bookkeeping, and arguably
x86's code could be improved by tracking TLB ranges to flush per mm
instead of per flush request -- Nadav already got us halfway there by
making a little cache of flush_tlb_info structs. IMO it wouldn't be
totally crazy to integrate this better with tlb_gather_mmu to make the
pending flush data visible to other CPUs even before actually kicking
off the flush. In the limit, this starts to look a bit like a fully
async flush mechanism. We would have a function to request a flush,
and that function would return a generation count but not actually
flush anything.
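To make that concrete, here is a minimal sketch of the per-mm state
such an interface might keep. Every name below is made up for
illustration; none of this is existing kernel API:

#include <linux/list.h>
#include <linux/types.h>

/*
 * Hypothetical: one record of a VA range whose TLB flush has been
 * requested but not yet completed.
 */
struct tlb_pending_range {
	struct list_head list;		/* chained off the mm (or PTL)   */
	unsigned long start, end;	/* range that may still be stale */
	u64 gen;			/* flush generation covering it  */
};

/*
 * Hypothetical per-mm flush state, protected by the PTL (or one
 * instance per PTL in the split-PTL case).
 */
struct tlb_flush_state {
	struct list_head pending;	/* list of tlb_pending_range     */
	u64 queued_gen;			/* last generation handed out    */
	u64 flushed_gen;		/* last generation fully flushed */
};

A wp fault that finds, under the PTL, a pending range covering its
address would flush up to that range's generation before calling
do_wp_page(); faults elsewhere would proceed untouched.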
The case of flushing a range adjacent to a still-pending range would
be explicitly optimized. Then another function would actually initiate
and wait for the flush to complete. And we could, while holding the
PTL, scan the list of pending flushes, if any, to see if the PTE we're
looking at has a flush pending. This is sort of easy in the
one-PTL-per-mm case but potentially rather complicated in the
split-PTL case. And I'm genuinely unsure where the "batch" TLB flush
interface fits in, because it can create a batch that spans more than
one mm. x86 can deal with this moderately efficiently since we limit
the number of live mms per CPU and our flushes are (for now?) per cpu,
not per mm.

The PTE-modifying path would do something like:

u64 gen = 0;

for (...)
	gen = queue_flush(mm, start, end, freed_tables);

flush_to_gen(mm, gen);

and the wp fault path does:

wait_for_pending_flushes(mm, address);

Other than the possibility of devolving to one flush per operation if
one thread is page faulting at the same speed that another thread is
modifying PTEs, this should be decently performant.
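For what it's worth, here is a self-contained user-space model of that
interface, just to pin down the intended semantics. A pthread mutex
stands in for the PTL, printf stands in for a real shootdown, and a
single coalesced range stands in for proper range tracking; none of
these functions exist in the kernel:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct mm_model {
	pthread_mutex_t ptl;		/* stands in for the page table lock */
	uint64_t queued_gen;		/* last flush generation handed out  */
	uint64_t flushed_gen;		/* last generation actually flushed  */
	unsigned long start, end;	/* union of still-pending ranges     */
};

/* Record a range as needing a flush; return the generation covering it. */
static uint64_t queue_flush(struct mm_model *mm, unsigned long start,
			    unsigned long end, bool freed_tables)
{
	uint64_t gen;

	(void)freed_tables;	/* a real version would track this too */
	pthread_mutex_lock(&mm->ptl);
	if (mm->queued_gen == mm->flushed_gen) {
		/* nothing pending: start a fresh range */
		mm->start = start;
		mm->end = end;
	} else {
		/* adjacent/overlapping requests just widen the union */
		if (start < mm->start)
			mm->start = start;
		if (end > mm->end)
			mm->end = end;
	}
	gen = ++mm->queued_gen;
	pthread_mutex_unlock(&mm->ptl);
	return gen;
}

/* Initiate and wait for the flush covering all generations up to @gen. */
static void flush_to_gen(struct mm_model *mm, uint64_t gen)
{
	pthread_mutex_lock(&mm->ptl);
	if (mm->flushed_gen < gen) {
		printf("flush [%#lx, %#lx)\n", mm->start, mm->end);
		mm->flushed_gen = mm->queued_gen;
	}
	pthread_mutex_unlock(&mm->ptl);
}

/* The wp fault path: flush only if @addr might still be stale somewhere. */
static void wait_for_pending_flushes(struct mm_model *mm, unsigned long addr)
{
	pthread_mutex_lock(&mm->ptl);
	if (mm->flushed_gen < mm->queued_gen &&
	    addr >= mm->start && addr < mm->end) {
		printf("wp fault at %#lx forces flush\n", addr);
		mm->flushed_gen = mm->queued_gen;
	}
	pthread_mutex_unlock(&mm->ptl);
}

int main(void)
{
	struct mm_model mm = { .ptl = PTHREAD_MUTEX_INITIALIZER };
	uint64_t gen;

	gen = queue_flush(&mm, 0x1000, 0x3000, false);
	gen = queue_flush(&mm, 0x3000, 0x5000, false);	/* widens the range */
	wait_for_pending_flushes(&mm, 0x2000);	/* inside pending range */
	flush_to_gen(&mm, gen);			/* already flushed: no-op */
	return 0;
}

In this toy model, the second queue_flush() is the adjacent-range case
mentioned above: it widens the pending range rather than forcing a
separate flush.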