Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
From: Nadav Amit
Date: Sun, 20 Dec 2020 19:33:09 -0800
To: Yu Zhao
Cc: Andrea Arcangeli, linux-mm, Peter Xu, lkml, Pavel Emelyanov,
    Mike Kravetz, Mike Rapoport, stable@vger.kernel.org, minchan@kernel.org,
    Andy Lutomirski, Will Deacon, Peter Zijlstra
Message-Id: <3680387D-65F1-4078-A19D-F77DE8544B96@gmail.com>
References: <20201219043006.2206347-1-namit@vmware.com>

> On Dec 20, 2020, at 1:54 AM, Yu Zhao wrote:
>
> On Sun, Dec 20, 2020 at 12:06:38AM -0800, Nadav Amit wrote:
>>> On Dec 19, 2020, at 10:05 PM, Yu Zhao wrote:
>>>
>>> On Sat, Dec 19, 2020 at 01:34:29PM -0800, Nadav Amit wrote:
>>>> [ cc’ing some more people who have experience with similar problems ]
>>>>
>>>>> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
>>>>>> Analyzing this problem indicates that there is a real bug since
>>>>>> mmap_lock is only taken for read in mwriteprotect_range(). This might
>>>>>
>>>>> Never having to take the mmap_sem for writing, and in turn never
>>>>> blocking, in order to modify the pagetables is quite an important
>>>>> feature in uffd that justifies uffd instead of mprotect.
>>>>> It's not the most important reason to use uffd, but it'd be nice if
>>>>> that guarantee would remain also for the UFFDIO_WRITEPROTECT API, not
>>>>> only for the other pgtable manipulations.
>>>>>
>>>>>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
>>>>>>
>>>>>> cpu0                            cpu1
>>>>>> ----                            ----
>>>>>> userfaultfd_writeprotect()
>>>>>> [ write-protecting ]
>>>>>> mwriteprotect_range()
>>>>>> mmap_read_lock()
>>>>>> change_protection()
>>>>>> change_protection_range()
>>>>>> ...
>>>>>> change_pte_range()
>>>>>> [ defer TLB flushes]
>>>>>>                                 userfaultfd_writeprotect()
>>>>>>                                 mmap_read_lock()
>>>>>>                                 change_protection()
>>>>>>                                 [ write-unprotect ]
>>>>>>                                 ...
>>>>>>                                 [ unprotect PTE logically ]
>>>>>>                                 ...
>>>>>>                                 [ page-fault]
>>>>>>                                 ...
>>>>>>                                 wp_page_copy()
>>>>>>                                 [ set new writable page in PTE]
>>>
>>> I don't see any problem in this example -- wp_page_copy() calls
>>> ptep_clear_flush_notify(), which should take care of the stale entry
>>> left by cpu0.
>>>
>>> That being said, I suspect the memory corruption you observed is
>>> related to this example, with cpu1 running something else that flushes
>>> conditionally depending on pte_write().
>>>
>>> Do you know which type of pages were corrupted? file, anon, etc.
>>
>> First, Yu, you are correct. My analysis was incorrect, but let me have
>> another try (below). To answer your (and Andrea’s) question - this happens
>> with upstream without any changes, excluding a small fix to the selftest,
>> since it failed (got stuck) due to missing wake events. [1]
>>
>> We are talking about anon memory.
>>
>> So to correct myself, I think that what I really encountered was actually
>> during MM_CP_UFFD_WP_RESOLVE (i.e., when the protection is removed). The
>> problem was that in this case the “write”-bit was removed during unprotect.
>
> Thanks. You are right about when the problem happens: UFD
> write-UNprotecting.
> But it's not UFD write-UNprotecting that removes the
> writable bit -- the bit can only be removed during COW or UFD
> write-protecting. So your original example was almost correct, except
> the last line describing cpu1.

The scenario is a bit confusing, so stay with me. The idea behind uffd
unprotect is indeed only to mark the PTE logically as uffd-unprotected, and
not to *set* the writable bit, allowing the #PF handler to do COW or
whatever else is needed upon #PF.

However, the problem we have is that if a page is already writable,
write-unprotect *clears* the writable bit, making it write-protected (at
least for anonymous pages). This is not only bad for performance but also a
correctness issue, as I pointed out.

In some more detail: mwriteprotect_range() uses vm_get_page_prot() to
compute the new protection. For anonymous private memory, at least on x86,
this means the write-bit in the protection is clear. So later,
change_pte_range() *clears* the write-bit during *unprotection*.

That’s the reason the second part of my patch - the change to
preserve_write - fixes the problem.

> The problem is how do_wp_page() handles non-COW pages. (For COW pages,
> do_wp_page() works correctly by either reusing an existing page or
> make a new copy out of it.) In UFD case, the existing page may not
> have been properly write-protected. As you pointed out, the tlb flush
> may not be done yet. Making a copy can potentially race with the
> writer on cpu2.

Just to clarify the difference: you are describing a UFFD write-protect
scenario, while I am pretty sure the problem I encountered happens during
write-unprotect.

I am not sure we are on the same page (but we may be). The problem I have is
with cow_user_page(), which is called by do_wp_page() before any TLB flush
has taken place (either by change_protection_range() or by do_wp_page(),
which does flush, but only after the copy).

Let me know if you have a different scenario in mind.
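As a rough illustration of the write-bit behaviour described above, here is
a toy user-space model in C. It is not kernel code: the TOY_* flags and the
toy_* helpers are hypothetical names, the bit layout is invented, and the
only assumption taken from the discussion is that the protection computed
for private anonymous memory has the write bit clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy PTE; NOT the real x86 layout. */
#define TOY_PRESENT   (1u << 0)
#define TOY_WRITE     (1u << 1)
#define TOY_UFFD_WP   (1u << 2)
#define TOY_PROT_MASK (TOY_PRESENT | TOY_WRITE)

typedef uint32_t toy_pte_t;

/*
 * Toy analogue of vm_get_page_prot() for private anonymous memory:
 * assumed to return a protection with the write bit clear.
 */
static toy_pte_t toy_anon_private_prot(void)
{
	return TOY_PRESENT;
}

/*
 * Toy analogue of the PTE update when uffd write-protection is resolved.
 * The protection bits are replaced by newprot and the uffd-wp marker is
 * cleared. With preserve_write set, a PTE that was writable stays writable.
 */
static toy_pte_t toy_uffd_wp_resolve(toy_pte_t old, toy_pte_t newprot,
				     bool preserve_write)
{
	bool was_writable = (old & TOY_WRITE) != 0;
	toy_pte_t pte = (old & ~TOY_PROT_MASK) | newprot;

	pte &= ~TOY_UFFD_WP;
	if (preserve_write && was_writable)
		pte |= TOY_WRITE;
	return pte;
}

/* Small self-check that exercises both variants. */
static void toy_resolve_demo(void)
{
	toy_pte_t writable = TOY_PRESENT | TOY_WRITE;
	toy_pte_t prot = toy_anon_private_prot();

	/* Without preserve_write, unprotect silently drops the write bit. */
	assert((toy_uffd_wp_resolve(writable, prot, false) & TOY_WRITE) == 0);
	/* With preserve_write, an already-writable PTE stays writable. */
	assert((toy_uffd_wp_resolve(writable, prot, true) & TOY_WRITE) != 0);
}
```

Calling toy_resolve_demo() from a test harness checks both cases. In this
model, a PTE that was not writable to begin with stays non-writable either
way, which matches the intent of leaving the decision to the fault handler.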
> Should we fix the problem by ensuring integrity of the copy? IMO, no,
> because do_wp_page() shouldn't copy at all in this case. It seems it
> was recently broken by
>
>   be068f29034f mm: fix misplaced unlock_page in do_wp_page()
>   09854ba94c6a mm: do_wp_page() simplification
>
> I haven't studied them carefully. But if you could just revert them and
> run the test again, we'd know where exactly to look at next.

These patches concern the wp_page_reuse() case, which makes me think we
are not on the same page. I do not see a problem with wp_page_reuse(), since
it does not make a copy of the page. If you can explain what I am missing,
it would be great.
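
The distinction between the copy path and the reuse path can be sketched
with a deterministic toy model in user-space C. It is not kernel code:
struct toy_page and toy_write_fault() are hypothetical names, and the model
simply assumes that a store through a stale writable TLB entry lands after
the copy but before the TLB flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: a single int stands in for the page contents. */
struct toy_page {
	int data;
};

/*
 * Toy write-fault on a page while another CPU still holds a stale,
 * writable TLB entry for it. The racing store is modelled as landing
 * after the copy but before the TLB flush.
 * Returns the value visible through the PTE once the fault completes.
 */
static int toy_write_fault(struct toy_page *old, struct toy_page *copy,
			   bool reuse, int racing_store)
{
	if (reuse) {
		/* wp_page_reuse() analogue: the PTE keeps pointing at old. */
		old->data = racing_store;
		return old->data;
	}

	copy->data = old->data;   /* cow_user_page() analogue */
	old->data = racing_store; /* store via the stale TLB entry */
	/* The PTE now points at copy; the flush comes only after the copy. */
	return copy->data;
}

/* Small self-check that contrasts the two paths. */
static void toy_fault_demo(void)
{
	struct toy_page old = { .data = 1 };
	struct toy_page copy = { .data = 0 };

	/* Copy path: the racing store of 2 never reaches the new page. */
	assert(toy_write_fault(&old, &copy, false, 2) == 1);

	old.data = 1;
	/* Reuse path: no copy is made, so the racing store stays visible. */
	assert(toy_write_fault(&old, &copy, true, 2) == 2);
}
```

In this model the lost update appears only when a copy is made before the
flush; the reuse path has no second page for the store to miss.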