From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C33C9C433DB for ; Mon, 21 Dec 2020 22:55:19 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 40003229CA for ; Mon, 21 Dec 2020 22:55:19 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 40003229CA Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 9ABC96B006E; Mon, 21 Dec 2020 17:55:18 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 934A76B0070; Mon, 21 Dec 2020 17:55:18 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7FC7F6B0071; Mon, 21 Dec 2020 17:55:18 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0041.hostedemail.com [216.40.44.41]) by kanga.kvack.org (Postfix) with ESMTP id 60A096B006E for ; Mon, 21 Dec 2020 17:55:18 -0500 (EST) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 2315B181AF5C4 for ; Mon, 21 Dec 2020 22:55:18 +0000 (UTC) X-FDA: 77618797116.26.chair05_2d0a0c02745b Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin26.hostedemail.com (Postfix) with ESMTP id F3B941804B647 for ; Mon, 21 Dec 2020 22:55:17 +0000 (UTC) X-HE-Tag: chair05_2d0a0c02745b X-Filterd-Recvd-Size: 7521 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) by imf07.hostedemail.com (Postfix) with ESMTP for ; Mon, 21 Dec 2020 22:55:17 +0000 (UTC) Received: by mail-pf1-f169.google.com with SMTP id 11so7315503pfu.4 for ; Mon, 21 Dec 2020 14:55:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=MWSksfPTMswDsSxUE3HtTa45ybxXErt7UyS5LoWsn/c=; b=W9zTbSghxh6v348guyuA8D2WaWyQuM04ndJ+7kg1cjkAtQGFWQMaaWs0rnQ59b9X59 VbJ9Oq13LwCmBEfBvhe6IvXteCX804g1YEVFPI3kHzBWuSGgH5LJ7ltDlOuDWR9adLFO kGLt7I8oSMMd2Y5FrMNYdJZ2hDrJYaYwUPbRGxQiCdSygTOyVxaWUSOl0nBsUbxEIcK7 Dy7Enqx21GcTgWtVM3LzorT4i6jiRXp6x2nt8ZVcmVpQQclUrE8Y+s6lyagMS6gXOwKJ vFKrtHvALQxAV50D2Ks5cKAYtxrI1YMhqLupQy0PhKDMtH+Y9q1XPx0mUkUb9SDuHffg kDXQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=MWSksfPTMswDsSxUE3HtTa45ybxXErt7UyS5LoWsn/c=; b=JtsDlHNHT6AZXmkY3uWCuWgjt8cl3gj+G2fea7Ja/ng9SXUq9xwiz5IGF4zj9vCInl sw8PcCpjeQEgq/XkGBD86Ak5rNP4cAAjgBK29mAwBk8SdNo4/M8VfLPHzgYAeAMSxYce 6837ahyIo+Z0AgFFlcZyEyPGRi372Z5n35W+BLOAAf9Zwrob+t33cI0FFWfLr6n9pyFX Cy0SHZcVXuVitUvj4+5eDdsDD7ZC7HQVcQzjV3qfV/vZeBO6YgSSyt/eQKojIi+fqJ0U ohuVf7xOpz36uXWUkIte+M90Xlrkgf5mzIXxi7CyKwOhHboxxi+zsK87SYErOROr/SUn bSkw== X-Gm-Message-State: AOAM531+dqGSnoPjnAOotAGdeLcWb90vbLv7fwnCMIR8nkqd3Ao7VCXr 4K6euMxRiRVbHCBsQhIMMRg= X-Google-Smtp-Source: ABdhPJz649QHJ7ppWUE4OSOx0WGT1P7riOXSwtwTXLKDsRV5Nb6XAYZ0eENXLGVYxRcaFGtfNAQaeA== X-Received: by 2002:a63:3d8c:: with SMTP id k134mr17310482pga.53.1608591316225; Mon, 21 Dec 2020 14:55:16 -0800 (PST) Received: from ?IPv6:2601:647:4700:9b2:104c:8d35:de28:b8dc? ([2601:647:4700:9b2:104c:8d35:de28:b8dc]) by smtp.gmail.com with ESMTPSA id j14sm16788607pjm.10.2020.12.21.14.55.13 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 21 Dec 2020 14:55:15 -0800 (PST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.120.23.2.4\)) Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect From: Nadav Amit In-Reply-To: <20201221223041.GL6640@xz-x1> Date: Mon, 21 Dec 2020 14:55:12 -0800 Cc: Yu Zhao , Linus Torvalds , Andrea Arcangeli , linux-mm , lkml , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , stable , Minchan Kim , Andy Lutomirski , Will Deacon , Peter Zijlstra Content-Transfer-Encoding: quoted-printable Message-Id: References: <20201221172711.GE6640@xz-x1> <76B4F49B-ED61-47EA-9BE4-7F17A26B610D@gmail.com> <9E301C7C-882A-4E0F-8D6D-1170E792065A@gmail.com> <1FCC8F93-FF29-44D3-A73A-DF943D056680@gmail.com> <20201221223041.GL6640@xz-x1> To: Peter Xu X-Mailer: Apple Mail (2.3608.120.23.2.4) X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: > On Dec 21, 2020, at 2:30 PM, Peter Xu wrote: >=20 > On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote: >> BTW: In general, I think that you are right, and that changing of = PTEs >> should not require taking mmap_lock for write. However, I am not sure >> cow_user_page() is not the only one that poses a problem and whether = a more >> systematic solution is needed. If cow_user_pages() is the only = problem, do >> you think it is possible to do the copying while holding the PTL? It = works >> for normal-pages, but I am not sure whether special-pages pose = special >> problems. >>=20 >> Anyhow, this is an enhancement that we can try later. >=20 > AFAIU mprotect() is the only one who modifies the pte using the mmap = write > lock. NUMA balancing is also using read mmap lock when changing pte > protections, while my understanding is mprotect() used write lock only = because > it manipulates the address space itself (aka. vma layout) rather than = modifying > the ptes, so it needs to. You are correct about NUMA balancing in general. Yet in practice it is = not an issue in our =E2=80=9Cuse-case=E2=80=9D since NUMA balancing = preserves the write-bit. > At the pte level, it seems always to be the pgtable lock that = serializes things. >=20 > So it's perfectly legal to me for e.g. a driver to modify ptes with = the read > lock of mmap_sem, unless I'm severely mistaken.. as long as the = pgtable lock is > taken when doing so. >=20 > If there's a driver that manipulated the ptes, changed the content of = the page, > recover the ptes to origin, and all these happen right after = wp_page_copy() > unlocked the pgtable lock but before wp_page_copy() retakes the same = lock > again, we may face the same issue finding that the page got copied = contains > corrupted data at last. While I don't know what to blame on the = driver either > because it seems to be exactly following the rules. The driver would have to do so without flushing the TLB. Having said = that, the driver could have used inc_tlb_flush_pending() and batch flushes. >=20 > I believe changing into write lock would solve the race here because = tlb > flushing would be guaranteed along the way, but I'm just a bit worried = it's not > the best way to go.. It might be too big of a hammer. But the question that comes to my mind = is, if it is ok to change the PTEs without mmap_lock held for write, why wouldn=E2=80=99t mmap_write_downgrade() be executed before = mprotect_fixup() (so mprotect change of PTE would not be done with the write-lock)? If you = did so, you would have the same problem as the one we encountered = (concurrent protect-unprotect allow concurrent cow-#PF copying the wrong data). So as an alternative solution, I can do copying under the PTL after flushing, which seems to solve the problem. First copying (without a = lock) and then comparing seems to me as suboptimal - I do not think the = benefit (if there is one) of shortening the time in which the lock is taken - = worth the additional compare (and the complexity with special pages). There are 2 problems in doing so: 1. I think that copy_user_highpage() and __copy_from_user_inatomic() can = be called while holding the PTL, but I am not sure. 2. For special pages we would need 2 TLB flushes: one to ensure the write-bit is cleared, and a second one after we clear the PTE. We can limit ourselves to soft-dirty/UFFD VMAs, but if we have your hypothetical driver, this would not be good enough.