From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sat, 19 Dec 2020 21:20:59 -0500
From: Andrea Arcangeli
To: Nadav Amit
Cc: linux-mm, Peter Xu, lkml, Pavel Emelyanov, Mike Kravetz,
	Mike Rapoport, stable@vger.kernel.org, minchan@kernel.org,
	Andy Lutomirski, yuzhao@google.com, Will Deacon, Peter Zijlstra
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect
References: <20201219043006.2206347-1-namit@vmware.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
User-Agent: Mutt/2.0.3 (2020-12-04)

On Sat, Dec 19, 2020 at 02:06:02PM -0800, Nadav Amit wrote:
> > On Dec 19, 2020, at 1:34 PM, Nadav Amit wrote:
> > 
> > [ cc’ing some more people who have experience with similar problems ]
> > 
> >> On Dec 19, 2020, at 11:15 AM, Andrea Arcangeli wrote:
> >> 
> >> Hello,
> >> 
> >> On Fri, Dec 18, 2020 at 08:30:06PM -0800, Nadav Amit wrote:
> >>> Analyzing this problem indicates that there is a real bug since
> >>> mmap_lock is only taken for read in mwriteprotect_range(). This might
> >> 
> >> Never having to take the mmap_sem for writing, and in turn never
> >> blocking, in order to modify the pagetables is quite an important
> >> feature in uffd that justifies uffd instead of mprotect. It's not the
> >> most important reason to use uffd, but it'd be nice if that guarantee
> >> would remain also for the UFFDIO_WRITEPROTECT API, not only for the
> >> other pgtable manipulations.
> >> 
> >>> Consider the following scenario with 3 CPUs (cpu2 is not shown):
> >>> 
> >>> cpu0                            cpu1
> >>> ----                            ----
> >>> userfaultfd_writeprotect()
> >>> [ write-protecting ]
> >>> mwriteprotect_range()
> >>> mmap_read_lock()
> >>> change_protection()
> >>>   change_protection_range()
> >>>     ...
> >>>     change_pte_range()
> >>>     [ defer TLB flushes]
> >>>                                 userfaultfd_writeprotect()
> >>>                                 mmap_read_lock()
> >>>                                 change_protection()
> >>>                                 [ write-unprotect ]
> >>>                                 ...
> >>>                                 [ unprotect PTE logically ]

Is the uffd selftest failing with upstream or after your kernel
modification that removes the tlb flush from unprotect?

		} else if (uffd_wp_resolve) {
			/*
			 * Leave the write bit to be handled
			 * by PF interrupt handler, then
			 * things like COW could be properly
			 * handled.
			 */
			ptent = pte_clear_uffd_wp(ptent);
		}

Upstream this will still do pages++, and there's a tlb flush before
change_protection can return here, so I'm confused.

> >>>                                 ...
> >>>                                 [ page-fault ]
> >>>                                 ...
> >>>                                 wp_page_copy()
> >>>                                 [ set new writable page in PTE ]
> >> 
> >> Can't we check mm_tlb_flush_pending(vma->vm_mm) if MM_CP_UFFD_WP_ALL
> >> is set and do an explicit (potentially spurious) tlb flush before
> >> write-unprotect?
> > 
> > There is a concrete scenario that I actually encountered and then there
> > is a general problem.
> > 
> > In general, the kernel code assumes that PTEs that are read from the
> > page-tables are coherent across all the TLBs, excluding permission
> > promotion (i.e., the PTE may have higher permissions in the page-tables
> > than those that are cached in the TLBs).
> > 
> > We therefore need to both: (a) protect change_protection_range() from
> > the changes of others who might defer TLB flushes without taking
> > mmap_sem for write (e.g., try_to_unmap_one()); and (b) protect others
> > (e.g., page-fault handlers) from concurrent changes of
> > change_protection().
> > 
> > We have already encountered several similar bugs, and debugging such
> > issues is time consuming and their impact is substantial (memory
> > corruption, security). So I think we should only stick to general
> > solutions.
> > 
> > So perhaps the approach of your proposed solution is feasible, but it
> > would have to be applied all over the place: we will need to add a
> > check for mm_tlb_flush_pending() and conditionally flush the TLB in
> > every case in which PTEs are read and there might be an assumption that
> > the access-permissions reflect what the TLBs hold. This includes
> > page-fault handlers, but also NUMA migration code in
> > change_protection(), softdirty cleanup in clear_refs_write() and maybe
> > others.
> > 
> > [ I have in mind another solution, such as keeping in each page-table a
> > “table-generation” which is the mm-generation at the time of the
> > change, and only flush if “table-generation” == “mm-generation”, but it
> > requires some thought on how to avoid adding new memory barriers. ]
> > 
> > IOW: I think the change that you suggest is insufficient, and a proper
> > solution is too intrusive for “stable”.
> > 
> > As for performance, I can add another patch later to remove the TLB
> > flush that is unnecessarily performed during change_protection_range()
> > that does permission promotion. I know that your concern is about the
> > “protect” case

You may want to check flush_tlb_fix_spurious_fault for other archs
before proceeding with your patch later, assuming it wasn't already
applied.

> > but I cannot think of a good immediate solution that avoids taking
> > mmap_lock for write.
> > 
> > Thoughts?
> 
> On a second thought (i.e., I don’t know what I was thinking), doing so —
> checking mm_tlb_flush_pending() on every PTE read which is potentially

Note the part "if MM_CP_UFFD_WP_ALL is set" and probably just
MM_CP_UFFD_WP.

> dangerous and flushing if needed — can lead to a huge amount of TLB
> flushes and shootdowns as the counter might be elevated for a
> considerable amount of time.
> 
> So this solution seems to me as a no-go.

I don't share your concern. What matters is the PT lock, so it wouldn't
be one flush per pte, but at least an order 9 higher; but let's assume
one flush per pte.

It's either a huge mapping, and then it's likely running without other
tlb flushing in the background (postcopy snapshotting), or it's a
granular protect with distributed shared memory, in which case the
number of changed ptes or huge_pmds tends to be always 1 anyway. So it
doesn't matter if it's deferred.
I agree it may require a larger tlb flush review, not just of mprotect
though, but it didn't sound particularly complex.

Note the UFFDIO_WRITEPROTECT is still relatively recent, so backports
won't risk rejects so heavy as to require a band-aid.

My second thought is, I don't see exactly the bug and it's not clear if
it's upstream reproducing this, but assuming this happens on upstream,
even ignoring everything else happening in the tlb flush code, this
sounds like purely introduced by userfaultfd_writeprotect() vs
userfaultfd_writeprotect() (since it's the only place changing
protection with mmap_sem for reading; note we already unmap and flush
tlb with mmap_sem for reading in MADV_DONTNEED/MADV_FREE, which clears
the dirty bit etc.).

Flushing tlbs with mmap_sem for reading is nothing new; the only new
thing is the flush after wrprotect.

So instead of altering any tlb flush code, would it be possible to just
stick to mmap_lock for reading and then serialize
userfaultfd_writeprotect() against itself with an additional
mm->mmap_wprotect_lock mutex? That'd be a very local change to
userfaultfd too.

Can you look if the rule "mmap_sem for reading plus a new
mm->mmap_wprotect_lock mutex, or the mmap_sem for writing, whenever
wrprotecting ptes" is enough to comply with the current tlb flushing
code, so as not to require any change non-local to uffd (modulo the
additional mutex)?

Thanks,
Andrea