From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrea Arcangeli <aarcange@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yu Zhao,
	Andy Lutomirski, Peter Xu, Pavel Emelyanov, Mike Kravetz,
	Mike Rapoport, Minchan Kim, Will Deacon, Peter Zijlstra,
	Linus Torvalds, Hugh Dickins, "Kirill A. Shutemov",
	Matthew Wilcox, Oleg Nesterov, Jann Horn, Kees Cook,
	John Hubbard, Leon Romanovsky, Jason Gunthorpe, Jan Kara,
	Kirill Tkhai
Subject: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending
Date: Thu,  7 Jan 2021 15:04:02 -0500
Message-Id: <20210107200402.31095-3-aarcange@redhat.com>
In-Reply-To: <20210107200402.31095-1-aarcange@redhat.com>
References: <20210107200402.31095-1-aarcange@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Mailing-List: linux-kernel@vger.kernel.org

NOTE: the "Fixes" tag used here is only meant to optimize the
backporting: commit 09854ba94c6aad7886996bfbee2530b3d8a7f4f4 is
completely correct. Despite being correct, it happened to uncover an
implicit assumption some other code made about a very specific
do_wp_page behavior that the commit had to alter.

The page_mapcount is always guaranteed to be 1 for an exclusive anon
page, so when it was used to decide whether an exclusive page under
wrprotection could be reused (as in wp_page_reuse), the outcome always
came out true. The page_mapcount had to be replaced with the
page_count because it cannot account for GUP pins, so after that
change, for the first time, wp_page_copy can be called also for
exclusive anon pages that are undergoing wrprotection.
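
To make the behavior change concrete, here is a simplified sketch (an
editor's illustration, not the kernel's exact do_wp_page logic: the
two function names are invented, and the trylock/swapcache/KSM
handling is omitted):

	/* Sketch: the anon page reuse decision, before and after
	 * 09854ba94c6a (function names invented for illustration). */
	static bool can_reuse_before(struct page *page)
	{
		/* Always 1 for an exclusive anon page: a GUP pin is
		 * invisible here, so the page was always reused. */
		return page_mapcount(page) == 1;
	}

	static bool can_reuse_after(struct page *page)
	{
		/* The refcount also counts GUP pins (and transient
		 * speculative references), so a pinned exclusive anon
		 * page now takes the wp_page_copy() path instead. */
		return page_count(page) == 1;
	}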
Even then, everything is still fine for all cases that wrprotect with
the mmap_write_lock, since the COW faults cannot run concurrently in
such case. However there are two cases that wrprotect exclusive anon
pages while holding only the mmap_read_lock: soft dirty's
clear_refs_write and UFFDIO_WRITEPROTECT. Both of them would benefit
from keeping their wrprotection runtime scalable, by continuing to use
the mmap_read_lock without having to switch to the mmap_write_lock.

To stick to the mmap_read_lock, for both UFFDIO_WRITEPROTECT and
clear_refs_write we need to handle the new reality that there can be
COWs (as in wp_page_copy) happening on exclusive anon pages that are
under wrprotection, but with the respective TLB flush still deferred.

An example of the problematic UFFDIO_WRITEPROTECT runtime is shown
below.

	CPU0					CPU 1		CPU 2
	------					--------	-------
	userfaultfd_wrprotect(mode_wp = true)
	PT lock
	atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
	PT unlock

							do_page_fault FAULT_FLAG_WRITE
	userfaultfd_wrprotect(mode_wp = false)
	PT lock
	ATOMIC clear _PAGE_UFFD_WP <- problem
	/* _PAGE_WRITE not set */
	PT unlock
	XXXXXXXXXXXXXX BUG RACE window open here

							PT lock
							FAULT_FLAG_WRITE is set by CPU
							_PAGE_WRITE is still clear in pte
							PT unlock

							wp_page_copy
							cow_user_page runs with stale TLB

	deferred tlb flush <- too late
	XXXXXXXXXXXXXX BUG RACE window close here

userfaultfd_wrprotect(mode_wp = true) is never a problem because, as
long as the uffd-wp flag is set in the pte/hugepmd, the page fault is
guaranteed to reach a dead end in handle_userfault(). The window in
which uffd-wp is not set while the pte has been wrprotected, with the
TLB flush still pending, opens when userfaultfd_wrprotect(mode_wp =
false) releases the PT lock as shown above, and it closes when the
first deferred TLB flush runs. If do_wp_page->wp_page_copy runs within
such window, some userland writes can get lost in the copy: they end
up in the original page, which gets discarded.

The soft dirty runtime is similar, as shown below:

	CPU0					CPU 1		CPU 2
	------					--------	-------
	instantiate writable TLB
						clear_refs_write
						PT lock
						pte_wrprotect
						PT unlock

								do_page_fault FAULT_FLAG_WRITE
								PT lock
								FAULT_FLAG_WRITE is set by CPU
								_PAGE_WRITE is still clear in pte
								PT unlock

								wp_page_copy
								cow_user_page...
	writes through the TLB
								...cow_user_page

So, to handle this race, a wrprotect_tlb_flush_pending atomic counter
is added to the vma. The counter has to be elevated, while holding the
mmap_read_lock, before taking the PT lock to wrprotect the pagetable,
and it can be decreased only after the deferred TLB flush is complete.
This way the page fault can trivially serialize against pending TLB
flushes using a new helper: sync_wrprotect_tlb_flush_pending().
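
In short, the protocol enforced by the counter looks like this (a
distilled sketch, not a literal excerpt from the patch; only the
inc/dec/sync helper names are the ones actually introduced below):

	/* wrprotect side: runs with only the mmap_read_lock */
	inc_wrprotect_tlb_flush_pending(vma);	/* before the PT lock */
	/* ... PT lock, wrprotect ptes, PT unlock ... */
	flush_tlb_range(vma, start, end);	/* the deferred flush */
	dec_wrprotect_tlb_flush_pending(vma);	/* only after the flush */

	/* COW fault side: runs before copying a wrprotected page */
	sync_wrprotect_tlb_flush_pending(vma, address);
	/* no stale writable TLB entry can modify the page anymore */
	cow_user_page(new_page, old_page, vmf);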
Testing with the userfaultfd selftest is showing 100% reproducible mm
corruption, with writes getting lost, before this commit.

$ ./userfaultfd anon 100 100
nr_pages: 25600, nr_pages_per_cpu: 3200
bounces: 99, mode: rnd racing read, userfaults: 773 missing (215+205+58+114+72+85+18+6), 9011 wp (1714+1714+886+1227+1009+1278+646+537)
[..]
bounces: 72, mode: poll, userfaults: 720 missing (187+148+102+49+92+103+24+15), 9885 wp (1452+1175+1104+1667+1101+1365+913+1108)
bounces: 71, mode: rnd racing ver read, page_nr 25241 memory corruption 6 7

After the commit the userland memory corruption is gone, as expected.

Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
Cc: stable@kernel.org
Reported-by: Nadav Amit
Suggested-by: Yu Zhao
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
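Note for reviewers (placed after the "---" so it stays out of the git
history): the memory ordering contract of the new helper, distilled
into a two-column model. This is an illustration only; "counter"
stands for vma->wrprotect_tlb_flush_pending and the real code is in
the mm/memory.c hunk below.

	wrprotect side (mmap_read_lock)     COW fault side
	-------------------------------     -------------------------------
	atomic_inc(&counter)
	wrprotect ptes under PT lock
	flush_tlb_range(...)
	atomic_dec(&counter)                val = atomic_read(&counter)
	                                    if (val)
	                                        flush_tlb_page(...)
	                                        (acquire via IPI and
	                                         csd_lock_wait)
	                                    else
	                                        smp_rmb()
	                                        (payload reads cannot be
	                                         reordered before the
	                                         counter read)
	                                    cow_user_page(...)

Either the page fault observes the counter still elevated, in which
case it flushes the stale TLB entry itself, or reading zero implies
the wrprotect side's deferred flush completed before the dec, so the
page payload is stable by the time cow_user_page() copies it.
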
 fs/proc/task_mmu.c       | 17 +++++++++-
 include/linux/mm.h       | 46 +++++++++++++++++++++++++++
 include/linux/mm_types.h |  5 +++
 kernel/fork.c            |  1 +
 mm/memory.c              | 69 ++++++++++++++++++++++++++++++++++++++++
 mm/mprotect.c            |  4 +++
 6 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a127262ba517..e75cb135db02 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1235,8 +1235,20 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		}
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
-				if (!(vma->vm_flags & VM_SOFTDIRTY))
+				struct vm_area_struct *tmp = vma;
+				if (!(vma->vm_flags & VM_SOFTDIRTY)) {
+					inc_wrprotect_tlb_flush_pending(vma);
 					continue;
+				}
+
+				/*
+				 * Rollback wrprotect_tlb_flush_pending before
+				 * releasing the mmap_lock.
+				 */
+				for (vma = mm->mmap; vma != tmp;
+				     vma = vma->vm_next)
+					dec_wrprotect_tlb_flush_pending(vma);
+
 				mmap_read_unlock(mm);
 				if (mmap_write_lock_killable(mm)) {
 					count = -EINTR;
@@ -1245,6 +1257,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				vma->vm_flags &= ~VM_SOFTDIRTY;
 				vma_set_page_prot(vma);
+				inc_wrprotect_tlb_flush_pending(vma);
 			}
 			mmap_write_downgrade(mm);
 			break;
@@ -1260,6 +1273,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
 			flush_tlb_mm(mm);
+			for (vma = mm->mmap; vma; vma = vma->vm_next)
+				dec_wrprotect_tlb_flush_pending(vma);
 			dec_tlb_flush_pending(mm);
 		}
 		mmap_read_unlock(mm);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8cd6ae..caa1d9a71cb2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3177,5 +3177,51 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+/*
+ * NOTE: the mmap_lock must be held and it cannot be released at any
+ * time in between inc_wrprotect_tlb_flush_pending() and
+ * dec_wrprotect_tlb_flush_pending().
+ *
+ * This counter has to be elevated before taking the PT lock to
+ * wrprotect pagetables, if the TLB isn't flushed before the
+ * PT unlock.
+ *
+ * The only reader is the page fault, so this has to be elevated (in
+ * addition to the mm->tlb_flush_pending) only when the mmap_read_lock
+ * is taken instead of the mmap_write_lock (otherwise the page fault
+ * couldn't run concurrently in the first place).
+ *
+ * This doesn't need to be elevated when clearing pagetables even if
+ * only holding the mmap_read_lock (as in MADV_DONTNEED). The page
+ * fault doesn't risk accessing the data of the page that is still
+ * under tlb-gather deferred flushing, if the pagetable is none,
+ * because the pagetable doesn't point to it anymore.
+ *
+ * This counter is read more specifically by the page fault when it
+ * has to issue a COW that doesn't result in page re-use, because of
+ * the lack of stability of the page_count (vs speculative pagecache
+ * lookups) or because a GUP pin exists on an exclusive and writable
+ * anon page.
+ *
+ * If this counter is elevated and the pagetable is wrprotected (as
+ * in !pte/pmd_write) and present, it means the page may still be
+ * modified by userland through a stale TLB entry that was
+ * instantiated before the wrprotection. In such case the COW fault,
+ * if it decides not to re-use the page, will have to either wait for
+ * this counter to return zero, or it needs to flush the TLB before
+ * proceeding with copying the page.
+ */
+static inline void inc_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_inc(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) <= 0);
+}
+
+static inline void dec_wrprotect_tlb_flush_pending(struct vm_area_struct *vma)
+{
+	atomic_dec(&vma->wrprotect_tlb_flush_pending);
+	VM_WARN_ON_ONCE(atomic_read(&vma->wrprotect_tlb_flush_pending) < 0);
+}
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 07d9acb5b19c..e3f412c43c30 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -369,6 +369,11 @@ struct vm_area_struct {
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	/*
+	 * When elevated, it indicates that a deferred TLB flush may
+	 * be pending after a pagetable write protection on this vma.
+	 */
+	atomic_t wrprotect_tlb_flush_pending;
 } __randomize_layout;
 
 struct core_thread {
diff --git a/kernel/fork.c b/kernel/fork.c
index 37720a6d04ea..7a658c608f3a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -365,6 +365,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
 		new->vm_next = new->vm_prev = NULL;
+		VM_WARN_ON_ONCE(atomic_read(&new->wrprotect_tlb_flush_pending));
 	}
 	return new;
 }
diff --git a/mm/memory.c b/mm/memory.c
index feff48e1465a..e8e407443119 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2809,6 +2809,73 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
 	count_vm_event(PGREUSE);
 }
 
+/*
+ * This synchronization helper, by the time it returns, has to enforce
+ * that there cannot be stale writable TLB entries left on any page
+ * mapped as wrprotected in the pagetables of this vma range.
+ *
+ * This is normally used only by the COW fault immediately before
+ * copying the page, to make the copy proof against stale TLB entries
+ * (like the one pictured below in CPU 2).
+ *
+ *  CPU 0			CPU 1			CPU 2
+ *  -----			-----			-----
+ *						writable TLB instantiated
+ *		mmap_lock_read
+ *		inc_wrprotect_tlb_flush_pending()
+ *		PT lock
+ *		wrprotect the pte
+ *		PT unlock
+ *
+ *  mmap_lock_read
+ *  PT lock
+ *  vmf->orig_pte = pte
+ *  do_wp_page()
+ *  PT unlock
+ *  wp_page_copy()
+ *  sync_wrprotect_tlb_flush_pending()
+ *  found wrprotect_tlb_flush_pending elevated
+ *  flush_tlb_page()
+ *						writable TLB invalidated [1]
+ *  kret of sync_wrprotect_tlb_flush_pending()
+ *  cow_user_page() [2]
+ *
+ * The whole objective of the wrprotect_tlb_flush_pending atomic
+ * counter is to enforce that [1] happens before [2] in the above
+ * sequence.
+ *
+ * Without having to alter the caller of this helper, it'd also be
+ * possible to replace the flush_tlb_page with a wait for the
+ * wrprotect_tlb_flush_pending counter to return zero, using the same
+ * logic as above. In such case point [1] would be replaced by
+ * dec_wrprotect_tlb_flush_pending() happening in CPU 1.
+ *
+ * In terms of memory ordering guarantees: all page payload reads of a
+ * page mapped by a wrprotected pagetable, executed after this
+ * function returns, must not be allowed to be reordered before the
+ * read of the wrprotect_tlb_flush_pending atomic counter at the start
+ * of the function. So this function has to provide acquire semantics
+ * to those page payload reads.
+ */
+static __always_inline
+void sync_wrprotect_tlb_flush_pending(struct vm_area_struct *vma,
+				      unsigned long address)
+{
+	int val = atomic_read(&vma->wrprotect_tlb_flush_pending);
+
+	if (val) {
+		/*
+		 * flush_tlb_page() needs to deliver acquire semantics
+		 * implicitly. Archs using IPIs to flush remote TLBs
+		 * provide those with csd_lock_wait().
+		 */
+		flush_tlb_page(vma, address);
+	} else {
+		/*
+		 * Prevent the reads of the wrprotected page payload
+		 * from being reordered before the above atomic_read().
+		 */
+		smp_rmb();
+	}
+}
+
 /*
  * Handle the case of a page which we actually need to copy to a new page.
  *
@@ -2849,6 +2916,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	if (!new_page)
 		goto oom;
 
+	sync_wrprotect_tlb_flush_pending(vma, vmf->address);
+
 	if (!cow_user_page(new_page, old_page, vmf)) {
 		/*
 		 * COW failed, if the fault was solved by other,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab709023e9aa..6b7a52662de8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -335,6 +335,8 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	inc_tlb_flush_pending(mm);
+	if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL))
+		inc_wrprotect_tlb_flush_pending(vma);
 	do {
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
@@ -346,6 +348,8 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	/* Only flush the TLB if we actually modified any entries: */
 	if (pages)
 		flush_tlb_range(vma, start, end);
+	if (unlikely(cp_flags & MM_CP_UFFD_WP_ALL))
+		dec_wrprotect_tlb_flush_pending(vma);
 	dec_tlb_flush_pending(mm);
 
 	return pages;
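
For completeness, a minimal userland sketch of the
UFFDIO_WRITEPROTECT toggling whose mode_wp = false side opens the race
window described above. This is an editor's illustration, not part of
the patch: it is single-threaded, so it only demonstrates the ioctl
sequence (the userfaultfd selftest exercises the actual race with
concurrent writers), and error handling is trimmed.

	#include <fcntl.h>
	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		long page_size = sysconf(_SC_PAGESIZE);
		int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		char *area = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		struct uffdio_api api = { .api = UFFD_API };
		ioctl(uffd, UFFDIO_API, &api);

		struct uffdio_register reg = {
			.range = { (unsigned long)area, page_size },
			.mode = UFFDIO_REGISTER_MODE_WP,
		};
		ioctl(uffd, UFFDIO_REGISTER, &reg);

		area[0] = 0;	/* pre-fault: instantiate a writable pte/TLB */

		/* wrprotect: write faults now dead-end in handle_userfault() */
		struct uffdio_writeprotect wp = {
			.range = { (unsigned long)area, page_size },
			.mode = UFFDIO_WRITEPROTECT_MODE_WP,
		};
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

		/*
		 * un-wrprotect: from here until the deferred TLB flush
		 * completes, a write from another thread through a stale
		 * writable TLB entry could be lost in a concurrent
		 * wp_page_copy(), which is the bug this patch fixes.
		 */
		wp.mode = 0;
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

		area[0] = 1;	/* the kind of write the race could lose */
		return 0;
	}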