Date: Wed, 23 Dec 2020 22:31:59 -0500
From: Andrea Arcangeli
To: Andy Lutomirski
Cc: Yu Zhao, Andy Lutomirski, Linus Torvalds, Peter Xu, Nadav Amit,
 linux-mm, lkml, Pavel Emelyanov, Mike Kravetz, Mike Rapoport, stable,
 Minchan Kim, Will Deacon, Peter Zijlstra
Subject: Re: [PATCH] mm/userfaultfd: fix memory corruption due to writeprotect

On Wed, Dec 23, 2020 at 09:00:26PM -0500, Andrea Arcangeli wrote:
> One other near zero cost improvement easy to add if this would be "if
> (vma->vm_flags & (VM_SOFTDIRTY|VM_UFFD_WP))" and it could be made

The next worry then is: if UFFDIO_WRITEPROTECT covers a very large
range, there would be a flood of wrprotect faults, and they'd all end
up issuing a TLB flush during the UFFDIO_WRITEPROTECT itself, which
again is a performance concern for certain uffd-wp use cases.

Those use cases would be, for example, redis and postcopy live
snapshotting, using uffd-wp for async snapshots; unprivileged too in
the case of redis, if it temporarily uses bounce buffers for the
syscall I/O for the duration of the snapshot. Hypervisor tuned
profiles need to manually lift unprivileged_userfaultfd to 1, unless
their jailer leaves one capability in the snapshot thread.
Moving the check after userfaultfd_pte_wp would solve
userfaultfd_writeprotect(mode_wp=true), but that still wouldn't avoid
a flood of TLB flushes during userfaultfd_writeprotect(mode_wp=false),
because change_protection doesn't restore the pte_write:

		} else if (uffd_wp_resolve) {
			/*
			 * Leave the write bit to be handled
			 * by PF interrupt handler, then
			 * things like COW could be properly
			 * handled.
			 */
			ptent = pte_clear_uffd_wp(ptent);
		}

When the snapshot is complete, userfaultfd_writeprotect(mode_wp=false)
would need to run again on the whole range, which can be very big
again.

Orthogonally, I think we should also look into restoring the pte_write
above in uffd-wp; it would then get yet an extra boost compared to
current redis snapshotting with fork(), which cannot restore the
pte_write bits after the snapshot child quits and so forces a flood of
spurious wrprotect faults (uffd-wp can solve that too).

However, even if uffd-wp restored the pte_write, things would remain
suboptimal for a terabyte process under clear_refs, since softdirty
wrprotect faults that start happening while clear_refs is still
running on the mm won't be caught by userfaultfd_pte_wp.

Something like below, if cleaned up, abstracted properly and
documented well in the two places involved, will have a better chance
to perform optimally for softdirty too.

And on a side note: the CONFIG_MEM_SOFT_DIRTY compile-time check is
compulsory, because VM_SOFTDIRTY is defined to zero if softdirty is
not built in. (For VM_UFFD_WP the CONFIG_HAVE_ARCH_USERFAULTFD_WP
check could be removed, and it wouldn't make any measurable difference
even with USERFAULTFD=n.)

RFC untested below. It's supposed to fix the softdirty testcase too,
even without the incremental fix, since clear_refs already does
tlb_gather_mmu before walk_page_range and tlb_finish_mmu after it, and
that appears enough to provide the inc/dec_tlb_flush_pending pairing.
diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..66fd6d070c47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2844,11 +2844,26 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 	} else {
+		bool in_uffd_wp, in_softdirty;
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 				vmf->address);
 		if (!new_page)
 			goto oom;
+#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+		in_uffd_wp = !!(vma->vm_flags & VM_UFFD_WP);
+#else
+		in_uffd_wp = false;
+#endif
+#ifdef CONFIG_MEM_SOFT_DIRTY
+		in_softdirty = !(vma->vm_flags & VM_SOFTDIRTY);
+#else
+		in_softdirty = false;
+#endif
+		if ((in_uffd_wp || in_softdirty) &&
+		    mm_tlb_flush_pending(mm))
+			flush_tlb_page(vma, vmf->address);
+
 		if (!cow_user_page(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,