From: Peter Collingbourne
To: Andrew Morton
Cc: Peter Collingbourne, Kostya Kortchinsky, Evgenii Stepanov,
    Andrea Arcangeli, Peter Xu, linux-mm@kvack.org
Date: Thu, 27 May 2021 12:04:53 -0700
Message-Id: <20210527190453.1259020-1-pcc@google.com>
Subject: [PATCH v4] mm: improve mprotect(R|W) efficiency on pages referenced once

In the Scudo memory allocator [1] we would like to be able to detect
use-after-free vulnerabilities involving large allocations by issuing
mprotect(PROT_NONE) on the memory region used for the allocation when it
is deallocated. Later on, after the memory region has been "quarantined"
for a sufficient period of time, we would like to be able to use it for
another allocation by issuing mprotect(PROT_READ|PROT_WRITE).

Before this patch, after the write protection was removed, any write to
the memory region would result in a page fault and a trip through the
copy-on-write code path, even in the usual case where the pages are only
referenced by a single PTE, harming performance unnecessarily. Make it
so that any dirty pages in anonymous mappings that are only referenced
by a single PTE are immediately made writable during the mprotect call,
so that we can avoid the page faults.
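The condition described above boils down to a short sequence of checks. The following userspace sketch mirrors the order of checks in the may_avoid_write_fault() helper added by this patch. It is not kernel code: struct model_pte, struct model_vma, MODEL_MM_CP_DIRTY_ACCT and model_may_avoid_write_fault() are hypothetical stand-ins for pte_t, struct vm_area_struct, MM_CP_DIRTY_ACCT and the kernel's PTE/page accessors, and the page_count field models page_count(pte_page(pte)), which counts every reference to the page, including non-mapping ones such as FOLL_LONGTERM pins.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-ins for the kernel state consulted by
 * may_avoid_write_fault(). None of these types exist in the kernel.
 */
#define MODEL_MM_CP_DIRTY_ACCT (1u << 0) /* models MM_CP_DIRTY_ACCT */

struct model_pte {
	bool dirty;      /* models pte_dirty(pte) */
	bool soft_dirty; /* models pte_soft_dirty(pte) */
	bool uffd_wp;    /* models pte_uffd_wp(pte) */
	int page_count;  /* models page_count(pte_page(pte)) */
};

struct model_vma {
	bool anonymous;    /* models vma_is_anonymous(vma) */
	bool vm_write;     /* models vma->vm_flags & VM_WRITE */
	bool vm_softdirty; /* models vma->vm_flags & VM_SOFTDIRTY */
};

/* Same checks, in the same order, as may_avoid_write_fault() in the patch. */
static bool model_may_avoid_write_fault(const struct model_pte *pte,
					const struct model_vma *vma,
					unsigned int cp_flags)
{
	/* Without dirty accounting, only writable anonymous VMAs qualify... */
	if (!(cp_flags & MODEL_MM_CP_DIRTY_ACCT)) {
		if (!(vma->anonymous && vma->vm_write))
			return false;
		/* ...and only if nothing else holds a reference to the page. */
		if (pte->page_count != 1)
			return false;
	}

	/* Only PTEs already marked dirty are made writable. */
	if (!pte->dirty)
		return false;

	/* Same soft-dirty condition as the code being replaced. */
	if (!pte->soft_dirty && vma->vm_softdirty)
		return false;

	/* Keep userfaultfd write-protect faults visible. */
	if (pte->uffd_wp)
		return false;

	return true;
}
```

For example, a dirty PTE in a writable anonymous VMA with a page_count of 1 yields true, while the same PTE with a page_count of 2 yields false, so the write still goes through the copy-on-write fault.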
This program shows the critical syscall sequence that we intend to use
in the allocator:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (addr == MAP_FAILED)
      return 1;
    for (int i = 0; i != 100000; ++i) {
      // Use the memory, then "quarantine" it and make it usable again.
      memset(addr, i, kSize);
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

The effect of this patch on the above program was measured on a
DragonBoard 845c by taking the median real (wall-clock) execution time
of 10 runs.

  Before: 2.94s
  After:  0.66s

The effect was also measured using one of the microbenchmarks that we
normally use to benchmark the allocator [2], after modifying it to make
the appropriate mprotect calls [3]. With an allocation size of 131072
bytes to trigger the allocator's "large allocation" code path, the
per-iteration time was measured as follows:

  Before: 27450ns
  After:   6010ns

This patch means that we do more work during the mprotect call itself
in exchange for less work when the pages are accessed. In the worst
case, the pages are not accessed at all. The effect of this patch in
such cases was measured using the following program:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (addr == MAP_FAILED)
      return 1;
    // Dirty every page once up front.
    memset(addr, 1, kSize);
    for (int i = 0; i != 100000; ++i) {
  #ifdef PAGE_FAULT
      // Write to one page per iteration, cycling through the mapping.
      memset(addr + (i * 4096) % kSize, i, 4096);
  #endif
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

With PAGE_FAULT undefined (0 pages touched after removing write
protection), the median real execution time of 100 runs was measured as
follows:

  Before: 0.330260s
  After:  0.338836s

With PAGE_FAULT defined (1 page touched), the measurements were as
follows:

  Before: 0.438048s
  After:  0.355661s

So it seems that even with a single page fault the new approach is
faster.
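The claim that a single page fault pays for the extra mprotect work can be checked with a rough break-even calculation on the medians above. This is a sketch of our own, not part of the patch: it assumes the cost difference grows linearly with the number of pages written per iteration, assumes 4096-byte pages (so the 131072-byte mapping has 32 pages), and the struct fault_timings type and breakeven_density() function are hypothetical names.

```c
/*
 * Median timings (seconds) for the PAGE_FAULT benchmark, as reported for
 * one mapping size. Field names are hypothetical.
 */
struct fault_timings {
	double before_untouched; /* old kernel, PAGE_FAULT undefined */
	double after_untouched;  /* new kernel, PAGE_FAULT undefined */
	double before_touched;   /* old kernel, PAGE_FAULT defined */
	double after_touched;    /* new kernel, PAGE_FAULT defined */
};

/*
 * Fraction of the mapping's pages that must be written after each
 * mprotect(PROT_READ|PROT_WRITE) for the new behaviour to break even.
 * Assumes cost grows linearly with the number of pages touched per
 * iteration; the iteration count cancels out, so totals can be used.
 * Returns -1.0 if the measurements show no saving per touched page.
 */
static double breakeven_density(const struct fault_timings *t,
				unsigned int pages_in_mapping)
{
	/* Extra mprotect work done by the new kernel on every iteration. */
	double extra = t->after_untouched - t->before_untouched;
	/* Cost of touching one page: roughly fault + memset (old), memset (new). */
	double touch_old = t->before_touched - t->before_untouched;
	double touch_new = t->after_touched - t->after_untouched;
	/* Time saved per touched page by avoiding the write fault. */
	double saved = touch_old - touch_new;

	if (saved <= 0.0)
		return -1.0;
	/* Touched pages per iteration needed to pay back `extra`, per page. */
	return extra / saved / pages_in_mapping;
}
```

With the 131072-byte figures this returns roughly 0.0029, i.e. about 0.094 written pages per iteration, well below the one page touched in the PAGE_FAULT run. Applied to the 1048576-byte figures reported below (256 pages), the same calculation gives roughly 0.0042, in line with the 0.4% estimate.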
I saw similar results when I adjusted the programs to use a larger
mapping size. With kSize = 1048576 I get these numbers with PAGE_FAULT
undefined:

  Before: 1.428988s
  After:  1.512016s

i.e. around 5.8% slower. And these with PAGE_FAULT defined:

  Before: 1.518559s
  After:  1.524417s

i.e. about the same.

What I think we may conclude from these results is that for smaller
mappings the advantage of the previous approach, although measurable,
is wiped out by a single page fault. I think we may expect that there
should be at least one access resulting in a page fault (under the
previous approach) after making the pages writable, since the program
presumably made the pages writable for a reason.

For larger mappings we may guesstimate that the new approach wins if
the density of future page faults is > 0.4%. But for mappings that are
large enough for density to matter (not just the absolute number of
page faults), it doesn't seem like the increase in mprotect latency
would be very large relative to the total mprotect execution time.

Signed-off-by: Peter Collingbourne
Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
Link: [1] https://source.android.com/devices/tech/debug/scudo
Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2
---
v4:
- check pte_uffd_wp() to ensure that we still see UFFD faults
- check page_count() instead of page_mapcount() to handle non-map
  references (e.g. FOLL_LONGTERM)
- move the check into a separate function

v3:
- check for dirty pages
- refresh the performance numbers

v2:
- improve the commit message

 mm/mprotect.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94188df1ee55..880c90b5744e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,6 +35,29 @@
 
 #include "internal.h"
 
+static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
+				  unsigned long cp_flags)
+{
+	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
+		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
+			return false;
+
+		if (page_count(pte_page(pte)) != 1)
+			return false;
+	}
+
+	if (!pte_dirty(pte))
+		return false;
+
+	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+		return false;
+
+	if (pte_uffd_wp(pte))
+		return false;
+
+	return true;
+}
+
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags)
@@ -43,7 +66,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
-	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -132,11 +154,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			}
 
 			/* Avoid taking write faults for known dirty pages */
-			if (dirty_accountable && pte_dirty(ptent) &&
-					(pte_soft_dirty(ptent) ||
-					 !(vma->vm_flags & VM_SOFTDIRTY))) {
+			if (may_avoid_write_fault(ptent, vma, cp_flags))
 				ptent = pte_mkwrite(ptent);
-			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog