Date: Mon, 20 Mar 2023 10:41:32 -0400
From: Peter Xu
To: David Hildenbrand
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Nadav Amit,
 Axel Rasmussen, Paul Gofman, Muhammad Usama Anjum, Mike Rapoport,
 Andrea Arcangeli, Andrew Morton
Subject: Re: [PATCH v4 1/2] mm/uffd: UFFD_FEATURE_WP_UNPOPULATED
References: <20230309223711.823547-1-peterx@redhat.com>
 <20230309223711.823547-2-peterx@redhat.com>

On Mon, Mar 20, 2023 at 11:21:13AM +0100, David Hildenbrand wrote:
>
> > (1) With huge page disabled
> > echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> > ./uffd_wp_perf
> > Test DEFAULT: 4
> > Test PRE-READ: 1111453 (pre-fault 1101011)
> > Test MADVISE: 278276 (pre-fault 266378)
>
> Thinking about it, I guess the biggest slowdown here is the "one fake
> pagefault at a time" handling.

I think so, though I assume the idea here is to avoid any faulting.
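
To illustrate what "avoid any faulting" could look like from userspace, here is a minimal sketch of opting in to the feature at UFFDIO_API time instead of pre-faulting. It is not code from the patch: uffd_open_wp() and uffd_try_api() are hypothetical helpers, the fallback #define assumes the feature ends up as bit 13 (not confirmed anywhere in this thread, so prefer the value from the installed linux/userfaultfd.h), and the probe relies on UFFDIO_API failing when it is asked for feature bits the kernel does not know.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: only used when the installed uapi header predates the feature. */
#ifndef UFFD_FEATURE_WP_UNPOPULATED
#define UFFD_FEATURE_WP_UNPOPULATED (1 << 13)
#endif

/* Hypothetical helper: open a userfaultfd and negotiate the given features. */
static int uffd_try_api(unsigned long long features)
{
	struct uffdio_api api;
	int fd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (fd < 0)
		return -1;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;
	api.features = features;
	if (ioctl(fd, UFFDIO_API, &api) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: returns a userfaultfd, or -1 if none could be set up.
 * *wp_unpopulated is 1 when none ptes on anonymous memory will be
 * write-protected too, and 0 when the caller still has to pre-populate
 * (for example with MADV_POPULATE_READ) before write-protecting.
 */
static int uffd_open_wp(int *wp_unpopulated)
{
	int fd;

	assert(wp_unpopulated != NULL);
	*wp_unpopulated = 0;

	/* First try with the new feature bit set in advance. */
	fd = uffd_try_api(UFFD_FEATURE_PAGEFAULT_FLAG_WP |
			  UFFD_FEATURE_WP_UNPOPULATED);
	if (fd >= 0) {
		*wp_unpopulated = 1;
		return fd;
	}

	/* Fall back to a fresh descriptor without any optional features. */
	return uffd_try_api(0);
}
```

A caller would check the flag once after uffd_open_wp() and only take the pre-populate path when it is 0; a fresh descriptor is used for the fallback so the sketch does not depend on whether a failed UFFDIO_API leaves the first one usable.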

>
> > Test WP-UNPOPULATE: 11712
> >
> > (2) With Huge page enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/enabled
> > ./uffd_wp_perf
> > Test DEFAULT: 4
> > Test PRE-READ: 22521 (pre-fault 22348)
> > Test MADVISE: 4909 (pre-fault 4743)
> > Test WP-UNPOPULATE: 14448
> >
> > There'll be a great perf boost for no-thp case, while for thp enabled with
> > extreme case of all-thp-zero WP_UNPOPULATED can be slower than MADVISE, but
> > that's low possibility in reality, also the overhead was not reduced but
> > postponed until a follow up write on any huge zero thp, so potentially it
> > is faster by making the follow up writes slower.
> >
> > [1] https://lore.kernel.org/all/20210401092226.102804-4-andrey.gruzdev@virtuozzo.com/
> > [2] https://lore.kernel.org/all/Y+v2HJ8+3i%2FKzDBu@x1n/
> > [3] https://lore.kernel.org/all/d0eb0a13-16dc-1ac1-653a-78b7273781e3@collabora.com/
> > [4] https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-wp-perf.c
> >
> > Signed-off-by: Peter Xu
> > ---
> >  Documentation/admin-guide/mm/userfaultfd.rst | 17 ++++++
> >  fs/userfaultfd.c                             | 16 ++++++
> >  include/linux/mm_inline.h                    |  6 +++
> >  include/linux/userfaultfd_k.h                | 23 ++++++++
> >  include/uapi/linux/userfaultfd.h             | 10 +++-
> >  mm/memory.c                                  | 56 +++++++++++++++-----
> >  mm/mprotect.c                                | 51 ++++++++++++++----
> >  7 files changed, 154 insertions(+), 25 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
> > index 7dc823b56ca4..c86b56c95ea6 100644
> > --- a/Documentation/admin-guide/mm/userfaultfd.rst
> > +++ b/Documentation/admin-guide/mm/userfaultfd.rst
> > @@ -219,6 +219,23 @@ former will have ``UFFD_PAGEFAULT_FLAG_WP`` set, the latter
> >  you still need to supply a page when ``UFFDIO_REGISTER_MODE_MISSING`` was
> >  used.
> > +Userfaultfd write-protect mode currently behave differently on none ptes
> > +(when e.g. page is missing) over different types of memories.
> > +
> > +For anonymous memory, ``ioctl(UFFDIO_WRITEPROTECT)`` will ignore none ptes
> > +(e.g. when pages are missing and not populated). For file-backed memories
> > +like shmem and hugetlbfs, none ptes will be write protected just like a
> > +present pte. In other words, there will be a userfaultfd write fault
> > +message generated when writting to a missing page on file typed memories,
>
> s/writting/writing/
>
> > +as long as the page range was write-protected before. Such a message will
> > +not be generated on anonymous memories by default.
> > +
> > +If the application wants to be able to write protect none ptes on anonymous
> > +memory, one can pre-populate the memory with e.g. MADV_POPULATE_READ. On
> > +newer kernels, one can also detect the feature UFFD_FEATURE_WP_UNPOPULATED
> > +and set the feature bit in advance to make sure none ptes will also be
> > +write protected even upon anonymous memory.
> > +
>
> [...]
>
> >  /*
> >   * A number of key systems in x86 including ioremap() rely on the assumption
> > @@ -1350,6 +1364,10 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
> >  			      unsigned long addr, pte_t *pte,
> >  			      struct zap_details *details, pte_t pteval)
> >  {
> > +	/* Zap on anonymous always means dropping everything */
> > +	if (vma_is_anonymous(vma))
> > +		return;
> > +
> >  	if (zap_drop_file_uffd_wp(details))
> >  		return;
> > @@ -1456,8 +1474,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >  				continue;
> >  			rss[mm_counter(page)]--;
> >  		} else if (pte_marker_entry_uffd_wp(entry)) {
> > -			/* Only drop the uffd-wp marker if explicitly requested */
> > -			if (!zap_drop_file_uffd_wp(details))
> > +			/*
> > +			 * For anon: always drop the marker; for file: only
> > +			 * drop the marker if explicitly requested.
> > +			 */
>
> So MADV_DONTNEED a pte marker in an anonymous VMA will always remove that
> marker.

Yes.

> Is that the same handling as for MADV_DONTNEED on shmem or on
> fallocate(PUNCHHOLE) on shmem?

Same as PUNCHHOLE for shmem, while DONTNEED for shmem will retain the
marker. Here the idea is we drop the marker if the user wants to drop the
page, no matter what type of memory is underneath.

>
> > +			if (!vma_is_anonymous(vma) &&
> > +			    !zap_drop_file_uffd_wp(details))
> >  				continue;
>
> Maybe it would be nicer to have a zap_drop_uffd_wp_marker(vma, details) and
> have the comment in there. Especially because of the other hunk above.
>
> So zap_drop_file_uffd_wp(details) -> zap_drop_uffd_wp_marker(vma, details)
> and move the anon handling + comment in there.

Yes we can. Actually here I always thought DROP_MARKER is too specific and
the caller will be confused about when to pass it in. After the
introduction of ZAP_FLAG_UNMAP for hugetlb, I think we can also have
another more generic flag ZAP_FLAG_TRUNCATE only set during truncations,
then here the old DROP_MARKER can be replaced by "TRUNCATE | UNMAP".

>
> >
> >  		} else if (is_hwpoison_entry(entry) ||
> >  			   is_swapin_error_entry(entry)) {
> > @@ -3624,6 +3646,14 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> >  	return 0;
> >  }
> > +static vm_fault_t do_pte_missing(struct vm_fault *vmf)
> > +{
> > +	if (vma_is_anonymous(vmf->vma))
> > +		return do_anonymous_page(vmf);
> > +	else
> > +		return do_fault(vmf);
>
> No need for the "else" statement.

I don't see much difference in this specific context, but I'm fine with
dropping it too.

>
> > +}
> > +
> >  /*
> >   * This is actually a page-missing access, but with uffd-wp special pte
> >   * installed. It means this pte was wr-protected before being unmapped.
> > @@ -3634,11 +3664,10 @@ static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
> >  	 * Just in case there're leftover special ptes even after the region
> >  	 * got unregistered - we can simply clear them.
> >  	 */
> > -	if (unlikely(!userfaultfd_wp(vmf->vma) || vma_is_anonymous(vmf->vma)))
> > +	if (unlikely(!userfaultfd_wp(vmf->vma)))
> >  		return pte_marker_clear(vmf);
> > -	/* do_fault() can handle pte markers too like none pte */
> > -	return do_fault(vmf);
> > +	return do_pte_missing(vmf);
> >  }
>
> [...]
>
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 231929f119d9..455f7051098f 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -276,7 +276,15 @@ static long change_pte_range(struct mmu_gather *tlb,
> >  		} else {
> >  			/* It must be an none page, or what else?.. */
> >  			WARN_ON_ONCE(!pte_none(oldpte));
> > -			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
> > +
> > +			/*
> > +			 * Nobody plays with any none ptes besides
> > +			 * userfaultfd when applying the protections.
> > +			 */
> > +			if (likely(!uffd_wp))
> > +				continue;
> > +
> > +			if (userfaultfd_wp_use_markers(vma)) {
> >  				/*
> >  				 * For file-backed mem, we need to be able to
> >  				 * wr-protect a none pte, because even if the
> > @@ -320,23 +328,46 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> >  	return 0;
> >  }
> > -/* Return true if we're uffd wr-protecting file-backed memory, or false */
> > +/*
> > + * Return true if we want to split huge thps in change protection
>
> "huge thps" sounds redundant. "if we want to PTE-map a huge PMD" ?

Sure.

>
> > + * procedure, false otherwise.
>
>
> In general,
>
> Acked-by: David Hildenbrand

Thanks,

-- 
Peter Xu
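
As a rough illustration of the zap_drop_uffd_wp_marker(vma, details) helper suggested in the review above, the following standalone sketch models the intended decision in userspace. Everything prefixed with mock_ or MOCK_ is a hypothetical stand-in rather than a kernel definition, and treating a NULL details pointer as "nothing explicitly requested" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the "drop marker" zap flag. */
#define MOCK_ZAP_FLAG_DROP_MARKER (1u << 0)

/* Hypothetical stand-ins for the vma and zap_details structures. */
struct mock_vma {
	bool anonymous;
};

struct mock_zap_details {
	unsigned int zap_flags;
};

/*
 * Model of the suggested helper: should the uffd-wp marker be dropped
 * when zapping this range?
 */
static bool mock_zap_drop_uffd_wp_marker(const struct mock_vma *vma,
					 const struct mock_zap_details *details)
{
	assert(vma != NULL);

	/* Zap on anonymous always means dropping everything. */
	if (vma->anonymous)
		return true;

	/* File-backed: only drop the marker if explicitly requested. */
	return details && (details->zap_flags & MOCK_ZAP_FLAG_DROP_MARKER);
}
```

With a helper of this shape, both hunks in mm/memory.c could call it directly, and the anon-versus-file comment would live in a single place.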