To: Tiberiu Georgescu
Cc: Jonathan Corbet, "linux-doc@vger.kernel.org", "linux-mm@kvack.org",
 "peter.xu@redhat.com", Ivan Teterevkov, Florian Schmidt,
 "Carl Waldspurger [C]", Jonathan Davies
References: <20210812155843.236919-1-tiberiu.georgescu@nutanix.com>
 <8f7d6856-7bcd-dedf-663b-cd7ef2d0827f@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH] Documentation: update pagemap with SOFT_DIRTY & UFFD_WP shmem issue
Message-ID: <4187d379-759e-0dc5-eff8-c8d356828ae2@redhat.com>
Date: Mon, 23 Aug 2021 10:52:51 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.11.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Sender: owner-linux-mm@kvack.org
Precedence: bulk

On 20.08.21 19:10, Tiberiu Georgescu wrote:
> Hello David,
>
>> On 18 Aug 2021, at 20:14, David Hildenbrand wrote:
>>
>> On 12.08.21 17:58, Tiberiu A Georgescu wrote:
>>> Mentioning the current missing functionality of the pagemap, in case
>>> someone stumbles upon unexpected behaviour.
>>>
>>> Signed-off-by: Tiberiu A Georgescu
>>> Reviewed-by: Ivan Teterevkov
>>> Reviewed-by: Florian Schmidt
>>> Reviewed-by: Carl Waldspurger
>>> Reviewed-by: Jonathan Davies
>>> ---
>>>  Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>> index fb578fbbb76c..627f3832b3a2 100644
>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>> @@ -207,3 +207,9 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>>>  always 12 at most architectures). Since Linux 3.11 their meaning changes
>>>  after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>>>  flags unconditionally.
>>> +
>>> +Note that the page table entries for swappable and non-syncable pages are
>>> +cleared when those pages are zapped or swapped out. This makes information
>>> +about the page disappear from the pagemap. The location of the swapped
>>> +page can still be retrieved from the page cache, but flags like SOFT_DIRTY
>>> +and UFFD_WP are lost irretrievably.
>>
>> UFFD_WP is currently only supported for private anonymous memory, where
>> it should just work (a swap entry with a uffd-wp marker). So can we even
>> end up with UFFD_WP bits on shmem and such? (Peter is up-streaming that
>> right now, but there, I think he's intending to handle it properly
>> without these bits getting lost using pte_markers and such).
>
> If that is the case, I guess we should not end up with UFFD_WP bits on shmem
> ptes yet. Sorry for the confusion.
>
> Great to hear Peter is upstreaming his patch soon. Is it this series[1] you
> mention?
>
> [1]: https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Yes, and that would take care of making the uffd-wp bit persistent.

>>
>> So regarding upstream Linux, your note regarding UFFD_WP should not be
>> applicable, right?
>>
> Right.
>>
>> On a related note: if we start thinking about the pagemap expressing
>> which pages are currently mapped into the page tables ("state of the
>> process page tables"), it mostly all starts making sense. We document
>> this as "to examine the page tables" already.
>>
>> We only get swapped information if there is a swap PTE -- which only
>> makes sense for anonymous pages, because there, the page table holds the
>> state ("single source of truth"). For shmem, we don't have it, because
>> the page cache is the single source of truth.
>>
>> We only get presence information if there is a page mapped into the page
>> tables -- which, for anonymous pages, specifies if there is anything
>> present at all. For shmem we only have it if it's currently mapped into
>> the page table.
>>
>> Losing soft-dirty is a bad side effect of what you describe: just
>> setting a PTE to none and not syncing that state back to some central
>> place where it could be observed even without the PTE at hand.
>>
> Yeah, that seems to be the case because shared memory behaves internally
> as file-backed memory, but logically needs to be swapped to a swap device,
> not to the disk. This turns shmem into an odd hybrid, which does not truly
> adhere to the rules the other categories comply with.
>>
>> Maybe we should document more clearly, especially what to expect for
>> anonymous pages and what to expect for shared memory etc. from the
>> pagemap. Once we figured out which other interfaces we have to deal with
>> shared memory (mincore(), lseek() as we learned), we might want to
>> document that as well, to save people some time when exploring this area.
>
> I agree, as I found out first hand how elusive this information can be.
> Thank you for your comments and discoveries mentioned on Peter's RFC
> thread[4], particularly the usage of mincore(), lseek() and
> proc/pid/map_files in CRIU. I learned a lot from them. We should definitely
> add them as alternatives for parts of the missing information.
>
> Currently, the missing information for shmem is this:
> 1. Difference between is_swap(pte) and is_none(pte).
>     * is_swap(pte) is always false;
>     * is_none(pte) is true when is_swap() should have been;

You can also have is_none(pte) if it should be is_present(pte).

>     * is_present(pte) is fine.

is_present(pte) is always correct when set, but might be wrong when not
set.

> 2. swp_entry(pte)
>     Particularly, swp_type() and swp_offset().
> 3. SOFT_DIRTY_BIT
>     This is not always missing for shmem.
>     Once 4 is written to clear_refs, if the page is dirtied, the bit is
>     fine as long as it is still in memory. If the page is swapped out, the
>     bit is lost. Then, if the page is brought back into memory, the bit is
>     still lost.

There are other cases that don't require swapping I think (THP
splitting). I might be wrong.

>
> For 1, you mentioned how lseek() and madvise() can be used to get this
> information [2], and I proposed a different method with a little help from
> the current pagemap[3]. They have slightly different output and
> applications, so the difference should be taken into consideration.

At this point I am pretty sure that the pagemap is the wrong mechanism
for that. Pagemap never made that promise; it promised to tell you what
the page tables currently look like, not the correct state of the
underlying file.

> For 2, if anyone knows of any way to retrieve the missing information
> cleanly, please let us know.

As raised by Peter as well, there is most likely not a sane use case
that should really rely on this. There might be corner cases (use case
you mentioned), but that doesn't mean that we want to support them from
a Linux ABI POV.

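For reference, the fields behind items 1-3 can be pulled out of a raw
64-bit pagemap entry roughly as sketched below. This is only an
illustration in Python: the bit positions follow
Documentation/admin-guide/mm/pagemap.rst (bit 57 is documented there as
uffd-wp since Linux 5.13), while the function name, the dictionary keys
and the sample value are made up for this example.

```python
# Bit layout of one 64-bit /proc/<pid>/pagemap entry, as documented in
# Documentation/admin-guide/mm/pagemap.rst.
PM_PFRAME_MASK = (1 << 55) - 1   # bits 0-54: PFN, or swap type + offset
PM_SOFT_DIRTY = 1 << 55          # bit 55: pte is soft-dirty
PM_UFFD_WP = 1 << 57             # bit 57: pte is uffd-wp write-protected
PM_FILE = 1 << 61                # bit 61: file-mapped or shared-anon page
PM_SWAP = 1 << 62                # bit 62: page is swapped
PM_PRESENT = 1 << 63             # bit 63: page is present
SWAP_TYPE_BITS = 5               # bits 0-4 hold the swap type if swapped


def decode_pagemap_entry(entry: int) -> dict:
    """Split a raw pagemap entry into its flags and payload (hypothetical helper)."""
    info = {
        "present": bool(entry & PM_PRESENT),
        "swapped": bool(entry & PM_SWAP),
        "file_or_shared": bool(entry & PM_FILE),
        "soft_dirty": bool(entry & PM_SOFT_DIRTY),
        "uffd_wp": bool(entry & PM_UFFD_WP),
    }
    payload = entry & PM_PFRAME_MASK
    if info["swapped"]:
        # Swap entries reuse the PFN field: type in bits 0-4, offset in 5-54.
        info["swap_type"] = payload & ((1 << SWAP_TYPE_BITS) - 1)
        info["swap_offset"] = payload >> SWAP_TYPE_BITS
    elif info["present"]:
        info["pfn"] = payload
    return info


# Made-up sample: a soft-dirty swap entry with type 3 and offset 1234.
sample = PM_SWAP | PM_SOFT_DIRTY | (1234 << SWAP_TYPE_BITS) | 3
decoded = decode_pagemap_entry(sample)
```

An entry with neither the present nor the swap bit set carries no
payload at all, which is the case described above for shmem pages whose
PTE has been cleared.
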
> As for 3, AFAIK, we will need to leverage Peter's special PTE marker
> mechanism and implement it in another patch.

Or come to the conclusion that softdirty+shmem in the current form is
the wrong approach and you actually want to maintain such information in
a central place, from where you can reliably retrieve whether shared
memory has been modified by any user.

pagemap never worked reliably with softdirty/swap/present on shmem, so
it's not a regression. It was always best effort.

-- 
Thanks,

David / dhildenb
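
To illustrate asking the backing file rather than the page tables, here
is a minimal Python sketch of the lseek() approach mentioned in the
thread. It walks a file descriptor with SEEK_DATA/SEEK_HOLE and returns
the ranges that hold data. The helper names and the temporary file used
for the demo are assumptions for illustration; how precisely a given
filesystem (including shmem) reports holes is filesystem-specific, and
one without hole support reports the whole file as data.

```python
import errno
import os
import tempfile
from typing import List, Tuple


def data_ranges(fd: int) -> List[Tuple[int, int]]:
    """Return [start, end) byte ranges that the filesystem reports as data."""
    size = os.fstat(fd).st_size
    ranges = []
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:
                break  # no more data after this offset
            raise
        # There is always an implicit hole at end of file, so this succeeds.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        ranges.append((start, end))
        offset = end
    return ranges


def demo_sparse_file() -> List[Tuple[int, int]]:
    """Create a 2 MiB sparse temp file with 4 KiB of data at the 1 MiB mark."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.ftruncate(fd, 2 << 20)
        os.pwrite(fd, b"x" * 4096, 1 << 20)
        return data_ranges(fd)
```

The same data_ranges() call could be pointed at a descriptor for a
shared memory file (for example one opened from /dev/shm); that usage is
an assumption here and has not been verified against a specific kernel
version.
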