linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
Cc: Jonathan Corbet <corbet@lwn.net>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"peter.xu@redhat.com" <peter.xu@redhat.com>,
	Ivan Teterevkov <ivan.teterevkov@nutanix.com>,
	Florian Schmidt <flosch@nutanix.com>,
	"Carl Waldspurger [C]" <carl.waldspurger@nutanix.com>,
	Jonathan Davies <jond@nutanix.com>
Subject: Re: [PATCH] Documentation: update pagemap with SOFT_DIRTY & UFFD_WP shmem issue
Date: Mon, 23 Aug 2021 10:52:51 +0200	[thread overview]
Message-ID: <4187d379-759e-0dc5-eff8-c8d356828ae2@redhat.com> (raw)
In-Reply-To: <F04C4283-0D25-4D0E-B3A8-05B36ACFF30D@nutanix.com>

On 20.08.21 19:10, Tiberiu Georgescu wrote:
> Hello David,
> 
>> On 18 Aug 2021, at 20:14, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 12.08.21 17:58, Tiberiu A Georgescu wrote:
>>> Mentioning the current missing functionality of the pagemap, in case
>>> someone stumbles upon unexpected behaviour.
>>> Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
>>> Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
>>> Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
>>> Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
>>> Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
>>> ---
>>>   Documentation/admin-guide/mm/pagemap.rst | 6 ++++++
>>>   1 file changed, 6 insertions(+)
>>> diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
>>> index fb578fbbb76c..627f3832b3a2 100644
>>> --- a/Documentation/admin-guide/mm/pagemap.rst
>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
>>> @@ -207,3 +207,9 @@ Before Linux 3.11 pagemap bits 55-60 were used for "page-shift" (which is
>>>   always 12 at most architectures). Since Linux 3.11 their meaning changes
>>>   after first clear of soft-dirty bits. Since Linux 4.2 they are used for
>>>   flags unconditionally.
>>> +
>>> +Note that the page table entries for swappable and non-syncable pages are
>>> +cleared when those pages are zapped or swapped out. This makes information
>>> +about the page disappear from the pagemap.  The location of the swapped
>>> +page can still be retrieved from the page cache, but flags like SOFT_DIRTY
>>> +and UFFD_WP are lost irretrievably.
>>
>> UFFD_WP is currently only supported for private anonymous memory, where it should just work (a swap entry with a uffd-wp marker). So can we even end up with UFFD_WP bits on shmem and such? (Peter is up-streaming that right now, but there, I think he's intending to handle it properly without these bits getting lost using pte_markers and such).
> 
> If that is the case, I guess we should not end up with UFFD_WP bits on shmem
> ptes yet. Sorry for the confusion.
> 
> Great to hear Peter is upstreaming his patch soon. Is it this series[1] you
> mention?
> 
> [1]: https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Yes, and that would take care of making the uffd-wp bit persistent.

>>
>> So regarding upstream Linux, your note regarding UFFD_WP should not be applicable, right?
>>
> Right.
>>
>> On a related note: if we start thinking about the pagemap expressing which pages are currently mapped into the page tables ("state of the process page tables") mostly all starts making sense. We document this as "to examine the page tables" already.
>>
>> We only get swapped information if there is a swap PTE -- which only makes sense for anonymous pages, because there, the page table holds the state ("single source of truth"). For shmem, we don't have it, because the page cache is the single source of truth.
>>
>> We only get presence information if there is a page mapped into the page tables -- which, for anonymous pages, specifies if there is anything present at all. For shmem we only have it if it's currently mapped into the page table.
>>
>> Losing softdirt is a bad side effect of, what you describe, just setting a PTE to none and not syncing back that state back to some central place where it could be observed even without the PTE at hand.
>>
> Yeah, that seems to be the case because shared memory behaves internally
> as file-backed memory, but logically needs to be swapped to a swap device, not
> to the disk. This turns shmem into an odd hybrid, which does not truly adhere to
> the rules the other categories comply.
>>
>> Maybe we should document more clearly, especially what to expect for anonymous pages and what to expect for shared memory etc from the pagemap. Once we figured out which other interfaces we have to deal with shared memory (minore(), lseek() as we learned), we might want to document that as well, to safe people some time when exploring this area.
> 
> I agree, as I found out first hand how eluding this information can be.
> Thank you for your comments and discoveries mentioned on Peter's RFC thread[4], particularly the usage of mincore(), lseek() and proc/pid/map_files in
> CRIU. I learned a lot from them. We should definitely add them as alternatives for
> parts of the missing information.
> 
> Currently, the missing information for shmem is this:
> 1. Difference between is_swap(pte) and is_none(pte).
>      * is_swap(pte) is always false;
>      * is_none(pte) is true when is_swap() should have been;

You can also have is_none(pte) if it should be is_present(pte).

>      * is_present(pte) is fine.

is_present(pte) is always correct when set, but might be wrong when not set.

> 2. swp_entry(pte)
>      Particularly, swp_type() and swp_offset().
> 3. SOFT_DIRTY_BIT
>      This is not always missing for shmem.
>      Once 4 is written to clear_refs, if the page is dirtied, the bit is fine as long as it
>      is still in memory. If the page is swapped out, the bit is lost. Then, if the page is
>      brought back into memory, the bit is still lost.

There are other cases that don't require swapping I think (THP 
splitting). I might be wrong.

> 
> For 1, you mentioned how lseek() and madvise() can be used to get this
> information [2], and I proposed a different method with a little help from
> the current pagemap[3]. They have slightly different output and applications, so
> the difference should be taken into consideration.

At this point I am pretty sure that the pagemap is the wrong mechanism 
for that. Pagemap never made that promise; it promised to tell you how 
the page tables currently look like, not the correct state of the 
underlying file.

> For 2, if anyone knows of any way of retrieve the missing information cleanly,
> please let us know.

As raised by Peter as well, there is much likely not a sane use case 
that should really rely on this. There might be corner cases (use case 
you mentioned), but that doesn't mean that we want to support them from 
a Linux ABI POV.

> As for 3, AFAIK, we will need to leverage Peter's special PTE marker mechanism
> and implement it in another patch.

Or come to the conclusion that softdirty+shmem in the current form is 
the wrong approach and you actually want to maintain such information in 
central place, from where you can retrieve reliably if shared memory has 
been modified by any user.

pagemap never worked reliably with softdirty/swap/present on shmem, so 
it's not a regression. It was always best effort.

-- 
Thanks,

David / dhildenb



  parent reply	other threads:[~2021-08-23  8:52 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-12 15:58 [PATCH] Documentation: update pagemap with SOFT_DIRTY & UFFD_WP shmem issue Tiberiu A Georgescu
2021-08-18 19:14 ` David Hildenbrand
2021-08-20 17:10   ` Tiberiu Georgescu
2021-08-20 20:25     ` Peter Xu
2021-08-23  8:40       ` David Hildenbrand
2021-08-23  8:52     ` David Hildenbrand [this message]
2021-08-25 15:48       ` Tiberiu Georgescu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4187d379-759e-0dc5-eff8-c8d356828ae2@redhat.com \
    --to=david@redhat.com \
    --cc=carl.waldspurger@nutanix.com \
    --cc=corbet@lwn.net \
    --cc=flosch@nutanix.com \
    --cc=ivan.teterevkov@nutanix.com \
    --cc=jond@nutanix.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peter.xu@redhat.com \
    --cc=tiberiu.georgescu@nutanix.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).