linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>,
	akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
	christian.brauner@ubuntu.com, ebiederm@xmission.com,
	adobriyan@gmail.com, songmuchun@bytedance.com, axboe@kernel.dk,
	vincenzo.frascino@arm.com, catalin.marinas@arm.com,
	peterz@infradead.org, chinwen.chang@mediatek.com,
	linmiaohe@huawei.com, jannh@google.com, apopple@nvidia.com,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, ivan.teterevkov@nutanix.com,
	florian.schmidt@nutanix.com, carl.waldspurger@nutanix.com,
	jonathan.davies@nutanix.com
Subject: Re: [PATCH 0/1] pagemap: swap location for shared pages
Date: Wed, 11 Aug 2021 18:15:37 +0200
Message-ID: <0beb1386-d670-aab1-6291-5c3cb0d661e0@redhat.com>
In-Reply-To: <YQrn33pOlpdl662i@t490s>

On 04.08.21 21:17, Peter Xu wrote:
> On Wed, Aug 04, 2021 at 08:49:14PM +0200, David Hildenbrand wrote:
>> TBH, I tend to really dislike the PTE marker idea. IMHO, we shouldn't store
>> any state information regarding shared memory in per-process page tables: it
>> just doesn't make too much sense.
>>
>> And this is similar to SOFTDIRTY or UFFD_WP bits: this information actually
>> belongs to the shared file ("did *someone* write to this page", "is
>> *someone* interested into changes to that page", "is there something"). I
>> know, that screams for a completely different design in respect to these
>> features.
>>
>> I guess we start learning the hard way that shared memory is just different
>> and requires different interfaces than per-process page table interfaces we
>> have (pagemap, userfaultfd).
>>
>> I didn't have time to explore any alternatives yet, but I wonder if tracking
>> such stuff per an actual fd/memfd and not via process page tables is
>> actually the right and clean approach. There are certainly many issues to
>> solve, but conceptually to me it feels more natural to have these shared
>> memory features not mangled into process page tables.
> 
> Yes, we can explore all the possibilities, I'm totally fine with it.
> 
> I just want to say I still don't think when there's page cache then we must put
> all the page-relevant things into the page cache.

[sorry for the late reply]

Right, but for the case of shared, swapped out pages, the information is 
already there, in the page cache :)

> 
> They're shared by processes, but process can still have its own way to describe
> the relationship to that page in the cache, to me it's as simple as "we allow
> process A to write to page cache P", while "we don't allow process B to write
> to the same page" like the write bit.

The issue I'm having with uffd-wp as it was proposed for shared memory
is that there is hardly a sane use case where we would *want* it to
work that way.

A UFFD-WP flag in a page table for shared memory means "please notify
once this process modifies the shared memory (via page tables, not via
any other fd modification)". Do we have an example application where
these semantics make sense and don't over-complicate the whole
approach? I don't know any, thus I'm asking dumb questions :)
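
To make those semantics concrete, here is a minimal toy model in Python.
It is not kernel code and uses no real userfaultfd API: `SharedFile`,
`Process`, `wp_pages` and both write methods are hypothetical names
invented for illustration. It only shows which writes a per-page-table
bit can and cannot observe.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFile:
    """Toy stand-in for a shmem/memfd file: page index -> contents."""
    pages: dict = field(default_factory=dict)

    def write_via_fd(self, page, data):
        # Models write(2) on the fd: the page cache is changed directly,
        # no page table is consulted, so no per-process bit can fire.
        self.pages[page] = data


@dataclass
class Process:
    """Toy process with its own set of write-protected pages."""
    name: str
    file: SharedFile
    wp_pages: set = field(default_factory=set)
    events: list = field(default_factory=list)

    def write_via_mapping(self, page, data):
        # Only this process's own "page table" bit is checked.
        if page in self.wp_pages:
            self.events.append(page)      # event for this process's handler
            self.wp_pages.discard(page)   # handler un-protects the page
        self.file.pages[page] = data


shm = SharedFile()
a = Process("A", shm, wp_pages={0})
b = Process("B", shm)

b.write_via_mapping(0, b"from B")   # not noticed: B has no bit set
shm.write_via_fd(0, b"from fd")     # not noticed: no page table involved
a.write_via_mapping(0, b"from A")   # noticed: A's own bit is set

print(a.events, b.events)
```

In this model only the last write produces an event, although all three
change the same shared page.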


For background snapshots in QEMU the flow would currently be like this, 
assuming all processes have the shared guest memory mapped.

1. Background snapshot preparation: QEMU requests all processes
    to uffd-wp the range
a) All processes register a uffd handler on guest RAM
b) All processes fault in all guest memory (essentially populating all
    memory): with a uffd-WP extension we might be able to get rid of
    that, I remember you were working on that.
c) All processes uffd-WP the range to set the bit in their page table

2. Background snapshot runs:
a) A process either receives a UFFD-WP event and forwards it to QEMU or
    QEMU polls all other processes for UFFD events.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU triggers all processes to un-protect the page and wake up any
    waiters. All processes clear the uffd-WP bit in their page tables.

3. Background snapshot completes:
a) All processes unregister the uffd handler
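
The steps above can be sketched as a small Python simulation that counts
how much coordination the per-process flow needs. This is a toy model
under stated assumptions, not a measurement: the function name, the
step names, and the convention that QEMU is process 0 are all invented
for illustration.

```python
def per_process_snapshot(num_processes, num_pages, writes):
    """Replay `writes`, a list of (process_index, page) tuples, against
    the per-process flow and return (pages_saved, step_counts)."""
    steps = {"register": 0, "populate": 0, "protect": 0,
             "forward": 0, "unprotect": 0, "unregister": 0}

    # Step 1: every process registers, populates and write-protects.
    wp_bits = []
    for _ in range(num_processes):
        steps["register"] += 1
        steps["populate"] += num_pages   # one fault-in per page
        steps["protect"] += 1            # one range-wide request
        wp_bits.append(set(range(num_pages)))

    # Step 2: a write faults only if the writing process has the bit set.
    saved = []
    for proc, page in writes:
        if page not in wp_bits[proc]:
            continue                     # already un-protected everywhere
        if proc != 0:
            steps["forward"] += 1        # event forwarded to QEMU (process 0)
        saved.append(page)               # QEMU writes the page to the stream
        for bits in wp_bits:             # every process clears its own bit
            bits.discard(page)
            steps["unprotect"] += 1

    # Step 3: every process unregisters.
    steps["unregister"] = num_processes
    return saved, steps


saved, steps = per_process_snapshot(
    num_processes=3, num_pages=4, writes=[(1, 2), (0, 2), (2, 0)])
print(saved, steps)
```

Every count except the number of saved pages scales with the number of
processes that map the guest memory.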


Now imagine something like this:

1. Background snapshot preparation:
a) QEMU registers a UFFD-WP handler on a *memfd file* that corresponds
    to guest memory.
b) QEMU uffd-wp's the whole file

2. Background snapshot runs:
a) QEMU receives a UFFD-WP event.
b) QEMU writes the to-be-changed page to the migration stream.
c) QEMU un-protects the page and wakes up any waiters.

3. Background snapshot completes:
a) QEMU unregisters the uffd handler
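
As a rough illustration of the imagined flow, here is a toy Python model
in which the write-protect state lives on the file rather than in any
page table. No such kernel interface exists, so `WriteProtectedFile` and
all of its methods are hypothetical and only mirror the steps listed
above.

```python
class WriteProtectedFile:
    """Toy memfd whose write-protect state is kept per file page."""

    def __init__(self, num_pages):
        self.pages = [b""] * num_pages
        self.wp = set()
        self.handler = None

    def register(self, handler):
        self.handler = handler

    def protect_all(self):
        self.wp = set(range(len(self.pages)))

    def unprotect(self, page):
        self.wp.discard(page)

    def unregister(self):
        self.handler = None
        self.wp.clear()

    def write(self, page, data, who):
        # Any writer, whether another process's mapping or write(2) on
        # the fd, hits the same file-level state. `who` is informational.
        if page in self.wp and self.handler is not None:
            self.handler(self, page)     # runs before the data changes
        self.pages[page] = data


stream = {}                              # stand-in for the migration stream


def qemu_handler(f, page):
    stream[page] = f.pages[page]         # 2b) save the to-be-changed page
    f.unprotect(page)                    # 2c) un-protect it


memfd = WriteProtectedFile(4)
memfd.pages[1] = b"old"
memfd.register(qemu_handler)             # 1a)
memfd.protect_all()                      # 1b)
memfd.write(1, b"new", who="other process mapping")
memfd.write(1, b"newer", who="write(2) on the fd")
memfd.unregister()                       # 3a)
print(stream, memfd.pages[1])
```

Only QEMU holds a handler here, and the old contents are captured once
regardless of which process or path performs the write.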


Wouldn't that be much nicer and much easier to handle? Yes, it is much 
harder to implement because such an infrastructure does not exist yet, 
and it most probably wouldn't be called uffd anymore, because we are 
dealing with file access. But this way, it would actually be super easy 
to use the feature across multiple processes and eventually to even 
catch other file modifications.

Again, I am not sure if uffd-wp or softdirty make too much sense in 
general when applied to shmem. But I'm happy to learn more.

-- 
Thanks,

David / dhildenb




Thread overview: 13+ messages
2021-07-30 16:08 [PATCH 0/1] pagemap: swap location for shared pages Tiberiu A Georgescu
2021-07-30 16:08 ` [PATCH 1/1] pagemap: report " Tiberiu A Georgescu
2021-07-30 17:28 ` [PATCH 0/1] pagemap: " Eric W. Biederman
2021-08-02 12:20   ` Tiberiu Georgescu
2021-08-04 18:33 ` Peter Xu
2021-08-04 18:49   ` David Hildenbrand
2021-08-04 19:17     ` Peter Xu
2021-08-11 16:15       ` David Hildenbrand [this message]
2021-08-11 16:17         ` David Hildenbrand
2021-08-11 18:25         ` Peter Xu
2021-08-11 18:41           ` David Hildenbrand
2021-08-11 19:54             ` Peter Xu
2021-08-11 20:13               ` David Hildenbrand
