Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI

From: Peter Xu <peterx@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
	brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com,
	hughd@google.com, mhocko@suse.com, axelrasmussen@google.com,
	rppt@kernel.org, willy@infradead.org, Liam.Howlett@oracle.com,
	jannh@google.com, zhangpeng362@huawei.com, bgeffon@google.com,
	kaleshsingh@google.com, ngeoffray@google.com, jdduke@google.com,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	kernel-team@android.com
Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI
Date: Fri, 13 Oct 2023 12:08:10 -0400	[thread overview]
Message-ID: <ZSlragGjFEw9QS1Y@x1n> (raw)
In-Reply-To: <205abf01-9699-ff1c-3e4e-621913ada64e@redhat.com>

On Fri, Oct 13, 2023 at 11:56:31AM +0200, David Hildenbrand wrote:
> Hi Peter,

Hi, David,

> 
> > I used to have the same thought with David on whether we can simplify the
> > design to e.g. limit it to single mm.  Then I found that the trickiest is
> > actually patch 1 together with the anon_vma manipulations, and the problem
> > is that's not avoidable even if we restrict the api to apply on single mm.
> > 
> > What else we can benefit from single mm?  One less mmap read lock, but
> > probably that's all we can get; IIUC we need to keep most of the rest of
> > the code, e.g. pgtable walks, double pgtable lockings, etc.
> 
> No existing mechanisms move anon pages between unrelated processes, that
> naturally makes me nervous if we're doing it "just because we can".

IMHO that's also the potential, when guarded with userfaultfd descriptor
being shared between two processes.

See below with more comment on the raised concerns.

> 
> > 
> > Actually, even though I have no solid clue, but I had a feeling that there
> > can be some interesting way to leverage this across-mm movement, while
> > keeping things all safe (by e.g. elaborately requiring other proc to create
> > uffd and deliver to this proc).
> 
> Okay, but no real use cases yet.

I can provide a "not solid" example.  I didn't mention it because it's
really something that just popped into my mind when thinking cross-mm, so I
never discussed with anyone yet nor shared it anywhere.

Consider VM live upgrade in a generic form (e.g., no VFIO), we can do that
very efficiently with shmem or hugetlbfs, but not yet anonymous.  We can do
extremely efficient postcopy live upgrade now with anonymous if with REMAP.

Basically I see it a potential way of moving memory efficiently especially
with thp.

> 
> > 
> > Considering Andrea's original version already contains those bits and all
> > above, I'd vote that we go ahead with supporting two MMs.
> 
> You can do nasty things with that, as it stands, on the upstream codebase.
> 
> If you pin the page in src_mm and move it to dst_mm, you successfully broke
> an invariant that "exclusive" means "no other references from other
> processes". That page is marked exclusive but it is, in fact, not exclusive.

It is still exclusive to the dst mm?  I see your point, but I think you're
taking exclusiveness altogether with pinning, and IMHO that may not be
always necessary?

> 
> Once you achieved that, you can easily have src_mm not have MMF_HAS_PINNED,

(I suppose you meant dst_mm here)

> so you can just COW-share that page. Now you successfully broke the
> invariant that COW-shared pages must not be pinned. And you can even trigger
> VM_BUG_ONs, like in sanity_check_pinned_pages().

Yeah, that's really unfortunate.  But frankly, I don't think it's the fault
of this new feature, but the rest.

Let's imagine if the MMF_HAS_PINNED wasn't proposed as a per-mm flag, but
per-vma, which I don't see why we can't because it's simply a hint so far.
Then if we apply the same rule here, UFFDIO_REMAP won't even work for
single-mm as long as cross-vma. Then UFFDIO_REMAP as a whole feature will
be NACKed simply because of this..

And I don't think anyone can guarantee a per-vma MMF_HAS_PINNED can never
happen, or any further change to pinning solution that may affect this.  So
far it just looks unsafe to remap a pin page to me.

I don't have a good suggestion here if this is a risk.. I'd think it risky
then to do REMAP over pinned pages no matter cross-mm or single-mm.  It
means probably we just rule them out: folio_maybe_dma_pinned() may not even
be enough to be safe with fast-gup.  We may need page_needs_cow_for_dma()
with proper write_protect_seq no matter cross-mm or single-mm?

> 
> Can it all be fixed? Sure, with more complexity. For something without clear
> motivation, I'll have to pass.

I think what you raised is a valid concern, but IMHO it's better fixed no
matter cross-mm or single-mm.  What do you think?

In general, pinning lose its whole point here to me for an userspace either
if it DONTNEEDs it or REMAP it.  What would be great to do here is we unpin
it upon DONTNEED/REMAP/whatever drops the page, because it loses its
coherency anyway, IMHO.

> 
> Once there is real demand, we can revisit it and explore what else we would
> have to take care of (I don't know how memcg behaves when moving between
> completely unrelated processes, maybe that works as expected, I don't know
> and I have no time to spare on reviewing features with no real use cases)
> and announce it as a new feature.

Good point.  memcg is probably needed..

So you reminded me to do a more thorough review against zap/fault paths, I
think what's missing are (besides page pinning):

  - mem_cgroup_charge()/mem_cgroup_uncharge(): 

    (side note: I think folio_throttle_swaprate() is only for when
     allocating new pages, so not needed here)

  - check_stable_address_space() (under pgtable lock)

  - tlb flush

    Hmm???????????????? I can't see anywhere we did tlb flush, batched or
    not, either single-mm or cross-mm should need it.  Is this missing?

> 
> 
> Note: that (with only reading the documentation) it also kept me wondering
> how the MMs are even implied from
> 
>        struct uffdio_move {
>            __u64 dst;    /* Destination of move */
>            __u64 src;    /* Source of move */
>            __u64 len;    /* Number of bytes to move */
>            __u64 mode;   /* Flags controlling behavior of move */
>            __s64 move;   /* Number of bytes moved, or negated error */
>        };
> 
> That probably has to be documented as well, in which address space dst and
> src reside.

Agreed, some better documentation will never hurt.  Dst should be in the mm
address space that was bound to the userfault descriptor.  Src should be in
the current mm address space.

Thanks,

-- 
Peter Xu