linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>, Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Minchan Kim <minchan@kernel.org>, Jann Horn <jannh@google.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@intel.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Richard Henderson <rth@twiddle.net>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	Matt Turner <mattst88@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>,
	Max Filippov <jcmvbkbc@gmail.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Peter Xu <peterx@redhat.com>,
	Rolf Eike Beer <eike-kernel@sf-tec.de>,
	linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
	linux-arch@vger.kernel.org, Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH resend v2 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
Date: Thu, 20 May 2021 15:44:18 +0200	[thread overview]
Message-ID: <YKZnsnqgEMx6Xl6X@dhcp22.suse.cz> (raw)
In-Reply-To: <b2fa988d-9625-7c0e-ce83-158af0058e38@redhat.com>

On Tue 18-05-21 14:03:52, David Hildenbrand wrote:
[...]
> > > I expect that use case might vanish over
> > > time (eventually with new kernels and updated user space), but it might
> > > stick for a bit.
> > 
> > Could you elaborate some more please?
> 
> After I raised that the current behavior of userfaultfd-wp is suboptimal,
> Peter started working on a userfaultfd-wp mode that doesn't require to
> prefault all pages just to have it working reliably -- getting notified when
> any page changes, including ones that haven't been populated yet and would
> have been populated with the shared zeropage on first access. Not sure what
> the state of that is and when we might see it.

OK, thanks for the clarification. This suggests that inventing a new
interface to cover this usecase doesn't sound like the strongest
justification to me. But this doesn't mean this disqualifies it either.

> > > Apart from that, populating the shared zeropage might be relevant in some
> > > corner cases: I remember there are sparse matrix algorithms that operate
> > > heavily on the shared zeropage.
> > 
> > I am not sure I see why this would be a useful interface for those? Zero
> > page read fault is really low cost. Or are you worried about cummulative
> > overhead by entering the kernel many times?
> 
> Yes, cumulative overhead when dealing with large, sparse matrices. Just an
> example where I think it could be applied in the future -- but not that I
> consider populating the shared zeropage a really important use case in
> general (besides for userfaultfd-wp right now).

OK.
 
[...]
> Anyhow, please suggest a way to handle it via a single flag in the kernel --
> which would be some kind of heuristic as we know from MAP_POPULATE. Having
> an alternative at hand would make it easier to discuss this topic further. I
> certainly *don't* want MAP_POPULATE semantics when it comes to
> MADV_POPULATE, especially when it comes to shared mappings. Not useful in
> QEMU now and in the future.

OK, this point is still not entirely clear to me. Elsewhere you are
saying that QEMU cannot use MAP_POPULATE because it ignores errors
and also it doesn't support sparse mappings because they apply to the
whole mmap. These are all clear but it is less clear to me why the same
semantic is not applicable for QEMU when used through madvise interface
which can handle both of those.
Do I get it right that you really want to emulate the full fledged write
fault to a) limit another write fault when the content is actually
modified and b) prevent from potential errors during the write fault
(e.g. mkwrite failing on the fs data)?

> We could make MADV_POPULATE act depending on the readability/writability of
> a mapping. Use MADV_POPULATE_WRITE on writable mappings, use
> MADV_POPULATE_READ on readable mappings. Certainly not perfect for use cases
> where you have writable mappings that are mostly read only (as in the
> example with fake-NVDIMMs I gave ...), but if it makes people happy, fine
> with me. I mostly care about MADV_POPULATE_WRITE.

Yes, this is where my thinking was going as well. Essentially define
MADV_POPULATE as "Populate the mapping with the memory based on the
mapping access." This looks like a straightforward semantic to me and it
doesn't really require any deep knowledge of internals.

Now, I was trying to compare which of those would be more tricky to
understand and use and TBH I am not really convinced any of the two is
much better. Separate READ/WRITE modes are explicit which can be good
but it will require quite an advanced knowledge of the #PF behavior.
On the other hand MADV_POPULATE would require some tricks like mmap,
madvise and mprotect(to change to writable) when the data is really
written to. I am not sure how much of a deal this would be for QEMU for
example.

So, all that being said, I am not really sure. I am not really happy
about READ/WRITE split but if a simpler interface is going to be a bad
fit for existing usecases then I believe a proper way to go is the
document the more complex interface thoroughly.
-- 
Michal Hocko
SUSE Labs

  reply	other threads:[~2021-05-20 13:44 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-11  8:15 [PATCH resend v2 0/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables David Hildenbrand
2021-05-11  8:15 ` [PATCH resend v2 1/5] mm: make variable names for populate_vma_page_range() consistent David Hildenbrand
2021-05-11  9:54   ` Oscar Salvador
2021-05-11  9:56     ` Oscar Salvador
2021-05-11  8:15 ` [PATCH resend v2 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables David Hildenbrand
2021-05-18 10:07   ` Michal Hocko
2021-05-18 10:32     ` David Hildenbrand
2021-05-18 11:17       ` Michal Hocko
2021-05-18 12:03         ` David Hildenbrand
2021-05-20 13:44           ` Michal Hocko [this message]
2021-05-21  8:48             ` David Hildenbrand
2021-05-11  8:15 ` [PATCH resend v2 3/5] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT David Hildenbrand
2021-05-11  9:47   ` Mike Rapoport
2021-05-11  8:15 ` [PATCH resend v2 4/5] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore David Hildenbrand
2021-05-11  8:15 ` [PATCH resend v2 5/5] selftests/vm: add test for MADV_POPULATE_(READ|WRITE) David Hildenbrand
2021-06-08 16:00 ` [PATCH] madvise.2: Document MADV_POPULATE_READ and MADV_POPULATE_WRITE David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YKZnsnqgEMx6Xl6X@dhcp22.suse.cz \
    --to=mhocko@suse.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=chris@zankel.net \
    --cc=dave.hansen@intel.com \
    --cc=david@redhat.com \
    --cc=deller@gmx.de \
    --cc=eike-kernel@sf-tec.de \
    --cc=hughd@google.com \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jannh@google.com \
    --cc=jcmvbkbc@gmail.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-alpha@vger.kernel.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=linux-xtensa@linux-xtensa.org \
    --cc=mattst88@gmail.com \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=mst@redhat.com \
    --cc=osalvador@suse.de \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=rth@twiddle.net \
    --cc=tsbogend@alpha.franken.de \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).