From: Michal Hocko <mhocko@suse.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>, Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Minchan Kim <minchan@kernel.org>, Jann Horn <jannh@google.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@intel.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Richard Henderson <rth@twiddle.net>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	Matt Turner <mattst88@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>,
	Max Filippov <jcmvbkbc@gmail.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Peter Xu <peterx@redhat.com>,
	Rolf Eike Beer <eike-kernel@sf-tec.de>,
	linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
	linux-arch@vger.kernel.org, Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH resend v2 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
Date: Tue, 18 May 2021 13:17:43 +0200
Message-ID: <YKOiV9VkEdYFM9nB@dhcp22.suse.cz>
In-Reply-To: <bb0e2ebb-e66d-176c-b20a-fbadd95cde98@redhat.com>

On Tue 18-05-21 12:32:12, David Hildenbrand wrote:
> On 18.05.21 12:07, Michal Hocko wrote:
> > [sorry for a long silence on this]
> > 
> > On Tue 11-05-21 10:15:31, David Hildenbrand wrote:
> > [...]
> > 
> > Thanks for the extensive usecase description. That is certainly useful
> > background. I am sorry to bring this up again but I am still not
> > convinced that READ/WRITE variants are the best interface.
> 
> Thanks for having time to look into this.
> 
> > > While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
> > > preallocate memory and prefault page tables for VMs), one issue is that
> > > whenever we prefault pages writable, the pages have to be marked dirty,
> > > because the CPU could dirty them any time. While not a real problem for
> > > hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
> > > page will be marked dirty and has to be written back later when evicting.
> > > 
> > > MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
> > > mapping from backend storage without marking it dirty, such that eviction
> > > won't have to write it back. As discussed above, shared file mappings
> > > might require an explicit fallocate() upfront to achieve
> > > preallocation+prepopulation.
> > 
> > This means that you want to have two different uses depending on the
> > underlying mapping type. MADV_POPULATE_READ seems rather weak for
> > anonymous/private mappings. Memory backed by zero pages seems rather
> > unhelpful as the PF would need to do all the heavy lifting anyway.
> > Or is there any actual usecase when this is desirable?
> 
> Currently, userfaultfd-wp, which requires "some mapping" to be able to arm
> successfully. In QEMU, we currently have to prefault the shared zeropage for
> userfaultfd-wp to work as expected.

Just for clarification. The aim is to reduce the memory footprint at the
same time, right? If that is really the case then this is worth adding.

> I expect that use case might vanish over
> time (eventually with new kernels and updated user space), but it might
> stick for a bit.

Could you elaborate some more please?

> Apart from that, populating the shared zeropage might be relevant in some
> corner cases: I remember there are sparse matrix algorithms that operate
> heavily on the shared zeropage.

I am not sure I see why this would be a useful interface for those? Zero
page read fault is really low cost. Or are you worried about cumulative
overhead by entering the kernel many times?

> > So the split into these two modes seems more like gup interface
> > shortcomings bubbling up to the interface. I do expect userspace only
> > cares about pre-faulting the address range. No matter what the backing
> > storage is.
> > 
> > Or do I still misunderstand all the usecases?
> 
> Let me give you an example where we really cannot tell what would be best
> from a kernel perspective.
> 
> a) Mapping a file into a VM to be used as RAM. We might expect the guest
> to write all memory immediately (e.g., booting Windows). We would want
> MADV_POPULATE_WRITE as we expect a write access immediately.
> 
> b) Mapping a file into a VM to be used as fake-NVDIMM, for example, ROOTFS
> or just data storage. We expect mostly reading from this memory, thus, we
> would want MADV_POPULATE_READ.

I am afraid I do not follow. Could you be more explicit about advantages
of using those two modes for those example usecases? Is that to share
resources (e.g. by not breaking CoW)?

> Instead of trying to be smart in the kernel, I think for this case it makes
> much more sense to provide user space the options. IMHO it doesn't really
> hurt to let user space decide on what it thinks is best.

I am mostly worried that this will turn out to be more confusing than
helpful. People will need to grasp non-trivial concepts and kernel
internal implementation details about how read/write faults are handled.

Thanks!
-- 
Michal Hocko
SUSE Labs
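
For readers following the thread, here is a minimal sketch of how user
space might request either mode. It assumes the advice values as later
merged in Linux 5.14 (MADV_POPULATE_READ = 22, MADV_POPULATE_WRITE = 23).
prefault_range() is a hypothetical helper, not part of the patch series,
and the fallback of touching each page on EINVAL is an illustrative choice
for kernels that do not know the new advice values.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: values as merged in Linux 5.14; older headers may lack them. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: prefault the page tables for [addr, addr + len).
 * writable != 0 asks for write faults (pages end up dirty), writable == 0
 * asks for read faults only. Pass writable != 0 only for mappings that
 * have PROT_WRITE. Returns 0 on success, -1 with errno set on failure.
 */
static int prefault_range(void *addr, size_t len, int writable)
{
    int advice = writable ? MADV_POPULATE_WRITE : MADV_POPULATE_READ;

    if (madvise(addr, len, advice) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;

    /*
     * EINVAL most likely means the kernel does not know the advice value.
     * Fall back to one access per page, which costs one page fault each.
     */
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    volatile unsigned char *p = addr;
    for (size_t off = 0; off < len; off += (size_t)page) {
        if (writable)
            p[off] = p[off];    /* write fault; racy against concurrent writers */
        else
            (void)p[off];       /* read fault only */
    }
    return 0;
}
```

Per the discussion above, calling this with writable == 0 on a private
anonymous mapping would be expected to map the shared zeropage (the
userfaultfd-wp case), while writable == 1 allocates and dirties real pages.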
