All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>, Michal Hocko <mhocko@suse.com>,
	Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Minchan Kim <minchan@kernel.org>, Jann Horn <jannh@google.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@intel.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Richard Henderson <rth@twiddle.net>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	Matt Turner <mattst88@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>,
	Max Filippov <jcmvbkbc@gmail.com>,
	linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
	linux-arch@vger.kernel.org
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Date: Fri, 19 Feb 2021 09:20:16 +0100	[thread overview]
Message-ID: <b24996a6-7652-f88c-301e-28417637fd02@redhat.com> (raw)
In-Reply-To: <20210218225904.GB6669@xz-x1>

On 18.02.21 23:59, Peter Xu wrote:
> Hi, David,
> 
> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
> 
> Could you explain a bit more on how do you plan to use this new interface for
> the virtio-balloon scenario?

Sure, that will bring up an interesting point to discuss 
(MADV_POPULATE_WRITE).

I'm planning on using it in virtio-mem: whenever the guests requests the 
hypervisor (via a virtio-mem device) to make specific blocks available 
("plug"), I want to have a configurable option ("populate=on" / 
"prealloc="on") to perform safety checks ("prealloc") and populate page 
tables.

This becomes especially relevant for private/shared hugetlbfs and shared 
files/shmem where we have a limited pool size (e.g., huge pages, tmpfs 
size, filesystem size). But it will also come in handy when just 
preallocating (esp. zeroing) anonymous memory.

For virito-balloon it is not applicable because it really only supports 
anonymous memory and we cannot fail requests to deflate ...

--- Example ---

Example: Assume the guests requests to make 128 MB available and we're 
using hugetlbfs. Assume we're out of huge pages in the hypervisor - we 
want to fail the request - I want to do some kind of preallocation.

So I could do fallocate() on anything that's MAP_SHARED, but not on 
anything that's MAP_PRIVATE. hugetlbfs via memfd() cannot be 
preallocated without going via SIGBUS handlers.

--- QEMU memory configurations ---

I see the following combinations relevant in QEMU that I want to support 
with virito-mem:

1) MAP_PRIVATE anonymous memory
2) MAP_PRIVATE on hugetlbfs (esp. via memfd)
3) MAP_SHARED on hugetlbfs (esp. via memfd)
4) MAP_SHARED on shmem (file / memfd)
5) MAP_SHARED on some sparse file.

Other MAP_PRIVATE mappings barely make any sense to me - "read the file 
and write to page cache" is not really applicable to VM RAM (not to 
mention doing fallocate(PUNCH_HOLE) that invalidates the private copies 
of all other mappings on that file).

--- Ways to populate/preallocate ---

I see the following ways to populate/preallocate:

a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
    MAP_SHARED
b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
    MAP_SHARED

Especially, 2) is kind of weird as implemented in QEMU 
(util/oslib-posix.c:do_touch_pages):

"Read & write back the same value, so we don't corrupt existing user/app 
data ... TODO: get a better solution from kernel so we don't need to 
write at all so we don't cause wear on the storage backing the region..."

So if we have zero, we write zero. We'll COW pages, triggering a write 
fault - and that's the only good thing about it. For example, similar to 
MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So 
for anonymous memory the actual write is not helpful at all. Similarly 
for hugetlbfs, the actual write is not necessary - but there is no other 
way to really achieve the goal.

--- How MADV_POPULATE is useful ---

With virito-mem, our VM will usually write to memory before it reads it.

With 1) and 2) it does exactly what I want: trigger COW / allocate 
memory and trigger a write fault. The only issue with 1) is that KSM 
might come around and undo our work - but that could only be avoided by 
writing random numbers to all pages from user space. Or we could simply 
rather disable KSM in that setup ...

--- How MADV_POPULATE is not perfect ---

KSM can merge anonymous pages again. Just like the current QEMU 
implementation. The only way around that is writing random numbers to 
the pages or mlocking all memory. No big news.

Nothing stops reclaim/swap code from depopulating when using files. 
Again, no big new - we have to mlock.

--- HOW MADV_POPULATE_WRITE might be useful ---

With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate 
memory and populate page tables. But as it's a read fault, I think we'll 
have another minor fault on access. Not perfect, but better than failing 
with SIGBUS. One way around that would be having an additional 
MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at 
least 3) and 4), most probably not on actual files like 5) ).

Trigger a write fault without actually writing.


Makes sense?

-- 
Thanks,

David / dhildenb


WARNING: multiple messages have this Message-ID (diff)
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>, Michal Hocko <mhocko@suse.com>,
	Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Minchan Kim <minchan@kernel.org>, Jann Horn <jannh@google.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@intel.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Richard Henderson <rth@twiddle.net>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	Matt Turner <mattst88@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.d>
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Date: Fri, 19 Feb 2021 09:20:16 +0100	[thread overview]
Message-ID: <b24996a6-7652-f88c-301e-28417637fd02@redhat.com> (raw)
In-Reply-To: <20210218225904.GB6669@xz-x1>

On 18.02.21 23:59, Peter Xu wrote:
> Hi, David,
> 
> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
> 
> Could you explain a bit more on how do you plan to use this new interface for
> the virtio-balloon scenario?

Sure, that will bring up an interesting point to discuss 
(MADV_POPULATE_WRITE).

I'm planning on using it in virtio-mem: whenever the guests requests the 
hypervisor (via a virtio-mem device) to make specific blocks available 
("plug"), I want to have a configurable option ("populate=on" / 
"prealloc="on") to perform safety checks ("prealloc") and populate page 
tables.

This becomes especially relevant for private/shared hugetlbfs and shared 
files/shmem where we have a limited pool size (e.g., huge pages, tmpfs 
size, filesystem size). But it will also come in handy when just 
preallocating (esp. zeroing) anonymous memory.

For virito-balloon it is not applicable because it really only supports 
anonymous memory and we cannot fail requests to deflate ...

--- Example ---

Example: Assume the guests requests to make 128 MB available and we're 
using hugetlbfs. Assume we're out of huge pages in the hypervisor - we 
want to fail the request - I want to do some kind of preallocation.

So I could do fallocate() on anything that's MAP_SHARED, but not on 
anything that's MAP_PRIVATE. hugetlbfs via memfd() cannot be 
preallocated without going via SIGBUS handlers.

--- QEMU memory configurations ---

I see the following combinations relevant in QEMU that I want to support 
with virito-mem:

1) MAP_PRIVATE anonymous memory
2) MAP_PRIVATE on hugetlbfs (esp. via memfd)
3) MAP_SHARED on hugetlbfs (esp. via memfd)
4) MAP_SHARED on shmem (file / memfd)
5) MAP_SHARED on some sparse file.

Other MAP_PRIVATE mappings barely make any sense to me - "read the file 
and write to page cache" is not really applicable to VM RAM (not to 
mention doing fallocate(PUNCH_HOLE) that invalidates the private copies 
of all other mappings on that file).

--- Ways to populate/preallocate ---

I see the following ways to populate/preallocate:

a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
    MAP_SHARED
b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
    MAP_SHARED

Especially, 2) is kind of weird as implemented in QEMU 
(util/oslib-posix.c:do_touch_pages):

"Read & write back the same value, so we don't corrupt existing user/app 
data ... TODO: get a better solution from kernel so we don't need to 
write at all so we don't cause wear on the storage backing the region..."

So if we have zero, we write zero. We'll COW pages, triggering a write 
fault - and that's the only good thing about it. For example, similar to 
MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So 
for anonymous memory the actual write is not helpful at all. Similarly 
for hugetlbfs, the actual write is not necessary - but there is no other 
way to really achieve the goal.

--- How MADV_POPULATE is useful ---

With virito-mem, our VM will usually write to memory before it reads it.

With 1) and 2) it does exactly what I want: trigger COW / allocate 
memory and trigger a write fault. The only issue with 1) is that KSM 
might come around and undo our work - but that could only be avoided by 
writing random numbers to all pages from user space. Or we could simply 
rather disable KSM in that setup ...

--- How MADV_POPULATE is not perfect ---

KSM can merge anonymous pages again. Just like the current QEMU 
implementation. The only way around that is writing random numbers to 
the pages or mlocking all memory. No big news.

Nothing stops reclaim/swap code from depopulating when using files. 
Again, no big new - we have to mlock.

--- HOW MADV_POPULATE_WRITE might be useful ---

With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate 
memory and populate page tables. But as it's a read fault, I think we'll 
have another minor fault on access. Not perfect, but better than failing 
with SIGBUS. One way around that would be having an additional 
MADV_POPULATE_WRITE, to use in cases where it makes sense (I think at 
least 3) and 4), most probably not on actual files like 5) ).

Trigger a write fault without actually writing.


Makes sense?

-- 
Thanks,

David / dhildenb


  reply	other threads:[~2021-02-19  8:22 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-17 15:48 [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory David Hildenbrand
2021-02-17 15:48 ` David Hildenbrand
2021-02-17 16:46 ` Dave Hansen
2021-02-17 16:46   ` Dave Hansen
2021-02-17 17:06   ` David Hildenbrand
2021-02-17 17:06     ` David Hildenbrand
2021-02-17 17:21 ` Vlastimil Babka
2021-02-17 17:21   ` Vlastimil Babka
2021-02-18 11:07   ` Rolf Eike Beer
2021-02-18 11:07     ` Rolf Eike Beer
2021-02-18 11:27     ` David Hildenbrand
2021-02-18 11:27       ` David Hildenbrand
2021-02-18 10:25 ` Michal Hocko
2021-02-18 10:25   ` Michal Hocko
2021-02-18 10:44   ` David Hildenbrand
2021-02-18 10:44     ` David Hildenbrand
2021-02-18 10:54     ` David Hildenbrand
2021-02-18 10:54       ` David Hildenbrand
2021-02-18 11:28       ` Michal Hocko
2021-02-18 11:28         ` Michal Hocko
2021-02-18 11:27     ` Michal Hocko
2021-02-18 11:27       ` Michal Hocko
2021-02-18 11:38       ` David Hildenbrand
2021-02-18 11:38         ` David Hildenbrand
2021-02-18 12:22 ` [PATCH RFC] madvise.2: Document MADV_POPULATE David Hildenbrand
2021-02-18 12:22   ` David Hildenbrand
2021-02-18 22:59 ` [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory Peter Xu
2021-02-18 22:59   ` Peter Xu
2021-02-19  8:20   ` David Hildenbrand [this message]
2021-02-19  8:20     ` David Hildenbrand
2021-02-19 16:31     ` Peter Xu
2021-02-19 16:31       ` Peter Xu
2021-02-19 17:13       ` David Hildenbrand
2021-02-19 17:13         ` David Hildenbrand
2021-02-19 19:14         ` David Hildenbrand
2021-02-19 19:14           ` David Hildenbrand
2021-02-19 19:25           ` Mike Kravetz
2021-02-19 19:25             ` Mike Kravetz
2021-02-20  9:01             ` David Hildenbrand
2021-02-20  9:01               ` David Hildenbrand
2021-02-19 19:23         ` Peter Xu
2021-02-19 19:23           ` Peter Xu
2021-02-19 20:04           ` David Hildenbrand
2021-02-19 20:04             ` David Hildenbrand
2021-02-22 12:46     ` Michal Hocko
2021-02-22 12:46       ` Michal Hocko
2021-02-22 12:52       ` David Hildenbrand
2021-02-22 12:52         ` David Hildenbrand
2021-02-19 10:35 ` Michal Hocko
2021-02-19 10:35   ` Michal Hocko
2021-02-19 10:43   ` David Hildenbrand
2021-02-19 10:43     ` David Hildenbrand
2021-02-19 11:04     ` Michal Hocko
2021-02-19 11:04       ` Michal Hocko
2021-02-19 11:10       ` David Hildenbrand
2021-02-19 11:10         ` David Hildenbrand
2021-02-20  9:12 ` David Hildenbrand
2021-02-20  9:12   ` David Hildenbrand
2021-02-22 12:56   ` Michal Hocko
2021-02-22 12:56     ` Michal Hocko
2021-02-22 12:59     ` David Hildenbrand
2021-02-22 12:59       ` David Hildenbrand
2021-02-22 13:19       ` Michal Hocko
2021-02-22 13:19         ` Michal Hocko
2021-02-22 13:22         ` David Hildenbrand
2021-02-22 13:22           ` David Hildenbrand
2021-02-22 14:02           ` Michal Hocko
2021-02-22 14:02             ` Michal Hocko
2021-02-22 15:30             ` David Hildenbrand
2021-02-22 15:30               ` David Hildenbrand
2021-02-24 14:25 ` David Hildenbrand
2021-02-24 14:25   ` David Hildenbrand
2021-02-24 14:38   ` David Hildenbrand
2021-02-24 14:38     ` David Hildenbrand
2021-02-25  8:41   ` David Hildenbrand
2021-02-25  8:41     ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b24996a6-7652-f88c-301e-28417637fd02@redhat.com \
    --to=david@redhat.com \
    --cc=James.Bottomley@hansenpartnership.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=chris@zankel.net \
    --cc=dave.hansen@intel.com \
    --cc=deller@gmx.de \
    --cc=hughd@google.com \
    --cc=ink@jurassic.park.msu.ru \
    --cc=jannh@google.com \
    --cc=jcmvbkbc@gmail.com \
    --cc=jgg@ziepe.ca \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-alpha@vger.kernel.org \
    --cc=linux-arch@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-parisc@vger.kernel.org \
    --cc=linux-xtensa@linux-xtensa.org \
    --cc=mattst88@gmail.com \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=mst@redhat.com \
    --cc=osalvador@suse.de \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=rth@twiddle.net \
    --cc=tsbogend@alpha.franken.de \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.