From: David Hildenbrand <dhildenb@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Arnd Bergmann <arnd@arndb.de>, Michal Hocko <mhocko@suse.com>,
	Oscar Salvador <osalvador@suse.de>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Minchan Kim <minchan@kernel.org>, Jann Horn <jannh@google.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Dave Hansen <dave.hansen@intel.com>,
	Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Richard Henderson <rth@twiddle.net>,
	Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
	Matt Turner <mattst88@gmail.com>,
	Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
	"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
	Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>,
	Max Filippov <jcmvbkbc@gmail.com>,
	linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
	linux-arch@vger.kernel.org
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Date: Fri, 19 Feb 2021 15:04:58 -0500 (EST)
Message-ID: <382EFE9E-86CB-4C0C-B0C8-4EAE8A5281D9@redhat.com>
In-Reply-To: <20210219192310.GI6669@xz-x1>


> Am 19.02.2021 um 20:23 schrieb Peter Xu <peterx@redhat.com>:
> 
> On Fri, Feb 19, 2021 at 06:13:47PM +0100, David Hildenbrand wrote:
>>> On 19.02.21 17:31, Peter Xu wrote:
>>> On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
>>>> On 18.02.21 23:59, Peter Xu wrote:
>>>>> Hi, David,
>>>>> 
>>>>> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>>>>>> When we manage sparse memory mappings dynamically in user space - also
>>>>>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>>>>>> discard memory inside such a sparse memory region. Example users are
>>>>>> hypervisors (especially implementing memory ballooning or similar
>>>>>> technologies like virtio-mem) and memory allocators. In addition, we want
>>>>>> to fail in a nice way if populating does not succeed because we are out of
>>>>>> backend memory (which can happen easily with file-based mappings,
>>>>>> especially tmpfs and hugetlbfs).
> 
> [1]
> 
>>> E.g., can we simply ask the kernel "how much memory this process can still
>>> allocate", then get a number out of it?  I'm not sure whether it can be done
>> 
>> Anything like that is completely racy and unreliable.
> 
> The failure path won't be racy imho - if we can detect that the current process
> doesn't have enough memory budget, it'll be more efficient to fail even before
> trying to populate any memory and then dropping part of it again.
> 
> But I see your point - indeed it's good to guarantee the guest won't crash at
> any point of further guest side memory access.
> 
> Another question: can the user actually specify arbitrary max-length for the
> virtio-mem device (which decides the maximum memory this device could possibly
> consume)?  I thought we should check that first before realizing the device and
> we really shouldn't fail any guest memory access if that check passed. Feel
> free to correct me.

Max-length is currently limited by the mmap() we're allowed to create. With MAP_NORESERVE this can be big (not merged yet).

Checking max-length at initialization time does not make much sense. Just imagine shrinking/relocating other VMs so you can grow this VM further, or migrating the VM to another machine where you might grow it further.

The ultimate goal is to adjust the mapping size dynamically on demand, but that's stuff for the future, as it turns out to be complicated. For example, hugetlbfs VMAs cannot be merged yet (although I think it shouldn't be too hard to implement).

The short-term approach is to expose only a small window of the bigger mmap to the guest.
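
To make that concrete, here is a minimal sketch of the idea (illustrative only, not the actual QEMU code; it assumes kernel headers with this RFC's MADV_POPULATE applied):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

/* MADV_POPULATE is only introduced by this RFC, so the sketch needs
 * kernel headers with the patch applied; there is no stock fallback. */
#ifndef MADV_POPULATE
#error "requires kernel headers with the MADV_POPULATE RFC applied"
#endif

int main(void)
{
	const size_t max_len = 16UL << 30;	/* sparse reservation */
	const size_t window  = 256UL << 20;	/* part exposed to the guest */

	/* MAP_NORESERVE: no commit accounting for the whole region, so
	 * the reservation can be much bigger than available memory. */
	char *base = mmap(NULL, max_len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
			  -1, 0);
	if (base == MAP_FAILED)
		return 1;

	/* Preallocate backend memory for the window only; unlike
	 * touching the pages, this fails gracefully with ENOMEM. */
	if (madvise(base, window, MADV_POPULATE)) {
		fprintf(stderr, "populate failed: %s\n", strerror(errno));
		return 1;
	}
	return 0;
}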

>> 
>> That would be kind of weird. I'd assume the reservation gets properly done
>> during fork() - just like for VM_ACCOUNT.
> 
> AFAIK VM_ACCOUNT is never applied for hugetlbfs.  Neither do I know any
> accounting done for hugetlbfs during fork(), if not taking the pinned pages
> into account - that is definitely a special case.
> 

Yes, it isn't - I meant "like" as in "similar to swap reservation".

>> 
>>> However that's definitely not the case for QEMU since QEMU won't work at all as
>>> late as that point.
>>> 
>>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
>>> simply want to know "whether we do still have enough space"..  And IIUC 2)
>>> above is the major issue you'd like to solve too.
>> 
>> To avoid page faults at runtime on access I think. Reservation <=
>> Preallocation.
> 
> Yes.  Besides my above question regarding max-length of virtio-mem device: we
> care most about private mappings of hugetlbfs/shmem here, am I right?
> 
> I'm thinking why we'd need MAP_PRIVATE of these at all for VM context.

One reason is that MAP_SHARED does not support mbind() - and that should include shared hugetlbfs mappings. I did not investigate other side effects / performance considerations regarding allocation.

Similarly, fallocate() does not respect/care about NUMA.

(And yes, NUMA for virtio-mem will be important).
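
For illustration, a rough sketch of the MAP_PRIVATE + NUMA case (the helper name is made up for the example; mbind() comes from libnuma's numaif.h):

#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>	/* mbind(), MPOL_BIND - link with -lnuma */

/* Bind a private mapping to one NUMA node before it is populated;
 * the policy takes effect when the pages are actually allocated.
 * Assumes node < 64 and pre-reserved huge pages; sketch only. */
static void *map_private_on_node(size_t len, int node)
{
	unsigned long nodemask = 1UL << node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		munmap(p, len);
		return NULL;
	}
	return p;	/* now populate, e.g., via MADV_POPULATE */
}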

Thread overview:
2021-02-17 15:48 [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory David Hildenbrand
2021-02-17 16:46 ` Dave Hansen
2021-02-17 17:06   ` David Hildenbrand
2021-02-17 17:21 ` Vlastimil Babka
2021-02-18 11:07   ` Rolf Eike Beer
2021-02-18 11:27     ` David Hildenbrand
2021-02-18 10:25 ` Michal Hocko
2021-02-18 10:44   ` David Hildenbrand
2021-02-18 10:54     ` David Hildenbrand
2021-02-18 11:28       ` Michal Hocko
2021-02-18 11:27     ` Michal Hocko
2021-02-18 11:38       ` David Hildenbrand
2021-02-18 12:22 ` [PATCH RFC] madvise.2: Document MADV_POPULATE David Hildenbrand
2021-02-18 22:59 ` [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory Peter Xu
2021-02-19  8:20   ` David Hildenbrand
2021-02-19 16:31     ` Peter Xu
2021-02-19 17:13       ` David Hildenbrand
2021-02-19 19:14         ` David Hildenbrand
2021-02-19 19:25           ` Mike Kravetz
2021-02-20  9:01             ` David Hildenbrand
2021-02-19 19:23         ` Peter Xu
2021-02-19 20:04           ` David Hildenbrand [this message]
2021-02-22 12:46     ` Michal Hocko
2021-02-22 12:52       ` David Hildenbrand
2021-02-19 10:35 ` Michal Hocko
2021-02-19 10:43   ` David Hildenbrand
2021-02-19 11:04     ` Michal Hocko
2021-02-19 11:10       ` David Hildenbrand
2021-02-20  9:12 ` David Hildenbrand
2021-02-22 12:56   ` Michal Hocko
2021-02-22 12:59     ` David Hildenbrand
2021-02-22 13:19       ` Michal Hocko
2021-02-22 13:22         ` David Hildenbrand
2021-02-22 14:02           ` Michal Hocko
2021-02-22 15:30             ` David Hildenbrand
2021-02-24 14:25 ` David Hildenbrand
2021-02-24 14:38   ` David Hildenbrand
2021-02-25  8:41   ` David Hildenbrand
