From: David Hildenbrand <david@redhat.com>
To: Jann Horn <jannh@google.com>
Cc: kernel list <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Arnd Bergmann <arnd@arndb.de>, Michal Hocko <mhocko@suse.com>,
Oscar Salvador <osalvador@suse.de>,
Matthew Wilcox <willy@infradead.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Minchan Kim <minchan@kernel.org>, Jason Gunthorpe <jgg@ziepe.ca>,
Dave Hansen <dave.hansen@intel.com>,
Hugh Dickins <hughd@google.com>, Rik van Riel <riel@surriel.com>,
"Michael S . Tsirkin" <mst@redhat.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Vlastimil Babka <vbabka@suse.cz>,
Richard Henderson <rth@twiddle.net>,
Ivan Kokshaysky <ink@jurassic.park.msu.ru>,
Matt Turner <mattst88@gmail.com>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
"James E.J. Bottomley" <James.Bottomley@hansenpartnership.com>,
Helge Deller <deller@gmx.de>, Chris Zankel <chris@zankel.net>,
Max Filippov <jcmvbkbc@gmail.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Peter Xu <peterx@redhat.com>,
Rolf Eike Beer <eike-kernel@sf-tec.de>,
linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
linux-arch <linux-arch@vger.kernel.org>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory
Date: Tue, 30 Mar 2021 17:01:08 +0200 [thread overview]
Message-ID: <2bab28c7-08c0-7ff0-c70e-9bf94da05ce1@redhat.com> (raw)
In-Reply-To: <CAG48ez0BQ3Vd3nDLEvyiSU0XALgUQ=c-fAwcFVScUkgo_9qVuQ@mail.gmail.com>
[...]
>>
>> Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
>> following semantics:
>> 1. MADV_POPULATE_READ can be used to preallocate backend memory and
>> prefault page tables just like manually reading each individual page.
>> This will not break any COW mappings -- e.g., it will populate the
>> shared zeropage when applicable.
>
> Please clarify what is meant by "backend memory". As far as I can tell
> from looking at the code, MADV_POPULATE_READ on file mappings will
> allocate zeroed memory in the page cache, and map it as readonly pages
> into userspace, but any attempt to actually write to that memory will
> trigger the filesystem's ->page_mkwrite handler; and e.g. ext4 will
> only try to allocate disk blocks at that point, which may fail. So as
> far as I can tell, for files on filesystems like ext4, the current
> implementation of MADV_POPULATE_READ does not replace fallocate(). Am
> I missing something?
Thanks for pointing that out, I guess I was blinded by tmpfs/hugetlbfs
behavior. There might be cases (!tmpfs, !hugetlbfs) where we indeed need
fallocate()+MADV_POPULATE_READ on file mappings.
The logic is essentially what mlock()/MAP_POPULATE does via
populate_vma_page_range() on shared mappings, so I assumed it would
always properly allocate backend memory.
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
* and we would not want to dirty them for nothing.
*/
My tests with MADV_POPULATE_READ:
1. MAP_SHARED on tmpfs: memory in the file is allocated
2. MAP_PRIVATE on tmpfs: memory in the file is allocated
3. MAP_SHARED on hugetlbfs: memory in the file is allocated
4. MAP_PRIVATE on hugetlbfs: memory in the file is *not* allocated
5. MAP_SHARED on ext4: memory in the file is *not* allocated
6. MAP_PRIVATE on ext4: memory in the file is *not* allocated
1..4 are also the reason why it works with memfd as expected.
For 4 and 6 it's not bad: writing to the private mapping will not result
in backend storage/blocks having to get allocated. So the backend
storage is actually RAM (although we don't allocate backend storage here
but use the shared zero page, but that's a different story).
For 5. we indeed need fallocate() before MADV_POPULATE_READ in case we
could have holes.
Thanks for pointing that out.
>
> If the desired semantics are that disk blocks should be preallocated,
> I think you may have to look up the ->vm_file and then internally call
> vfs_fallocate() to address this, kinda like in madvise_remove()?
Does not sound too complicated, but devil might be in the details. At
least for MAP_SHARED this might be the right thing to do. As discussed
above, for MAP_PRIVATE we usually don't want to do that (and SHMEM is
just weird).
I honestly do wonder if breaking with MAP_POPULATE semantics is
beneficial. For my use cases, doing fallocate() plus MADV_POPULATE_READ
on shared, file-backed mappings would certainly be sufficient. But
having a simple, consistent behavior would be much nicer.
I'll give it a thought!
>> 2. If MADV_POPULATE_READ succeeds, all page tables have been populated
>> (prefaulted) readable once.
>> 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
>> prefault page tables just like manually writing (or
>> reading+writing) each individual page. This will break any COW
>> mappings -- e.g., the shared zeropage is never populated.
>> 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
>> (prefaulted) writable once.
>> 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
>> mappings marked with VM_PFNMAP and VM_IO. Also, proper access
>> permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
>> mapping is encountered, madvise() fails with -EINVAL.
>> 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
>> might have been populated. In that case, madvise() fails with
>> -ENOMEM.
>
> AFAICS that's not true (or misphrased). If MADV_POPULATE_*
> successfully populates a bunch of pages, then fails because of an
> error (e.g. EHWPOISON), it will return EHWPOISON, not ENOMEM, right?
Indeed, leftover from previous version. It's clearer in the man page I
prepared, will fix it up.
>
>> 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
>> when encountering a HW poisoned page in the range.
>> 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
>> cannot protect from the OOM (Out Of Memory) handler killing the
>> process.
>>
>> While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
>> preallocate memory and prefault page tables for VMs), there are valid use
>> cases for MADV_POPULATE_READ:
>> 1. Efficiently populate page tables with zero pages (i.e., shared
>> zeropage). This is necessary when using userfaultfd() WP (Write-Protect
>> to properly catch all modifications within a mapping: for
>> write-protection to be effective for a virtual address, there has to be
>> a page already mapped -- even if it's the shared zeropage.
>
> This sounds like a hack to work around issues that would be better
> addressed by improving userfaultfd?
There are plans to do that, indeed.
>
>> 2. Pre-read a whole mapping from backend storage without marking it
>> dirty, such that eviction won't have to write it back. If no backend
>> memory has been allocated yet, allocate the backend memory. Helpful
>> when preallocating/prefaulting a file stored on disk without having
>> to writeback each and every page on eviction.
>
> This sounds reasonable to me.
Yes, the case with holes / backend memory has to be clarified.
>
>> Although sparse memory mappings are the primary use case, this will
>> also be useful for ordinary preallocations where MAP_POPULATE is not
>> desired especially in QEMU, where users can trigger preallocation of
>> guest RAM after the mapping was created.
>>
>> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
>> however, the main motivation back than was performance improvements
>> (which should also still be the case, but it is a secondary concern).
>>
>> V. Single-threaded performance comparison
>>
>> There is a performance benefit when using POPULATE_READ / POPULATE_WRITE
>> already when only using a single thread to do prefaulting/preallocation. As
>> we have less pagefaults for huge pages, the performance benefit is
>> negligible with small mappings.
> [...]
>> diff --git a/mm/gup.c b/mm/gup.c
> [...]
>> +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
>> + unsigned long end, bool write, int *locked)
>> +{
>> + struct mm_struct *mm = vma->vm_mm;
>> + unsigned long nr_pages = (end - start) / PAGE_SIZE;
>> + int gup_flags;
>> +
>> + VM_BUG_ON(!PAGE_ALIGNED(start));
>> + VM_BUG_ON(!PAGE_ALIGNED(end));
>> + VM_BUG_ON_VMA(start < vma->vm_start, vma);
>> + VM_BUG_ON_VMA(end > vma->vm_end, vma);
>> + mmap_assert_locked(mm);
>> +
>> + /*
>> + * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
>> + * a poisoned page.
>> + * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
>> + * !FOLL_FORCE: Require proper access permissions.
>> + */
>> + gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
>> + if (write)
>> + gup_flags |= FOLL_WRITE;
>> +
>> + /*
>> + * See check_vma_flags(): Will return -EFAULT on incompatible mappings
>> + * or with insufficient permissions.
>> + */
>> + return __get_user_pages(mm, start, nr_pages, gup_flags,
>> + NULL, NULL, locked);
>
> You mentioned in the commit message that you don't want to actually
> dirty all the file pages and force writeback; but doesn't
> POPULATE_WRITE still do exactly that? In follow_page_pte(), if
> FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:
Well, I mention that POPULATE_READ explicitly doesn't do that. I
primarily set it because populate_vma_page_range() also sets it.
Is it safe to *not* set it? IOW, fault something writable into a page
table (where the CPU could dirty it without additional page faults)
without marking it accessed? For me, this made logically sense. Thus I
also understood why populate_vma_page_range() set it.
>
> if (flags & FOLL_TOUCH) {
> if ((flags & FOLL_WRITE) &&
> !pte_dirty(pte) && !PageDirty(page))
> set_page_dirty(page);
> /*
> * pte_mkyoung() would be more correct here, but atomic care
> * is needed to avoid losing the dirty bit: it is easier to use
> * mark_page_accessed().
> */
> mark_page_accessed(page);
> }
>
>
>> +}
>> +
>> /*
>> * __mm_populate - populate and/or mlock pages within a range of address space.
>> *
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 3f22c4ceb7b5..ee398696380f 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -335,6 +335,9 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);
>> #ifdef CONFIG_MMU
>> extern long populate_vma_page_range(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end, int *locked);
>> +extern long faultin_vma_page_range(struct vm_area_struct *vma,
>> + unsigned long start, unsigned long end,
>> + bool write, int *locked);
>> extern void munlock_vma_pages_range(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end);
>> static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 01fef79ac761..857460873f7a 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -53,6 +53,8 @@ static int madvise_need_mmap_write(int behavior)
>> case MADV_COLD:
>> case MADV_PAGEOUT:
>> case MADV_FREE:
>> + case MADV_POPULATE_READ:
>> + case MADV_POPULATE_WRITE:
>> return 0;
>> default:
>> /* be safe, default to 1. list exceptions explicitly */
>> @@ -822,6 +824,64 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>> return -EINVAL;
>> }
>>
>> +static long madvise_populate(struct vm_area_struct *vma,
>> + struct vm_area_struct **prev,
>> + unsigned long start, unsigned long end,
>> + int behavior)
>> +{
>> + const bool write = behavior == MADV_POPULATE_WRITE;
>> + struct mm_struct *mm = vma->vm_mm;
>> + unsigned long tmp_end;
>> + int locked = 1;
>> + long pages;
>> +
>> + *prev = vma;
>> +
>> + while (start < end) {
>> + /*
>> + * We might have temporarily dropped the lock. For example,
>> + * our VMA might have been split.
>> + */
>> + if (!vma || start >= vma->vm_end) {
>> + vma = find_vma(mm, start);
>> + if (!vma || start < vma->vm_start)
>> + return -ENOMEM;
>> + }
>> +
>> + tmp_end = min_t(unsigned long, end, vma->vm_end);
>> + /* Populate (prefault) page tables readable/writable. */
>> + pages = faultin_vma_page_range(vma, start, tmp_end, write,
>> + &locked);
>> + if (!locked) {
>> + mmap_read_lock(mm);
>> + locked = 1;
>> + *prev = NULL;
>> + vma = NULL;
>> + }
>> + if (pages < 0) {
>> + switch (pages) {
>> + case -EINTR:
>> + return -EINTR;
>> + case -EFAULT: /* Incompatible mappings / permissions. */
>> + return -EINVAL;
>> + case -EHWPOISON:
>> + return -EHWPOISON;
>> + case -EBUSY:
>
> What is -EBUSY doing here? __get_user_pages() fixes up -EBUSY from
> faultin_page() to 0, right?
>
>> + case -EAGAIN:
>
> Where can -EAGAIN come from?
On both points: the lack of documentation on return values made me add
these. The faultin_page() path is indeed fine. If the other paths don't
yield any such return values, we can drop both.
Thanks for the review!
--
Thanks,
David / dhildenb
next prev parent reply other threads:[~2021-03-30 15:02 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-03-17 11:06 [PATCH v1 0/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory David Hildenbrand
2021-03-17 11:06 ` [PATCH v1 1/5] mm: make variable names for populate_vma_page_range() consistent David Hildenbrand
2021-03-17 11:06 ` [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory David Hildenbrand
2021-03-30 13:37 ` Jann Horn
2021-03-30 15:01 ` David Hildenbrand [this message]
2021-03-30 16:21 ` Jann Horn
2021-03-30 16:30 ` David Hildenbrand
2021-03-30 16:31 ` David Hildenbrand
2021-04-07 10:31 ` David Hildenbrand
2021-04-15 10:26 ` David Hildenbrand
2021-03-17 11:06 ` [PATCH v1 3/5] MAINTAINERS: add tools/testing/selftests/vm/ to MEMORY MANAGEMENT David Hildenbrand
2021-03-17 11:06 ` [PATCH v1 4/5] selftests/vm: add protection_keys_32 / protection_keys_64 to gitignore David Hildenbrand
2021-03-17 11:06 ` [PATCH v1 5/5] selftests/vm: add test for MADV_POPULATE_(READ|WRITE) David Hildenbrand
2021-03-30 8:58 ` [PATCH v1 0/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault/prealloc memory David Hildenbrand
2021-03-31 4:58 ` Andrew Morton
2021-03-31 6:46 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2bab28c7-08c0-7ff0-c70e-9bf94da05ce1@redhat.com \
--to=david@redhat.com \
--cc=James.Bottomley@hansenpartnership.com \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=chris@zankel.net \
--cc=dave.hansen@intel.com \
--cc=deller@gmx.de \
--cc=eike-kernel@sf-tec.de \
--cc=hughd@google.com \
--cc=ink@jurassic.park.msu.ru \
--cc=jannh@google.com \
--cc=jcmvbkbc@gmail.com \
--cc=jgg@ziepe.ca \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-alpha@vger.kernel.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-arch@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mips@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-parisc@vger.kernel.org \
--cc=linux-xtensa@linux-xtensa.org \
--cc=mattst88@gmail.com \
--cc=mhocko@suse.com \
--cc=mike.kravetz@oracle.com \
--cc=minchan@kernel.org \
--cc=mst@redhat.com \
--cc=osalvador@suse.de \
--cc=peterx@redhat.com \
--cc=riel@surriel.com \
--cc=rth@twiddle.net \
--cc=tsbogend@alpha.franken.de \
--cc=vbabka@suse.cz \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).