All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Alistair Popple <apopple@nvidia.com>,
	Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>,
	ivan.teterevkov@nutanix.com,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Hugh Dickins <hughd@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
Date: Tue, 17 Aug 2021 20:46:45 +0200	[thread overview]
Message-ID: <c4adbba1-2299-f87c-1893-e83af9beadbc@redhat.com> (raw)
In-Reply-To: <YRvtPrPmAorX+KY5@t490s>

>>
>> For uffd-wp in its current form, it would certainly be the way to go I
>> think. AFAIKT, the idea of special swap entries isn't new, just that it's
>> limited to anonymous memory for now, which makes things like fork and new
>> mappings a lot cheaper.
> 
> Thanks for reviewing this series separately; yes I definitely wanted to get
> comments on both sides: one on the pte marker idea, the other is whether it's
> applicable to this swap+shmem use case.

Exactly.

> 
> Here I really want to make the pte marker be flexible - it can be strict when
> necessary (it will be 100% strict with uffd-wp), then it can also be a hint
> just like what we have with available ptes on soft-dirty, idle, accessed bits.
> Here the swap bit I wanted to make it that kind, so we add zero overhead to
> fork() and we still solve problems.
> 
> Same thing to "newly mapped shmem".  Do we have a use case for that?  If that's
> a hint bit, can we ignore it?

I am really not a fan of taking an already broken feature (broken 
presence information for shmem in pagemap) and instead of fixing it 
properly, turning it less broken, crossing fingers that nobody will 
notice the remaining (undocumented) cases.

[...]

>> As already expressed, we should try storing as little information in page
>> tables as possible if we're dealing with shared memory. The features we
>> design around this all seem to over-complicate the actual users,
>> over-complicate fork, over-complicate handling on new mappings.
> 
> I'll skip the last two "over-complicated" items, because as we've discussed I
> don't think we need to take care of them so far.  We can revisit when they
> become some kind of requirement.
> 
> To me having PM_SWAP 99% right on shmem is still a progress comparing to
> completely missing it, even if it's not 100% right.  It's used for performance
> reasons on PAGEOUT and doing finer-grained memory control from userspace, it's
> not a strict requirement.
> 
> So I still cannot strictly follow why storing information in pte is so bad for
> file-backed, which I can see you really don't like it.  Could you share some
> detailed example?

I am not a fan of your approach of "hints", because while it might work 
for some use cases, it might not work for others (see below for CRIU); I 
would rather like to avoid such "inconsistent caches" where it's not 
really required. But again, this is just my opinion and there might be 
plenty other people that will most probably disagree.

Storing hints in page tables isn't actually that bad, because we can 
drop hints whenever we like (well, there are side effects, and once we 
drop hints too often people might complain again) -- for example, when 
reclaiming "empty" but actually "filled with hints" page tables. When we 
rely on consistent values, fork() and mmap() are a problem. Well, and 
page tables will have to stick around. At least for uffd-wp, mmap() is 
not an issue, and we don't expect too many uffd-wp users such that page 
table consumption would matter -- my guess.

So I repeat, for uffd-wp in its current form, it sounds like the right 
thing to do.

>> But I guess I'm biased at this point because the main users of these
>> features actually want to query/set such properties for all sharers, not
>> individual processes; so the opinion of others would be appreciated.
>>
>>>
>>> Known Issues/Concerns
>>> =====================
>>>
>>> About THP
>>> ---------
>>>
>>> Currently we don't need to worry about THP because paged out shmem pages will
>>> be split when shrinking, IOW we only need to consider PTE, and the markers will
>>> only be applied to a shmem pte not pmd or bigger.
>>>
>>> About PM_SWAP Accuracy
>>> ----------------------
>>>
>>> This is not an "accurate" solution to provide PM_SWAP bit.  Two exmaples:
>>>
>>>     - When process A & B both map shmem page P somewhere, it can happen that only
>>>       one of these ptes got marked with the pte marker.  Imagine below sequence:
>>>
>>>       0. Process A & B both map shmem page P somewhere
>>>       1. Process A zap pte of page P for some reason (e.g. thp split)
>>>       2. System decides to recycle page P
>>>       3. System replace process B's pte (pointed to P) by PTE marker
>>>       4. System _didn't_ replace process A's pte because it was none pte, and
>>>          it'll continue to be none pte
>>>       5. Only process B's relevant pte has the PTE marker after P swapped out
>>>
>>>     - When fork, we don't copy shmem vma ptes, including the pte markers.  So
>>>       even if page P was swapped out, only the parent process has the pte marker
>>>       installed, in child it'll be none pte if fork() happened after pageout.
>>>
>>> Conclusion: just like it used to be, the PM_SWAP is best-effort.  But it should
>>> work in 99.99% cases and it should already start to solve problems.
>>
>> At least I don't like these semantics at all. PM_SWAP is a cached value
>> which might be under-represented and consequently wrong.
> 
> Please have a look at current pagemap impl in pte_to_pagemap_entry().  It's not
> accurate from the 1st day, imho.  E.g., when a page is being migrated from numa
> node 1 to node 2, we'll mark it PM_SWAP but I think it's not the case.  We can
> make it more accurate, but I think it's fine, because it's a hint.

That inconsistency doesn't really matter as you can determine if 
something is present and worth dumping if it's either swapped or 
present. As long as it's one of both but not simply nothing.

I will shamelessly reference 
tools/testing/selftests/vm/madv_populate.c:pagemap_is_populated() that 
checks exactly for that (the test case uses only private anonymous memory).

> 
>> Take CRIU as an example, it has to be correct even if a process would remap a
>> memory region, fork() and unmap in the parent as far as I understand, ...
> 
> Are you talking about dirty bit or swap bit?  I'm a bit confused on why swap
> bit needs to be accurate.  Maybe you mean the dirty bit?

https://criu.org/Shared_memory

"Dumping present pages"

"... CRIU does not dump all of the data. Instead, it determines which 
pages contain it, and only dumps those pages. This is done similarly to 
how regular memory dumping and restoring works, i.e. by looking for 
PRESENT or SWAPPED bits in owners' pagemap entries."

-> Neither PRESENT nor SWAPPED results in memory not getting dumped, 
which makes perfect sense.

1) Process A sets up shared memory and writes data to it.
2) System swaps out memory, hints are setup.
3) Process A forks Process B, hints are not copied.
4) Process A unmaps shared memory, hints are dropped.
5) CRIU migrates process A and B and migrates only PRESENT or SWAPPED in 
pagemap.
6) Process B uses memory in shared memory region. Pages were not migrated.

Just one example; feel free to correct me.


There is notion of the mincore() systemcall:

"There is one particular feature of shared memory dumps worth 
mentioning. Sometimes, a shared memory page can exist in the kernel, but 
it is not mapped to any process. CRIU detects such pages by calling 
mincore() on the shmem segment, which reports back the page in-memory 
status. The mincore bitmap is when ANDed with the per-process ones. "

Not sure if they actually mean ORed, because otherwise they'd be losing 
pages that have been swapped out. "mincore() returns a vector that 
indicates whether pages of the calling process's virtual memory are 
resident in core (RAM)"

-- 
Thanks,

David / dhildenb



  reply	other threads:[~2021-08-17 18:47 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-07  3:25 [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER Peter Xu
2021-08-07  3:25 ` [PATCH RFC 1/4] mm: Introduce PTE_MARKER swap entry Peter Xu
2021-08-07  3:25 ` [PATCH RFC 2/4] mm: Check against orig_pte for finish_fault() Peter Xu
2021-08-07  3:25 ` [PATCH RFC 3/4] mm: Handle PTE_MARKER page faults Peter Xu
2021-08-07  3:25 ` [PATCH RFC 4/4] mm: Install marker pte when page out for shmem pages Peter Xu
2021-08-13 15:18   ` Tiberiu Georgescu
2021-08-13 16:01     ` Peter Xu
2021-08-18 18:02       ` Tiberiu Georgescu
2021-08-17  9:04 ` [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER David Hildenbrand
2021-08-17 17:09   ` Peter Xu
2021-08-17 18:46     ` David Hildenbrand [this message]
2021-08-17 20:24       ` Peter Xu
2021-08-18  8:24         ` David Hildenbrand
2021-08-18 17:52           ` Tiberiu Georgescu
2021-08-18 18:13             ` David Hildenbrand
2021-08-19 14:54               ` Tiberiu Georgescu
2021-08-19 17:26                 ` David Hildenbrand
2021-08-20 16:49                   ` Tiberiu Georgescu
2021-08-20 19:12                     ` Peter Xu
2021-08-25 13:40                       ` Tiberiu Georgescu
2021-08-25 14:59                         ` Peter Xu
2021-08-26 15:01                           ` Tiberiu Georgescu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c4adbba1-2299-f87c-1893-e83af9beadbc@redhat.com \
    --to=david@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=hughd@google.com \
    --cc=ivan.teterevkov@nutanix.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=tiberiu.georgescu@nutanix.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.