All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michel Lespinasse <michel@lespinasse.org>
To: Michel Lespinasse <michel@lespinasse.org>
Cc: Linux-MM <linux-mm@kvack.org>,
	Linux-Kernel <linux-kernel@vger.kernel.org>,
	Laurent Dufour <ldufour@linux.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	Rik van Riel <riel@surriel.com>,
	Paul McKenney <paulmck@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Joel Fernandes <joelaf@google.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH 00/29] Speculative page faults (anon vmas only)
Date: Fri, 30 Apr 2021 15:46:49 -0700	[thread overview]
Message-ID: <20210430224649.GA29203@lespinasse.org> (raw)
In-Reply-To: <20210430195232.30491-1-michel@lespinasse.org>

On Fri, Apr 30, 2021 at 12:52:01PM -0700, Michel Lespinasse wrote:
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.

I want to address a few questions that I think are likely to come up,
about how this patchset relates to others currently being worked on,
and about design points for this patchset (mainly about the
per-mm sequence count).


I- Maple tree

I do not think there is any fundamental conflict between the maple
tree patches currently being considered, and this patchset.
I actually have a (very lightly tested) tree merging the two together,
which was a fairly easy merge. For those interested, I made this
available at my github, as the v5.12-maple-spf branch.

At the same time, Matthew & Liam have made it known that they would
like to build some lockless fault facilities on top of maple tree,
and even though these ideas have not been implemented yet (AFAIK),
my proposal probably falls short of what they have in mind.
From my point of view, I do not see that as a fundamental conflict
either; my take is that I like to use a more incremental approach and
that the speculative page fault ideas are worth exploring on their own;
they could be further extended in the future with some of the additional
ideas I have heard discussed in association to maple tree.

I am aware of two main areas where my proposal is more limited than
the plans I have heard from Matthew & Liam. Maybe there are more, and
I hope they will correct me of that is the case. But, the ones I know
about would be:

1- VMA lookups. This patchset has mmap writers update a sequence
counter around updates; the speculative fault path uses that counter
to detect concurrent updates when looking up and copying the VMA.
This means lookups might fail if they overlap with a concurrent mmap
writer; the alternative discussed by maple tree proponents would be to
make VMAs immutable and have the writers actually make a new copy when
they want to update. While this might impose some costs on the writers,
it would benefit the fault path in two ways: first, lookups would always
succeed, and second, the fault path wouldn't need to make a VMA copy.
I think this is worth exploring, but can be done as a separate step.

2- Validation at the end of the page fault. After taking the page
table lock but before inserting the new PTE, this patchset verifies
the per-mm sequence counter to validate that no mmap writers ran
concurrently with the fault. As people noted, this is quite
restrictive; page faults may unnecessarily abort due to writers
operating on a separate memory range. This topic is worthy discussion
independently of the maple tree stuff, so I'll get back to it later down.

Matthew & Liam, do you have other extensions in mind which I have not
covered here ?


II- Range locking

Prior to this patchset I had been working on mmap range locking
approaches, in order to allow non-overlapping memory operations to
proceed concurrently. I think this is still an interesting idea,
but the speculative page fault proposal is independent of it
and is more mature so I think it should be submitted first.


III- Thoughts about concurrency checks at the end of the page fault

As noted, the check using the per-mm counter can lead to unnecessary
speculative page fault aborts. Why do it that way then ?

The first reason I want to give is practical. The types of faults this
patchset implements speculatively tend to be fairly quick - in
particular, no I/O is involved (for the swap case, we only implement
the case of hitting into the swap cache). As a result, there is not
very much time for concurrent mmap writers to interfere. I did try
implementing a more precise check, but it did not significantly
improve the success rate in workloads I looked at, so it seemed best
to go with the simplest possible check first.

But still, could we implement a precise check that never leads to
unnecessary page fault aborts ?

The simplest way to go about this would seem to be to look up the VMA
again at the end of the page fault (after taking the page table lock
but before inserting the new PTE into the page table). If the VMA
attributes have not changed, we might be tempted to conclude it is
safe to insert the new PTE and complete the page fault. However, I am
not sure if that would always be correct.

The case I am worried about is when breaking COW:
- Page P is COW mapped into processes A and B
- Thread A1 (within process A) takes a write fault on P
- A1 allocates a new page P2
- A1 starts copying P into P2
- B unmaps P
- Thread A2 (within process A) takes a write fault on P
  P now has only one mapping, so A2 just changes P to be writable
  A2's page fault completes
- A2 writes into P
- A2 calls mprotect() to make P's mapping readonly.
  P's PTE gets its W permission bit cleared.
- A2 calls mprotect() to make P's mapping writable again.
- A1 is done copying P into P2.
  A1 takes the page table lock
  A1 verifies that P's VMA has not changed - it's still a writable mapping
  A1 verifies that P's PTE has not changed -
    it still points to P with the W permission bit cleared.
  A1 updates the pte to point to the P2 page (with the W permission bit set)

The above would be incorrect because A2's write into P may get lost.

This seems like a convoluted scenario but I am not sure how to cleanly
protect against it. Surely one could extend the validation mechanism
(Laurent's proposal used per-VMA sequence counts), but there is still
a possibility of unnecessary aborts there, so I don't think that is
fully satisfactory.

I think doing re-fetching the VMA at the end of the page fault would
be safe in at least some of the cases though, most notably if the
original PTE was pte_none. So maybe that would cover enough cases ?

To sum it up, I agree that using the per-mm sequence count to validate
page faults is imperfect, but I think it gives a decent first stab at
the issue, and that further improvements are not trivial enough to
design in a vacuum - they would be better handled by incrementally
addressing problem workloads IMO.


--
Michel "walken" Lespinasse

  parent reply	other threads:[~2021-04-30 22:46 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-30 19:52 [PATCH 00/29] Speculative page faults (anon vmas only) Michel Lespinasse
2021-04-30 19:52 ` [PATCH 01/29] mm: export dump_mm Michel Lespinasse
2021-04-30 19:52 ` [PATCH 02/29] mmap locking API: mmap_lock_is_contended returns a bool Michel Lespinasse
2021-04-30 19:52 ` [PATCH 03/29] mmap locking API: name the return values Michel Lespinasse
2021-04-30 19:52 ` [PATCH 04/29] do_anonymous_page: use update_mmu_tlb() Michel Lespinasse
2021-06-10  0:38   ` Suren Baghdasaryan
2021-06-10  0:38     ` Suren Baghdasaryan
2021-04-30 19:52 ` [PATCH 05/29] do_anonymous_page: reduce code duplication Michel Lespinasse
2021-04-30 19:52 ` [PATCH 06/29] mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT Michel Lespinasse
2021-04-30 19:52 ` [PATCH 07/29] x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT Michel Lespinasse
2021-04-30 19:52 ` [PATCH 08/29] mm: add FAULT_FLAG_SPECULATIVE flag Michel Lespinasse
2021-06-10  0:58   ` Suren Baghdasaryan
2021-06-10  0:58     ` Suren Baghdasaryan
2021-04-30 19:52 ` [PATCH 09/29] mm: add do_handle_mm_fault() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 10/29] mm: add per-mm mmap sequence counter for speculative page fault handling Michel Lespinasse
2021-04-30 19:52 ` [PATCH 11/29] mm: rcu safe vma freeing Michel Lespinasse
2021-04-30 19:52 ` [PATCH 12/29] x86/mm: attempt speculative mm faults first Michel Lespinasse
2021-04-30 19:52 ` [PATCH 13/29] mm: add speculative_page_walk_begin() and speculative_page_walk_end() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 14/29] mm: refactor __handle_mm_fault() / handle_pte_fault() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 15/29] mm: implement speculative handling in __handle_mm_fault() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 16/29] mm: add pte_map_lock() and pte_spinlock() Michel Lespinasse
2021-04-30 23:33   ` kernel test robot
2021-04-30 23:33     ` kernel test robot
2021-04-30 23:45   ` kernel test robot
2021-04-30 23:45     ` kernel test robot
2021-04-30 19:52 ` [PATCH 17/29] mm: implement speculative handling in do_anonymous_page() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 18/29] mm: enable speculative fault handling through do_anonymous_page() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 19/29] mm: implement speculative handling in do_numa_page() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 20/29] mm: enable speculative fault " Michel Lespinasse
2021-04-30 19:52 ` [PATCH 21/29] mm: implement speculative handling in wp_page_copy() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 22/29] mm: implement and enable speculative fault handling in handle_pte_fault() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 23/29] mm: implement speculative handling in do_swap_page() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 24/29] mm: enable speculative fault handling through do_swap_page() Michel Lespinasse
2021-04-30 19:52 ` [PATCH 25/29] mm: disable speculative faults for single threaded user space Michel Lespinasse
2021-04-30 19:52 ` [PATCH 26/29] mm: disable rcu safe vma freeing " Michel Lespinasse
2021-04-30 19:52 ` [PATCH 27/29] mm: anon spf statistics Michel Lespinasse
2021-04-30 22:52   ` kernel test robot
2021-04-30 22:52     ` kernel test robot
2021-04-30 19:52 ` [PATCH 28/29] arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT Michel Lespinasse
2021-04-30 19:52 ` [PATCH 29/29] arm64/mm: attempt speculative mm faults first Michel Lespinasse
2021-04-30 19:52 ` [PATCH 30/31] powerpc/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT Michel Lespinasse
2021-04-30 19:52 ` [PATCH 31/31] powerpc/mm: attempt speculative mm faults first Michel Lespinasse
2021-04-30 22:46 ` Michel Lespinasse [this message]
2021-05-03 18:11   ` [PATCH 00/29] Speculative page faults (anon vmas only) Michel Lespinasse
2021-05-17 17:57     ` Paul E. McKenney
2021-05-20 22:10       ` Suren Baghdasaryan
2021-05-20 22:10         ` Suren Baghdasaryan
2021-05-20 23:08         ` Paul E. McKenney
2021-06-01  7:41         ` Michel Lespinasse
2021-06-01 20:18           ` Paul E. McKenney
2021-06-01 20:23         ` Paul E. McKenney
2021-06-14  7:04         ` Michel Lespinasse
2021-05-01 19:56 ` Theodore Ts'o
2021-05-01 21:19   ` Michel Lespinasse
2021-06-17 13:46 ` David Hildenbrand
2021-07-09 10:41   ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210430224649.GA29203@lespinasse.org \
    --to=michel@lespinasse.org \
    --cc=akpm@linux-foundation.org \
    --cc=joelaf@google.com \
    --cc=ldufour@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mhocko@suse.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=riel@surriel.com \
    --cc=surenb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.