From: Suren Baghdasaryan <surenb@google.com>
To: David Hildenbrand <david@redhat.com>
Cc: Michel Lespinasse <michel@lespinasse.org>,
	Linux-MM <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	kernel-team@fb.com, Laurent Dufour <ldufour@linux.ibm.com>,
	Jerome Glisse <jglisse@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Matthew Wilcox <willy@infradead.org>,
	Liam Howlett <liam.howlett@oracle.com>,
	Rik van Riel <riel@surriel.com>,
	Paul McKenney <paulmck@kernel.org>,
	Song Liu <songliubraving@fb.com>,
	Minchan Kim <minchan@google.com>,
	Joel Fernandes <joelaf@google.com>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Andy Lutomirski <luto@kernel.org>,
	Tim Murray <timmurray@google.com>
Subject: Re: [PATCH v2 00/35] Speculative page faults
Date: Mon, 31 Jan 2022 09:00:18 -0800
Message-ID: <CAJuCfpH+emfOg55Fh6hO90+MeGmg2r5FKR7BeuzPJtsu1ArtZA@mail.gmail.com>
In-Reply-To: <1bfec16f-76c6-9beb-26b2-ca508baa76a3@redhat.com>

On Mon, Jan 31, 2022 at 1:56 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 28.01.22 14:09, Michel Lespinasse wrote:
>
> Hi Michel,
>
> > This patchset is my take on speculative page faults (spf).
> > It builds on ideas previously proposed by Laurent Dufour,
> > Peter Zijlstra and others. While Laurent's previous proposal
> > was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> > this now based on what I think is a simpler and more bisectable approach,
> > much improved scaling numbers in the anonymous vma case, and the Android
> > use case that has since emerged. I will expand on these points towards
> > the end of this message.
> >
> > The patch series applies on top of linux v5.17-rc1;
> > a git tree is also available:
> > git fetch https://github.com/lespinasse/linux.git v5.17-rc1-spf-anon
> >
> > I would like these patches to be considered for inclusion into v5.18.
>
> Just a general note: we certainly need (much more) review. And I think
> we'll have to decide whether the maintenance effort + complexity
> will be worth the benefit.
>
> > Several Android vendors are carrying Laurent Dufour's previous SPF work in
> > their kernel trees in order to improve application startup performance,
> > want to converge on an upstream-accepted solution, and have reported good
> > numbers with previous versions of this patchset. Also, there is a broader
> > interest in reducing mmap lock dependencies in critical MM paths,
> > and I think this patchset would be a good first step in that direction.
> >
> >
> > This patchset follows the same overall structure as the v1 proposal,
> > with the following differences:
> > - Commit 12 (mm: separate mmap locked assertion from find_vma) is new.
> > - The mmu notifier lock is new; this fixes a race in v1 patchset
> >   between speculative COW faults and registering new MMU notifiers.
> > - Speculative handling of swap-cache pages has been removed.
> > - Commit 30 is new; this fixes build issues that showed in some configs.
> >
> >
> > In principle it would also be possible to extend this work for handling
> > file mapped vmas; I have pending work on such patches too but they are
> > not mature enough to be submitted for inclusion at this point.
> >
>
> I'd have expected a performance evaluation at this point, to highlight
> the potential benefit and also any downsides.

Hi David,
On Android, we and several Android vendors reported application start
time improvements (a critical metric in the Android world) on the previous
SPF posting.
My test results were included in the cover letter:
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#m23c5cb33b1a04979c792db6ddd7e3245e5f86bcb
Android vendors reported their results on the same thread:
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#m8eb304b67c9a33388e2fe4448a04a74879120b34
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#maaa58f7072732e5a2a77fe9f65dd3e444c2aed04
And Axel ran pft (pagefault test) benchmarks on server class machines
with results reported here:
  https://lore.kernel.org/lkml/eee7431c-3dc8-ca3c-02fb-9e059d30e951@kernel.org/T/#mc3965e87a702c67909a078a67f8f7964d707b2e0
The Android performance team recently reported a case where a
low-end device had visible performance issues and became usable after
SPF was applied. I'm CC'ing Tim Murray from that team to provide more
information if possible.
As a side note, an older version of SPF has been used for several
years on Android, and many vendors have specifically asked us to include
it in our kernels. It is currently maintained in the Android Common Kernel
as an out-of-tree patchset, and getting it upstream would be huge for
us, both for getting more testing in a wider ecosystem and for reducing
maintenance effort.
Thanks,
Suren.




>
> >
> > Patchset summary:
> >
> > Classical page fault processing takes the mmap read lock in order to
> > prevent races with mmap writers. In contrast, speculative fault
> > processing does not take the mmap read lock, and instead verifies,
> > when the results of the page fault are about to get committed and
> > become visible to other threads, that no mmap writers have been
> > running concurrently with the page fault. If the check fails,
> > speculative updates do not get committed and the fault is retried
> > in the usual, non-speculative way (with the mmap read lock held).
> >
> > The concurrency check is implemented using a per-mm mmap sequence count.
> > The counter is incremented at the beginning and end of each mmap write
> > operation. If the counter is initially observed to have an even value,
> > and has the same value later on, the observer can deduce that no mmap
> > writers have been running concurrently with it between those two times.
> > This is similar to a seqlock, except that readers never spin on the
> > counter value (they would instead revert to taking the mmap read lock),
> > and writers are allowed to sleep. One benefit of this approach is that
> > it requires no writer side changes, just some hooks in the mmap write
> > lock APIs that writers already use.
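
The counter check described above can be approximated in plain userspace C.
This is only a rough sketch under stated assumptions, not the patchset's
code: struct toy_mm and every toy_* function are hypothetical names, C11
atomics stand in for the kernel's own primitives, and the actual mmap lock
acquisition is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_struct. */
struct toy_mm {
	atomic_ulong mmap_seq;	/* even: no writer active, odd: writer active */
};

/* Writer side: would be called from the mmap write lock hook. */
static void toy_mmap_write_lock(struct toy_mm *mm)
{
	assert(mm);
	/* Acquiring the real lock is omitted; only the count is modeled. */
	atomic_fetch_add(&mm->mmap_seq, 1);	/* count becomes odd */
}

/* Writer side: would be called from the mmap write unlock hook. */
static void toy_mmap_write_unlock(struct toy_mm *mm)
{
	assert(mm);
	atomic_fetch_add(&mm->mmap_seq, 1);	/* count becomes even again */
}

/*
 * Reader side: sample the count. Returns false if a writer is active,
 * in which case the reader does not spin but falls back to the
 * non-speculative path.
 */
static bool toy_spf_begin(struct toy_mm *mm, unsigned long *seq)
{
	*seq = atomic_load(&mm->mmap_seq);
	return (*seq & 1) == 0;
}

/* Reader side: true if no writer has run since toy_spf_begin(). */
static bool toy_spf_validate(struct toy_mm *mm, unsigned long seq)
{
	return atomic_load(&mm->mmap_seq) == seq;
}
```

A reader would call toy_spf_begin(), do its speculative work, and commit
only if toy_spf_validate() still returns true; any other outcome means
retrying with the mmap read lock held.
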
> >
> > The first step of a speculative page fault is to look up the vma and
> > read its contents (currently by making a copy of the vma, though in
> > principle it would be sufficient to only read the vma attributes that
> > are used in page faults). The mmap sequence count is used to verify
> > that there were no mmap writers concurrent to the lookup and copy steps.
> > Note that walking rbtrees while there may potentially be concurrent
> > writers is not an entirely new idea in linux, as latched rbtrees
> > are already doing this. This is safe as long as the lookup is
> > followed by a sequence check to verify that concurrency did not
> > actually occur (and abort the speculative fault if it did).
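
The lookup-copy-verify step described above could look roughly like the
following userspace sketch. Everything here is an assumption made for
illustration: struct toy_vma, struct toy_mm and the toy_* helpers are
hypothetical, a small array stands in for the vma rbtree, and the plain
struct copy would be a data race in real concurrent C (the kernel uses its
own primitives for that).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced vma: only what a fault would read. */
struct toy_vma {
	unsigned long start, end, flags;
};

struct toy_mm {
	atomic_ulong mmap_seq;		/* odd while a writer is active */
	struct toy_vma vmas[4];		/* stand-in for the vma rbtree */
	size_t nr_vmas;
};

/* Linear search standing in for the rbtree walk; NULL if no vma covers addr. */
static const struct toy_vma *toy_find_vma(const struct toy_mm *mm,
					  unsigned long addr)
{
	for (size_t i = 0; i < mm->nr_vmas; i++)
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

/*
 * Speculative lookup: snapshot the vma, then check that no writer ran
 * during the lookup and copy. A false return means the caller must
 * retry in the non-speculative way, with the mmap read lock held.
 */
static bool toy_spf_lookup(struct toy_mm *mm, unsigned long addr,
			   struct toy_vma *copy)
{
	assert(mm && copy);

	unsigned long seq = atomic_load(&mm->mmap_seq);
	if (seq & 1)
		return false;		/* writer active: do not speculate */

	const struct toy_vma *vma = toy_find_vma(mm, addr);
	if (!vma)
		return false;

	*copy = *vma;			/* only trusted once the check passes */

	return atomic_load(&mm->mmap_seq) == seq;
}
```

The copy is deliberately taken before the second read of the counter, so a
snapshot that raced with a writer is discarded rather than used.
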
> >
> > The next step is to walk down the existing page table tree to find the
> > current pte entry. This is done with interrupts disabled to avoid
> > races with munmap(). Again, not an entirely new idea, as this repeats
> > a pattern already present in fast GUP. Similar precautions are also
> > taken when taking the page table lock.
> >
> > Breaking COW on an existing mapping may require firing MMU notifiers.
> > Some care is required to avoid racing with registering new notifiers.
> > This patchset adds a new per-cpu rwsem to handle this situation.
>
> I have to admit that this sounds complicated and possibly dangerous to me.
>
>
> Here is one of my concerns, I hope you can clarify:
>
> GUP-fast only ever walks page tables and doesn't actually modify any
> page table state; in particular, it doesn't take page table locks, which
> might not reside in the memmap directly but in auxiliary data. It works
> because we only ever drop the last reference to a page table (to free it)
> after we synchronized against GUP-fast either via an IPI or
> synchronize_rcu(), as GUP-fast disables interrupts.
>
>
> I'd assume that taking page table locks on page tables that might no
> longer be spanned by a VMA because of concurrent page table
> deconstruction is dangerous:
>
>
> On munmap(), we do the VMA update under mmap_lock in write mode, to
> then remove the page tables under mmap_lock in read mode.
>
> Let's take a look at free_pte_range() on x86:
>
> free_pte_range()
> -> pte_free_tlb()
>  -> tlb_flush_pmd_range()
>   -> __tlb_adjust_range()
>    /* Doesn't actually flush but only updates the tlb range */
>  -> __pte_free_tlb()
>   -> ___pte_free_tlb()
>    -> pgtable_pte_page_dtor()
>     -> ptlock_free()
>     /* page table lock was freed */
>    -> paravirt_tlb_remove_table()
>     -> tlb_remove_page()
>      -> tlb_remove_page_size()
>       -> __tlb_remove_page_size()
>        /* Page added to TLB batch flushing+freeing */
>
> The later tlb_flush_mmu() via tlb_flush_mmu_free()->tlb_table_flush()
> will then free the page tables, after synchronizing against GUP-fast. But
> at that point we already deconstructed the page tables.
>
> So just reading your summary here, what prevents your approach from taking
> a page table lock while racing against page table lock freeing? I cannot
> see how a seqcount would help.
>
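
The ordering in the call chain above can be modeled with a toy userspace
sketch. It is only an illustration of the concern as stated, under loud
assumptions: all toy_* names are hypothetical, the lock is a separately
allocated dummy object, and the destructor sets the pointer to NULL purely
so the toy can observe the problem safely (a real dangling pointer would
not be detectable this way).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical auxiliary lock living outside the page table page itself. */
struct toy_ptlock {
	int locked;
};

struct toy_pt_page {
	struct toy_ptlock *ptl;		/* separately allocated lock */
	bool queued_for_free;		/* page sits in the batch, not yet freed */
};

static bool toy_pt_ctor(struct toy_pt_page *pt)
{
	assert(pt);
	pt->ptl = calloc(1, sizeof(*pt->ptl));
	pt->queued_for_free = false;
	return pt->ptl != NULL;
}

/* Models the destructor step: the lock goes away immediately. */
static void toy_pt_dtor(struct toy_pt_page *pt)
{
	free(pt->ptl);
	pt->ptl = NULL;			/* toy-only marker, see lead-in */
}

/* Models batching: the page itself is only queued for a later free. */
static void toy_tlb_remove_table(struct toy_pt_page *pt)
{
	pt->queued_for_free = true;
}

/* Same order as in the call chain: destructor first, deferred free second. */
static void toy_free_pte_range(struct toy_pt_page *pt)
{
	toy_pt_dtor(pt);
	toy_tlb_remove_table(pt);
}

/*
 * A walker that is protected only against the page being freed.
 * Returns false when the lock it wants to take no longer exists.
 */
static bool toy_spec_lock(struct toy_pt_page *pt)
{
	if (!pt->ptl)
		return false;
	pt->ptl->locked = 1;
	return true;
}
```

After toy_free_pte_range() the page is still "alive" (merely queued), yet
toy_spec_lock() can no longer succeed, which is the window being asked
about.
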
>
> IIUC, with what you propose we cannot easily have auxiliary data for a
> page table, at least not via current pgtable_pte_page_dtor(), including
> page locks, which is a drawback (and currently possibly a BUG in your
> code?) at least for me. But I only read the cover letter, so I might be
> missing something important :)
>
> --
> Thanks,
>
> David / dhildenb
>

