* [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi.
@ 2022-07-18 12:01 Nadav Amit
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
                   ` (13 more replies)
  0 siblings, 14 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Following the optimizations that avoid unnecessary TLB flushes [1],
mprotect() and userfaultfd() no longer flush the TLB when the protection
is unchanged. This enabled userfaultfd to write-unprotect a page without
triggering a TLB flush (and potentially a shootdown).

After these changes, David added another feature to mprotect() [2],
allowing pages that can safely be mapped as writable to be mapped as such
directly from mprotect(), instead of going through the page-fault handler.
This saves the overhead of a page-fault when write-unprotecting private
exclusive pages, for instance.

This change, however, introduced some undesired behaviors, especially if
we adopt this new feature for userfaultfd. First, the newly mapped PTE is
not set as dirty, which on x86 can induce over 500 cycles of overhead when
the PTE is first written (if the page was not dirty before). Second, once
again we can have an expensive TLB shootdown when we write-unprotect a
page: when we relax the protection (i.e., grant more permissions), we
would do a TLB flush. If the application is multithreaded, or a
userfaultfd monitor uses write-unprotect (which is a common case), a TLB
shootdown would be needed.
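
For reference, the write-unprotect operation in question is the existing
UFFDIO_WRITEPROTECT ioctl. A minimal userspace sketch (error handling and
the userfaultfd registration omitted; "uffd", "addr" and "page_size" are
placeholders from the monitor's setup):

	#include <linux/userfaultfd.h>
	#include <sys/ioctl.h>

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = page_size },
		.mode  = 0,	/* clear UFFDIO_WRITEPROTECT_MODE_WP */
	};
	/* each such call can currently imply a TLB shootdown */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);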

This patch-set allows userfaultfd to map pages as writable directly upon
the write-(un)protect ioctl, while addressing the undesired behaviors that
occur when userfaultfd write-unprotect or mprotect is used to add
permissions. It also does some cleanup and micro-optimizations along the
way.

The main change in this patch-set - x86-specific, at the moment - is the
introduction of "relaxed" TLB flushes when permissions are added. Upon a
"relaxed" TLB flush, the mm's TLB generation is advanced and the local TLB
is flushed, but no TLB shootdown takes place. If a spurious page-fault
later occurs and the local TLB generation is found to be out of sync with
the mm's generation, a full TLB flush is performed on the faulting core to
prevent further spurious page-faults.
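
Schematically (an illustrative sketch only; apart from inc_mm_tlb_gen(),
the names below are placeholders for the actual x86 code in the patches):

	static void relaxed_tlb_flush(struct mm_struct *mm)
	{
		inc_mm_tlb_gen(mm);	/* advance the mm's TLB generation */
		local_tlb_flush();	/* flush this CPU only; no IPIs */
	}

	/* later, in the page-fault path of a CPU that faults spuriously */
	if (local_tlb_gen != atomic64_read(&mm->context.tlb_gen))
		local_tlb_flush();	/* catch up; avoid further faults */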

To a certain extent, "relaxed flushes" are similar to the changes that
were proposed some time ago for kernel mappings [3]. However, they do not
have any complicated interactions with NMI handlers.

Experiments on Haswell show the performance improvement. Running, for a
single page, a loop of (1) mprotect(PROT_READ); (2)
mprotect(PROT_READ|PROT_WRITE); and then (3) an access gives the following
results (on bare metal this time; a sketch of the loop appears after the
results):

mprotect(PROT_READ) time in cycles:

			1 Thread	2 Threads
Before (5.19rc4+)	2499		4655
+patch			2495		4363 (-6%)


mprotect(PROT_READ|PROT_WRITE) in cycles:

			1 Thread	2 Threads
Before (5.19rc4+)	2529		4675
+patch			2496		2615 (-44%)

If MADV_FREE was run or the page was not dirty, this patch-set can also
shorten the mprotect(PROT_READ) time by skipping the TLB shootdown.
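
For clarity, the measured loop is roughly the following (a sketch; the
cycle measurement, thread setup and "iterations" are placeholders and are
not part of this cover letter):

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	p[0] = 1;					/* populate the page */

	for (int i = 0; i < iterations; i++) {
		mprotect(p, 4096, PROT_READ);			/* (1) */
		mprotect(p, 4096, PROT_READ | PROT_WRITE);	/* (2) */
		p[0] = i;					/* (3) access */
	}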

[1] https://lore.kernel.org/all/20220401180821.1986781-1-namit@vmware.com/
[2] https://lore.kernel.org/all/20220614093629.76309-1-david@redhat.com/
[3] https://lore.kernel.org/all/4797D64D.1060105@goop.org/

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>

Nadav Amit (14):
  userfaultfd: set dirty and young on writeprotect
  userfaultfd: try to map write-unprotected pages
  mm/mprotect: allow exclusive anon pages to be writable
  mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE
  x86/mm: check exec permissions on fault
  mm/rmap: avoid flushing on page_vma_mkclean_one() when possible
  mm: do fix spurious page-faults for instruction faults
  x86/mm: introduce flush_tlb_fix_spurious_fault
  mm: introduce relaxed TLB flushes
  x86/mm: introduce relaxed TLB flushes
  x86/mm: use relaxed TLB flushes when protection is removed
  x86/tlb: no flush on PTE change from RW->RO when PTE is clean
  mm/mprotect: do not check flush type if a strict is needed
  mm: conditional check of pfn in pte_flush_type

 arch/x86/include/asm/pgtable.h  |   4 +-
 arch/x86/include/asm/tlb.h      |   3 +-
 arch/x86/include/asm/tlbflush.h |  90 +++++++++++++++++--------
 arch/x86/kernel/alternative.c   |   2 +-
 arch/x86/kernel/ldt.c           |   3 +-
 arch/x86/mm/fault.c             |  22 +++++-
 arch/x86/mm/tlb.c               |  21 +++++-
 include/asm-generic/tlb.h       | 116 +++++++++++++++++++-------------
 include/linux/mm.h              |   2 +
 include/linux/mm_types.h        |   6 ++
 mm/huge_memory.c                |   9 ++-
 mm/hugetlb.c                    |   2 +-
 mm/memory.c                     |   2 +-
 mm/mmu_gather.c                 |   1 +
 mm/mprotect.c                   |  31 ++++++---
 mm/rmap.c                       |  16 +++--
 mm/userfaultfd.c                |  10 ++-
 17 files changed, 237 insertions(+), 103 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
@ 2022-07-18 12:01 ` Nadav Amit
  2022-07-19 20:47   ` Peter Xu
  2022-07-20  9:42   ` David Hildenbrand
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When userfaultfd makes a PTE writable, it can now change the PTE
directly, in some cases, without triggering a page-fault first. Yet, doing
so might leave the PTE that was write-unprotected as old and clean. At
least on x86, this would cause a >500 cycle overhead when the PTE is first
accessed.

Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
gets a hint that the page is likely to be used. Do not mark the PTE as
young and dirty in other cases, to avoid excessive writeback and
interference with the page-reclamation logic.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/mm.h | 2 ++
 mm/mprotect.c      | 9 ++++++++-
 mm/userfaultfd.c   | 8 ++++++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9cc02a7e503b..4afd75ce5875 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
 /* Whether this change is for write protecting */
 #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
 #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
+/* Whether to try to mark entries as dirty as they are to be written */
+#define  MM_CP_WILL_NEED		   (1UL << 4)
 #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
 					    MM_CP_UFFD_WP_RESOLVE)
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 996a97e213ad..34c2dfb68c42 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	bool will_need = cp_flags & MM_CP_WILL_NEED;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
@@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
+			if (will_need)
+				ptent = pte_mkyoung(ptent);
+
 			/*
 			 * In some writable, shared mappings, we might want
 			 * to catch actual write access -- see
@@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			 */
 			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
 			    !pte_write(ptent) &&
-			    can_change_pte_writable(vma, addr, ptent))
+			    can_change_pte_writable(vma, addr, ptent)) {
 				ptent = pte_mkwrite(ptent);
+				if (will_need)
+					ptent = pte_mkdirty(ptent);
+			}
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			if (pte_needs_flush(oldpte, ptent))
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 954c6980b29f..e0492f5f06a0 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -749,6 +749,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
 	struct vm_area_struct *dst_vma;
 	unsigned long page_mask;
+	unsigned long cp_flags;
 	struct mmu_gather tlb;
 	pgprot_t newprot;
 	int err;
@@ -795,9 +796,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	else
 		newprot = vm_get_page_prot(dst_vma->vm_flags);
 
+	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
+	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
+		cp_flags |= MM_CP_WILL_NEED;
+
 	tlb_gather_mmu(&tlb, dst_mm);
-	change_protection(&tlb, dst_vma, start, start + len, newprot,
-			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
 	tlb_finish_mmu(&tlb);
 
 	err = 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-19 20:49   ` Peter Xu
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When the userfaultfd write-unprotect ioctl is used and a write hint is
given, try to change the PTE to be writable. This saves a subsequent
page-fault.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/userfaultfd.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e0492f5f06a0..6013b217e9f3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -799,6 +799,8 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
 	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
 	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
 		cp_flags |= MM_CP_WILL_NEED;
+	if (!enable_wp && (uffd_flags & UFFD_FLAGS_WRITE_LIKELY))
+		cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
 
 	tlb_gather_mmu(&tlb, dst_mm);
 	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-20 15:19   ` David Hildenbrand
  2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Anonymous pages might have the dirty bit clear, but this should not
prevent mprotect() from making them writable if they are exclusive.
Therefore, skip the dirty-bit test in this case.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 34c2dfb68c42..da5b9bf8204f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 
 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
 
-	if (pte_protnone(pte) || !pte_dirty(pte))
+	if (pte_protnone(pte))
 		return false;
 
 	/* Do we need write faults for softdirty tracking? */
@@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
 		page = vm_normal_page(vma, addr, pte);
 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
 			return false;
-	}
+	} else if (!pte_dirty(pte))
+		return false;
 
 	return true;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (2 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 05/14] x86/mm: check exec permissions on fault Nadav Amit
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When MM_CP_TRY_CHANGE_WRITABLE is used, change_pte_range() tries to set
PTEs as writable.

Yet, writable PTEs might still become read-only, due to various
limitations of the logic that determines whether a PTE can become
writable (see can_change_pte_writable()). Anyhow, it is much easier to
keep the writable bit set when MM_CP_TRY_CHANGE_WRITABLE is used than to
first clear it and then figure out whether it can be set again.

Preserve the write-bit when MM_CP_TRY_CHANGE_WRITABLE is used, similarly
to the way it is done with NUMA.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/mprotect.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index da5b9bf8204f..92bfb17dcb8a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -84,6 +84,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
 	bool will_need = cp_flags & MM_CP_WILL_NEED;
+	bool try_change_writable = cp_flags & MM_CP_TRY_CHANGE_WRITABLE;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 
@@ -114,7 +115,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
 			pte_t ptent;
-			bool preserve_write = prot_numa && pte_write(oldpte);
+			bool preserve_write = (prot_numa || try_change_writable) &&
+					       pte_write(oldpte);
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -190,8 +192,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			 * example, if a PTE is already dirty and no other
 			 * COW or special handling is required.
 			 */
-			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
-			    !pte_write(ptent) &&
+			if (try_change_writable && !pte_write(ptent) &&
 			    can_change_pte_writable(vma, addr, ptent)) {
 				ptent = pte_mkwrite(ptent);
 				if (will_need)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 05/14] x86/mm: check exec permissions on fault
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (3 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin, x86

From: Nadav Amit <namit@vmware.com>

access_error() currently does not check for execution permission
violation. As a result, spurious page-faults due to execution permission
violation cause SIGSEGV.

This has not appeared to be an issue so far, but the next patches avoid
TLB flushes on permission promotion, which can lead to this scenario.
nodejs, for instance, crashes when the TLB flush is avoided on permission
promotion.

Add a check so that access_error() does not mistakenly report spurious
page-faults due to instruction fetches as access errors.

The change assumes that the "instruction fetch" and "write" bits of the
hardware error code are mutually exclusive. However, to be on the safe
side, especially if hypervisors misbehave, check that this is indeed the
case and warn otherwise.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/fault.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index fe10c6d76bac..00013c1fac3f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
-	if (error_code & X86_PF_WRITE) {
+	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
+		/*
+		 * CPUs are not expected to set the two error code bits
+		 * together, but to ensure that hypervisors do not misbehave,
+		 * run an additional sanity check.
+		 */
+		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
+					(X86_PF_WRITE|X86_PF_INSTR)) {
+			WARN_ON_ONCE(1);
+			return 1;
+		}
+
 		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+		if ((error_code & X86_PF_WRITE) &&
+		    unlikely(!(vma->vm_flags & VM_WRITE)))
+			return 1;
+
+		/* exec, present and exec, not present: */
+		if ((error_code & X86_PF_INSTR) &&
+		    unlikely(!(vma->vm_flags & VM_EXEC)))
 			return 1;
+
 		return 0;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (4 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 05/14] x86/mm: check exec permissions on fault Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

x86 is capable of avoiding a TLB flush when a clean writable entry is
write-protected. page_vma_mkclean_one() does not take advantage of this
behavior. Adapt it to do so.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/rmap.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 83172ee0ea35..23997c387858 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -961,17 +961,25 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 
 		address = pvmw->address;
 		if (pvmw->pte) {
-			pte_t entry;
+			pte_t entry, oldpte;
 			pte_t *pte = pvmw->pte;
 
 			if (!pte_dirty(*pte) && !pte_write(*pte))
 				continue;
 
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			entry = ptep_clear_flush(vma, address, pte);
-			entry = pte_wrprotect(entry);
+			oldpte = ptep_modify_prot_start(pvmw->vma, address,
+							pte);
+
+			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
-			set_pte_at(vma->vm_mm, address, pte, entry);
+
+			if (pte_needs_flush(oldpte, entry) ||
+			    mm_tlb_flush_pending(vma->vm_mm))
+				flush_tlb_page(vma, address);
+
+			ptep_modify_prot_commit(vma, address, pte, oldpte,
+						entry);
 			ret = 1;
 		} else {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (5 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

The next patches might cause spurious instruction faults on x86. To
prevent them from occurring repeatedly, call
flush_tlb_fix_spurious_fault() for page-faults on instruction fetches as
well. The callee is expected to do a full flush, or whatever is necessary
to avoid further spurious page-faults.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 31ec3f0071a2..152a47876c36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4924,7 +4924,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * This still avoids useless tlb flushes for .text page faults
 		 * with threads.
 		 */
-		if (vmf->flags & FAULT_FLAG_WRITE)
+		if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_INSTRUCTION))
 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
 	}
 unlock:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (6 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 09/14] mm: introduce relaxed TLB flushes Nadav Amit
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

The next patches introduce relaxed TLB flushes for x86, which require a
full TLB flush upon a spurious page-fault. If a spurious page-fault occurs
on x86, check whether the local TLB generation is out of sync with the
mm's generation and perform a TLB flush if needed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/pgtable.h |  4 +++-
 arch/x86/mm/tlb.c              | 17 +++++++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..1fbdaff1bb7a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1079,7 +1079,9 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
+extern void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+					 unsigned long address);
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault
 
 #define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d400b6d9d246..ff3bcc55435e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -955,6 +955,23 @@ static void put_flush_tlb_info(void)
 #endif
 }
 
+void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+				  unsigned long address)
+{
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+	u64 mm_tlb_gen = atomic64_read(&vma->vm_mm->context.tlb_gen);
+	u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+	struct flush_tlb_info *info;
+
+	if (local_tlb_gen == mm_tlb_gen)
+		return;
+
+	preempt_disable();
+	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0);
+	flush_tlb_func(info);
+	preempt_enable();
+}
+
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 09/14] mm: introduce relaxed TLB flushes
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (7 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: " Nadav Amit
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Introduce the concept of strict and relaxed TLB flushes. Relaxed TLB
flushes are TLB flushes that can be skipped, at the cost of possibly
degraded performance. It is up to arch code (in the next patches) to
handle relaxed flushes correctly; one such behavior is flushing the local
TLB eagerly and remote TLBs lazily.

Track whether a flush is strict in the mmu_gather struct and introduce
the required constants for tracking.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h |  41 ++++++------
 include/asm-generic/tlb.h       | 114 ++++++++++++++++++--------------
 include/linux/mm_types.h        |   6 ++
 mm/huge_memory.c                |   7 +-
 mm/hugetlb.c                    |   2 +-
 mm/mmu_gather.c                 |   1 +
 mm/mprotect.c                   |   8 ++-
 mm/rmap.c                       |   2 +-
 8 files changed, 107 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4af5579c7ef7..77d4810e5a5d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,7 +259,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
-static inline bool pte_flags_need_flush(unsigned long oldflags,
+static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 					unsigned long newflags,
 					bool ignore_access)
 {
@@ -290,71 +290,72 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
 		diff &= ~_PAGE_ACCESSED;
 
 	/*
-	 * Did any of the 'flush_on_clear' flags was clleared set from between
-	 * 'oldflags' and 'newflags'?
+	 * Were any of the 'flush_on_clear' flags cleared between 'oldflags'
+	 * and 'newflags'?
 	 */
 	if (diff & oldflags & flush_on_clear)
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/* Flush on modified flags. */
 	if (diff & flush_on_change)
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
 	    (diff & ~(flush_on_clear | software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
-		return true;
+		return PTE_FLUSH_STRICT;
 	}
 
-	return false;
+	return PTE_FLUSH_NONE;
 }
 
 /*
- * pte_needs_flush() checks whether permissions were demoted and require a
- * flush. It should only be used for userspace PTEs.
+ * pte_flush_type() checks whether permissions were demoted or promoted and
+ * whether a strict or relaxed TLB flush is needed. It should only be used on
+ * userspace PTEs.
  */
-static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pte_flags(oldpte) & _PAGE_PRESENT))
-		return false;
+		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
 	if (pte_pfn(oldpte) != pte_pfn(newpte))
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/*
 	 * check PTE flags; ignore access-bit; see comment in
 	 * ptep_clear_flush_young().
 	 */
-	return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
+	return pte_flags_flush_type(pte_flags(oldpte), pte_flags(newpte),
 				    true);
 }
-#define pte_needs_flush pte_needs_flush
+#define pte_flush_type pte_flush_type
 
 /*
- * huge_pmd_needs_flush() checks whether permissions were demoted and require a
+ * huge_pmd_flush_type() checks whether permissions were demoted and require a
  * flush. It should only be used for userspace huge PMDs.
  */
-static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
-		return false;
+		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
 	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
-		return true;
+		return PTE_FLUSH_STRICT;
 
 	/*
 	 * check PMD flags; do not ignore access-bit; see
 	 * pmdp_clear_flush_young().
 	 */
-	return pte_flags_need_flush(pmd_flags(oldpmd), pmd_flags(newpmd),
+	return pte_flags_flush_type(pmd_flags(oldpmd), pmd_flags(newpmd),
 				    false);
 }
-#define huge_pmd_needs_flush huge_pmd_needs_flush
+#define huge_pmd_flush_type huge_pmd_flush_type
 
 #endif /* !MODULE */
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ff3e82553a76..07b3eb8caf63 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -289,6 +289,11 @@ struct mmu_gather {
 	unsigned int		vma_exec : 1;
 	unsigned int		vma_huge : 1;
 
+	/*
+	 * whether we made flushing strict (added protection) or changed
+	 * mappings.
+	 */
+	unsigned int		strict : 1;
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -325,6 +330,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	tlb->cleared_pmds = 0;
 	tlb->cleared_puds = 0;
 	tlb->cleared_p4ds = 0;
+	tlb->strict = 0;
 	/*
 	 * Do not reset mmu_gather::vma_* fields here, we do not
 	 * call into tlb_start_vma() again to set them if there is an
@@ -518,31 +524,43 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
  * and set corresponding cleared_*.
  */
 static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_ptes = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_pmds = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_puds = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
-				     unsigned long address, unsigned long size)
+				     unsigned long address, unsigned long size,
+				     bool strict)
 {
 	__tlb_adjust_range(tlb, address, size);
 	tlb->cleared_p4ds = 1;
+	if (strict)
+		tlb->strict = 1;
 }
 
 #ifndef __tlb_remove_tlb_entry
@@ -556,24 +574,24 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
  * so we can later optimise away the tlb invalidate.   This helps when
  * userspace is unmapping already-unmapped pages, which happens quite a lot.
  */
-#define tlb_remove_tlb_entry(tlb, ptep, address)		\
-	do {							\
-		tlb_flush_pte_range(tlb, address, PAGE_SIZE);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+#define tlb_remove_tlb_entry(tlb, ptep, address)			\
+	do {								\
+		tlb_flush_pte_range(tlb, address, PAGE_SIZE, true);	\
+		__tlb_remove_tlb_entry(tlb, ptep, address);		\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
-	do {							\
-		unsigned long _sz = huge_page_size(h);		\
-		if (_sz >= P4D_SIZE)				\
-			tlb_flush_p4d_range(tlb, address, _sz);	\
-		else if (_sz >= PUD_SIZE)			\
-			tlb_flush_pud_range(tlb, address, _sz);	\
-		else if (_sz >= PMD_SIZE)			\
-			tlb_flush_pmd_range(tlb, address, _sz);	\
-		else						\
-			tlb_flush_pte_range(tlb, address, _sz);	\
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)		\
+	do {								\
+		unsigned long _sz = huge_page_size(h);			\
+		if (_sz >= P4D_SIZE)					\
+			tlb_flush_p4d_range(tlb, address, _sz, true);	\
+		else if (_sz >= PUD_SIZE)				\
+			tlb_flush_pud_range(tlb, address, _sz, true);	\
+		else if (_sz >= PMD_SIZE)				\
+			tlb_flush_pmd_range(tlb, address, _sz, true);	\
+		else							\
+			tlb_flush_pte_range(tlb, address, _sz, true);	\
+		__tlb_remove_tlb_entry(tlb, ptep, address);		\
 	} while (0)
 
 /**
@@ -586,7 +604,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)			\
 	do {								\
-		tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE);	\
+		tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE, true);\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);		\
 	} while (0)
 
@@ -600,7 +618,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 
 #define tlb_remove_pud_tlb_entry(tlb, pudp, address)			\
 	do {								\
-		tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE);	\
+		tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE, true);\
 		__tlb_remove_pud_tlb_entry(tlb, pudp, address);		\
 	} while (0)
 
@@ -623,52 +641,52 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
  */
 
 #ifndef pte_free_tlb
-#define pte_free_tlb(tlb, ptep, address)			\
-	do {							\
-		tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__pte_free_tlb(tlb, ptep, address);		\
+#define pte_free_tlb(tlb, ptep, address)				\
+	do {								\
+		tlb_flush_pmd_range(tlb, address, PAGE_SIZE, true);	\
+		tlb->freed_tables = 1;					\
+		__pte_free_tlb(tlb, ptep, address);			\
 	} while (0)
 #endif
 
 #ifndef pmd_free_tlb
-#define pmd_free_tlb(tlb, pmdp, address)			\
-	do {							\
-		tlb_flush_pud_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__pmd_free_tlb(tlb, pmdp, address);		\
+#define pmd_free_tlb(tlb, pmdp, address)				\
+	do {								\
+		tlb_flush_pud_range(tlb, address, PAGE_SIZE, true);	\
+		tlb->freed_tables = 1;					\
+		__pmd_free_tlb(tlb, pmdp, address);			\
 	} while (0)
 #endif
 
 #ifndef pud_free_tlb
-#define pud_free_tlb(tlb, pudp, address)			\
-	do {							\
-		tlb_flush_p4d_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__pud_free_tlb(tlb, pudp, address);		\
+#define pud_free_tlb(tlb, pudp, address)				\
+	do {								\
+		tlb_flush_p4d_range(tlb, address, PAGE_SIZE, true);	\
+		tlb->freed_tables = 1;					\
+		__pud_free_tlb(tlb, pudp, address);			\
 	} while (0)
 #endif
 
 #ifndef p4d_free_tlb
-#define p4d_free_tlb(tlb, pudp, address)			\
-	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
-		tlb->freed_tables = 1;				\
-		__p4d_free_tlb(tlb, pudp, address);		\
+#define p4d_free_tlb(tlb, pudp, address)				\
+	do {								\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);		\
+		tlb->freed_tables = 1;					\
+		__p4d_free_tlb(tlb, pudp, address);			\
 	} while (0)
 #endif
 
-#ifndef pte_needs_flush
-static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+#ifndef pte_flush_type
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
 {
-	return true;
+	return PTE_FLUSH_STRICT;
 }
 #endif
 
-#ifndef huge_pmd_needs_flush
-static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+#ifndef huge_pmd_flush_type
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
 {
-	return true;
+	return PTE_FLUSH_STRICT;
 }
 #endif
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b961a29bf26..8825f1314a28 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -698,6 +698,12 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
+enum pte_flush_type {
+	PTE_FLUSH_NONE		= 0,	/* not necessary */
+	PTE_FLUSH_STRICT	= 1,	/* required */
+	PTE_FLUSH_RELAXED	= 2,	/* can cause spurious page-faults */
+};
+
 struct vm_fault;
 
 /**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 60d742c33de3..09e6608a6431 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1713,6 +1713,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+	enum pte_flush_type flush_type;
 
 	tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
 
@@ -1815,8 +1816,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
-	if (huge_pmd_needs_flush(oldpmd, entry))
-		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+	flush_type = huge_pmd_flush_type(oldpmd, entry);
+	if (flush_type != PTE_FLUSH_NONE)
+		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
+				    flush_type == PTE_FLUSH_STRICT);
 
 	BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
 unlock:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6621d3fe4991..9a667237a69a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5022,7 +5022,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 		ptl = huge_pte_lock(h, mm, ptep);
 		if (huge_pmd_unshare(mm, vma, &address, ptep)) {
 			spin_unlock(ptl);
-			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
+			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE, true);
 			force_flush = true;
 			continue;
 		}
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index a71924bd38c0..9a8bd2f23543 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -348,6 +348,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
 		tlb->fullmm = 1;
 		__tlb_reset_range(tlb);
 		tlb->freed_tables = 1;
+		tlb->strict = 1;
 	}
 
 	tlb_flush_mmu(tlb);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92bfb17dcb8a..ead20dc66d34 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -117,6 +117,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			pte_t ptent;
 			bool preserve_write = (prot_numa || try_change_writable) &&
 					       pte_write(oldpte);
+			enum pte_flush_type flush_type;
 
 			/*
 			 * Avoid trapping faults against the zero or KSM
@@ -200,8 +201,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 			}
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
-			if (pte_needs_flush(oldpte, ptent))
-				tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+
+			flush_type = pte_flush_type(oldpte, ptent);
+			if (flush_type != PTE_FLUSH_NONE)
+				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
+						flush_type == PTE_FLUSH_STRICT);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 23997c387858..62f4b2a4f067 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
 
-			if (pte_needs_flush(oldpte, entry) ||
+			if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
 			    mm_tlb_flush_pending(vma->vm_mm))
 				flush_tlb_page(vma, address);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 10/14] x86/mm: introduce relaxed TLB flushes
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (8 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 09/14] mm: introduce relaxed TLB flushes Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Introduce relaxed TLB flushes on x86. When protection is removed from
PTEs (i.e., PTEs become writable or executable), relaxed TLB flushes are
used. Relaxed TLB flushes do flush the local TLB, but do not flush remote
TLBs.

If later a spurious page-fault is encountered, and the local TLB
generation is found to be out of sync with the mm's TLB generation, a
full TLB flush takes place to prevent further spurious page-faults from
occurring.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlb.h      | 3 ++-
 arch/x86/include/asm/tlbflush.h | 9 +++++----
 arch/x86/kernel/alternative.c   | 2 +-
 arch/x86/kernel/ldt.c           | 3 ++-
 arch/x86/mm/tlb.c               | 4 ++--
 5 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 1bfe979bb9bc..51c85136f9a8 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -20,7 +20,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables,
+			   tlb->strict);
 }
 
 /*
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 77d4810e5a5d..230cd1d24fe6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -220,23 +220,24 @@ void flush_tlb_multi(const struct cpumask *cpumask,
 #endif
 
 #define flush_tlb_mm(mm)						\
-		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true, true)
 
 #define flush_tlb_range(vma, start, end)				\
 	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
 			   ((vma)->vm_flags & VM_HUGETLB)		\
 				? huge_page_shift(hstate_vma(vma))	\
-				: PAGE_SHIFT, false)
+				: PAGE_SHIFT, false, true)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables);
+				bool freed_tables, bool strict);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false,
+			   true);
 }
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e257f6c80372..48945a47fd76 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -1099,7 +1099,7 @@ static void *__text_poke(text_poke_f func, void *addr, const void *src, size_t l
 	 */
 	flush_tlb_mm_range(poking_mm, poking_addr, poking_addr +
 			   (cross_page_boundary ? 2 : 1) * PAGE_SIZE,
-			   PAGE_SHIFT, false);
+			   PAGE_SHIFT, false, true);
 
 	if (func == text_poke_memcpy) {
 		/*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..7c7bc97324bc 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -372,7 +372,8 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
 	}
 
 	va = (unsigned long)ldt_slot_va(ldt->slot);
-	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false);
+	flush_tlb_mm_range(mm, va, va + nr_pages * PAGE_SIZE, PAGE_SHIFT, false,
+			   true);
 }
 
 #else /* !CONFIG_PAGE_TABLE_ISOLATION */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ff3bcc55435e..ec5033d28a97 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -974,7 +974,7 @@ void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
+				bool freed_tables, bool strict)
 {
 	struct flush_tlb_info *info;
 	u64 new_tlb_gen;
@@ -1000,7 +1000,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * a local TLB flush is needed. Optimize this use-case by calling
 	 * flush_tlb_func_local() directly in this case.
 	 */
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+	if (strict && cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
 		flush_tlb_multi(mm_cpumask(mm), info);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (9 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: " Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

When checking x86 PTE flags to determine whether a TLB flush is needed,
also determine whether a relaxed TLB flush is sufficient. If permissions
are added (NX cleared or W set), indicate that a relaxed TLB flush would
suffice.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 230cd1d24fe6..4f98735ab07a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,18 +271,23 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	 * dirty/access bit if needed without a fault.
 	 */
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
-					_PAGE_ACCESSED;
+					_PAGE_ACCESSED | _PAGE_RW;
+	const pteval_t flush_on_set = _PAGE_NX;
+	const pteval_t flush_on_set_relaxed = _PAGE_RW;
+	const pteval_t flush_on_clear_relaxed = _PAGE_NX;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
 					_PAGE_SOFTW3 | _PAGE_SOFTW4;
-	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
+	const pteval_t flush_on_change = _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
-			  _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3 | _PAGE_NX;
+			  _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3;
 	unsigned long diff = oldflags ^ newflags;
 
 	BUILD_BUG_ON(flush_on_clear & software_flags);
 	BUILD_BUG_ON(flush_on_clear & flush_on_change);
 	BUILD_BUG_ON(flush_on_change & software_flags);
+	BUILD_BUG_ON(flush_on_change & flush_on_clear_relaxed);
+	BUILD_BUG_ON(flush_on_change & flush_on_set_relaxed);
 
 	/* Ignore software flags */
 	diff &= ~software_flags;
@@ -301,9 +306,16 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	if (diff & flush_on_change)
 		return PTE_FLUSH_STRICT;
 
+	if (diff & oldflags & flush_on_clear_relaxed)
+		return PTE_FLUSH_RELAXED;
+
+	if (diff & newflags & flush_on_set_relaxed)
+		return PTE_FLUSH_RELAXED;
+
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    (diff & ~(flush_on_clear | software_flags | flush_on_change))) {
+	    (diff & ~(flush_on_clear | flush_on_set |
+		      software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
 		return PTE_FLUSH_STRICT;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (10 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

On x86 it is possible to skip a TLB flush when a RW entry becomes RO and
the PTE is clean. Add logic to detect this case and skip the flush.
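
The assumption here, spelled out as an illustration (not part of the
diff): a write through a stale writable TLB entry must first set the
dirty bit, which requires a page-table walk that will observe the now
read-only PTE and fault instead. A dirty RW->RO transition, in contrast,
still needs a strict flush:

	/* hypothetical flag values showing the resulting classification */
	pteval_t oldflags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_RW;
	pteval_t newflags = oldflags & ~_PAGE_RW;

	/* clean RW->RO: pte_flags_flush_type() returns PTE_FLUSH_NONE */

	oldflags |= _PAGE_DIRTY;
	newflags |= _PAGE_DIRTY;

	/* dirty RW->RO: pte_flags_flush_type() returns PTE_FLUSH_STRICT */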

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f98735ab07a..58c95e36b098 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,8 +271,9 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	 * dirty/access bit if needed without a fault.
 	 */
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
-					_PAGE_ACCESSED | _PAGE_RW;
+					_PAGE_ACCESSED;
 	const pteval_t flush_on_set = _PAGE_NX;
+	const pteval_t flush_on_special = _PAGE_RW;
 	const pteval_t flush_on_set_relaxed = _PAGE_RW;
 	const pteval_t flush_on_clear_relaxed = _PAGE_NX;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
@@ -302,6 +303,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 	if (diff & oldflags & flush_on_clear)
 		return PTE_FLUSH_STRICT;
 
+	/*
+	 * Were any of the 'flush_on_set' flags set between 'oldflags' and
+	 * 'newflags'?
+	 */
+	if (diff & newflags & flush_on_set)
+		return PTE_FLUSH_STRICT;
+
+	/* On RW->RO, a flush is needed if the old entry is dirty */
+	if ((diff & oldflags & _PAGE_RW) && (oldflags & _PAGE_DIRTY))
+		return PTE_FLUSH_STRICT;
+
 	/* Flush on modified flags. */
 	if (diff & flush_on_change)
 		return PTE_FLUSH_STRICT;
@@ -314,7 +326,7 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
 
 	/* Ensure there are no flags that were left behind */
 	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    (diff & ~(flush_on_clear | flush_on_set |
+	    (diff & ~(flush_on_clear | flush_on_set | flush_on_special |
 		      software_flags | flush_on_change))) {
 		VM_WARN_ON_ONCE(1);
 		return PTE_FLUSH_STRICT;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (11 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Once it has been determined that a strict TLB flush is needed, it is
likely that other PTEs would also need a strict TLB flush, and there is
little benefit in not extending the range that is flushed.

Skip the check of which TLB flush type is needed if it was already
determined that a strict flush is needed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/huge_memory.c | 4 +++-
 mm/mprotect.c    | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 09e6608a6431..b32b7da0f6f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1816,7 +1816,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	ret = HPAGE_PMD_NR;
 	set_pmd_at(mm, addr, pmd, entry);
 
-	flush_type = huge_pmd_flush_type(oldpmd, entry);
+	flush_type = PTE_FLUSH_STRICT;
+	if (!tlb->strict)
+		flush_type = huge_pmd_flush_type(oldpmd, entry);
 	if (flush_type != PTE_FLUSH_NONE)
 		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
 				    flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ead20dc66d34..cf775f6c8c08 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -202,7 +202,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 
-			flush_type = pte_flush_type(oldpte, ptent);
+			flush_type = PTE_FLUSH_STRICT;
+			if (!tlb->strict)
+				flush_type = pte_flush_type(oldpte, ptent);
 			if (flush_type != PTE_FLUSH_NONE)
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
 						flush_type == PTE_FLUSH_STRICT);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type
  2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
                   ` (12 preceding siblings ...)
  2022-07-18 12:02 ` [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed Nadav Amit
@ 2022-07-18 12:02 ` Nadav Amit
  13 siblings, 0 replies; 37+ messages in thread
From: Nadav Amit @ 2022-07-18 12:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, David Hildenbrand, Peter Xu, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

From: Nadav Amit <namit@vmware.com>

Checking whether the PFNs in two PTEs are the same takes a surprisingly
large number of instructions. Yet, in most cases the caller of
pte_flush_type() already knows whether the PFN was changed. For instance,
mprotect() does not change the PFN, but only modifies the protection
flags.

Add an argument to pte_flush_type() to indicate whether the PFN should be
checked. Keep checking it in mm-debug builds to catch callers that wrongly
assume the PFN is unchanged.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 14 ++++++++++----
 include/asm-generic/tlb.h       |  6 ++++--
 mm/huge_memory.c                |  2 +-
 mm/mprotect.c                   |  2 +-
 mm/rmap.c                       |  2 +-
 5 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 58c95e36b098..50349861fdc9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -340,14 +340,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
  * whether a strict or relaxed TLB flush is need. It should only be used on
  * userspace PTEs.
  */
-static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+						   bool check_pfn)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pte_flags(oldpte) & _PAGE_PRESENT))
 		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
-	if (pte_pfn(oldpte) != pte_pfn(newpte))
+	if (!check_pfn)
+		VM_BUG_ON(pte_pfn(oldpte) != pte_pfn(newpte));
+	else if (pte_pfn(oldpte) != pte_pfn(newpte))
 		return PTE_FLUSH_STRICT;
 
 	/*
@@ -363,14 +366,17 @@ static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
  * huge_pmd_flush_type() checks whether permissions were demoted and require a
  * flush. It should only be used for userspace huge PMDs.
  */
-static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+						      bool check_pfn)
 {
 	/* !PRESENT -> * ; no need for flush */
 	if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
 		return PTE_FLUSH_NONE;
 
 	/* PFN changed ; needs flush */
-	if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+	if (!check_pfn)
+		VM_BUG_ON(pmd_pfn(oldpmd) != pmd_pfn(newpmd));
+	else if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
 		return PTE_FLUSH_STRICT;
 
 	/*
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 07b3eb8caf63..aee9da6cc5d5 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -677,14 +677,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
 #endif
 
 #ifndef pte_flush_type
-static inline struct pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline struct pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+						   bool check_pfn)
 {
 	return PTE_FLUSH_STRICT;
 }
 #endif
 
 #ifndef huge_pmd_flush_type
-static inline bool huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline bool huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+				       bool check_pfn)
 {
 	return PTE_FLUSH_STRICT;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b32b7da0f6f7..92a7b3ca317f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1818,7 +1818,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 	flush_type = PTE_FLUSH_STRICT;
 	if (!tlb->strict)
-		flush_type = huge_pmd_flush_type(oldpmd, entry);
+		flush_type = huge_pmd_flush_type(oldpmd, entry, false);
 	if (flush_type != PTE_FLUSH_NONE)
 		tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
 				    flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cf775f6c8c08..78081d7f4edf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -204,7 +204,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 
 			flush_type = PTE_FLUSH_STRICT;
 			if (!tlb->strict)
-				flush_type = pte_flush_type(oldpte, ptent);
+				flush_type = pte_flush_type(oldpte, ptent, false);
 			if (flush_type != PTE_FLUSH_NONE)
 				tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
 						flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/rmap.c b/mm/rmap.c
index 62f4b2a4f067..63261619b607 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 			entry = pte_wrprotect(oldpte);
 			entry = pte_mkclean(entry);
 
-			if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
+			if (pte_flush_type(oldpte, entry, false) != PTE_FLUSH_NONE ||
 			    mm_tlb_flush_pending(vma->vm_mm))
 				flush_tlb_page(vma, address);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
@ 2022-07-19 20:47   ` Peter Xu
  2022-07-20  9:39     ` David Hildenbrand
  2022-07-20  9:42   ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-19 20:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, David Hildenbrand, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd makes a PTE writable, it can now change the PTE
> directly, in some cases, without going triggering a page-fault first.
> Yet, doing so might leave the PTE that was write-unprotected as old and
> clean. At least on x86, this would cause a >500 cycles overhead when the
> PTE is first accessed.
> 
> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> gets a hint that the page is likely to be used. Avoid changing the PTE
> to young and dirty in other cases to avoid excessive writeback and
> messing with the page reclamation logic.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/mprotect.c      | 9 ++++++++-
>  mm/userfaultfd.c   | 8 ++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9cc02a7e503b..4afd75ce5875 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  /* Whether this change is for write protecting */
>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +/* Whether to try to mark entries as dirty as they are to be written */
> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 996a97e213ad..34c2dfb68c42 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  
> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  				ptent = pte_clear_uffd_wp(ptent);
>  			}
>  
> +			if (will_need)
> +				ptent = pte_mkyoung(ptent);

For the uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
internal flags used with or without the new feature bit set.  It means that
even with !ACCESS_HINT we'll start to set the young bit while we used not
to?  Is that some kind of a light ABI change?

I'd suggest we only set will_need if ACCESS_HINT is set.
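
Something like this, I mean (a rough sketch; UFFD_FLAGS_ACCESS_HINT is a
placeholder name for however the new feature bit ends up being carried in
uffd_flags):

	/* UFFD_FLAGS_ACCESS_HINT: placeholder for the new feature bit */
	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
	if ((uffd_flags & UFFD_FLAGS_ACCESS_HINT) &&
	    (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY)))
		cp_flags |= MM_CP_WILL_NEED;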

> +
>  			/*
>  			 * In some writable, shared mappings, we might want
>  			 * to catch actual write access -- see
> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			    !pte_write(ptent) &&
> -			    can_change_pte_writable(vma, addr, ptent))
> +			    can_change_pte_writable(vma, addr, ptent)) {
>  				ptent = pte_mkwrite(ptent);
> +				if (will_need)
> +					ptent = pte_mkdirty(ptent);

Can we make this unconditional?  IOW to cover both:

  (1) When will_need is not set, or
  (2) mprotect() too

David's patch is good in that we merged the unprotect and CoW.  However
that's not complete because the dirty bit ops are missing.

Here IMHO we should have a standalone patch that just adds the dirty bit
into this logic when we grant the write bit.  IMHO it'll make the
write+dirty bits coherent again in all paths.
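
IOW, roughly something like this (untested sketch of the suggestion, on
top of this patch):

	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
	    !pte_write(ptent) &&
	    can_change_pte_writable(vma, addr, ptent)) {
		ptent = pte_mkwrite(ptent);
		/* dirty whenever the write bit is granted, not only for will_need */
		ptent = pte_mkdirty(ptent);
	}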

> +			}
>  
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>  			if (pte_needs_flush(oldpte, ptent))
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 954c6980b29f..e0492f5f06a0 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -749,6 +749,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  	bool enable_wp = uffd_flags & UFFD_FLAGS_WP;
>  	struct vm_area_struct *dst_vma;
>  	unsigned long page_mask;
> +	unsigned long cp_flags;
>  	struct mmu_gather tlb;
>  	pgprot_t newprot;
>  	int err;
> @@ -795,9 +796,12 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
>  	else
>  		newprot = vm_get_page_prot(dst_vma->vm_flags);
>  
> +	cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
> +	if (uffd_flags & (UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY))
> +		cp_flags |= MM_CP_WILL_NEED;
> +
>  	tlb_gather_mmu(&tlb, dst_mm);
> -	change_protection(&tlb, dst_vma, start, start + len, newprot,
> -			  enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
> +	change_protection(&tlb, dst_vma, start, start + len, newprot, cp_flags);
>  	tlb_finish_mmu(&tlb);
>  
>  	err = 0;
> -- 
> 2.25.1
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages
  2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
@ 2022-07-19 20:49   ` Peter Xu
  0 siblings, 0 replies; 37+ messages in thread
From: Peter Xu @ 2022-07-19 20:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, David Hildenbrand, Peter Zijlstra,
	Thomas Gleixner, Will Deacon, Yu Zhao, Nick Piggin

On Mon, Jul 18, 2022 at 05:02:00AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When using userfaultfd write-(un)protect ioctl, try to change the PTE to
> be writable. This would save a page-fault afterwards.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>

Acked-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-19 20:47   ` Peter Xu
@ 2022-07-20  9:39     ` David Hildenbrand
  2022-07-20 13:10       ` Peter Xu
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20  9:39 UTC (permalink / raw)
  To: Peter Xu, Nadav Amit
  Cc: linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 19.07.22 22:47, Peter Xu wrote:
> On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>>
>> When userfaultfd makes a PTE writable, it can now change the PTE
>> directly, in some cases, without going triggering a page-fault first.
>> Yet, doing so might leave the PTE that was write-unprotected as old and
>> clean. At least on x86, this would cause a >500 cycles overhead when the
>> PTE is first accessed.
>>
>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>> gets a hint that the page is likely to be used. Avoid changing the PTE
>> to young and dirty in other cases to avoid excessive writeback and
>> messing with the page reclamation logic.
>>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> ---
>>  include/linux/mm.h | 2 ++
>>  mm/mprotect.c      | 9 ++++++++-
>>  mm/userfaultfd.c   | 8 ++++++--
>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9cc02a7e503b..4afd75ce5875 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>  /* Whether this change is for write protecting */
>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>> +/* Whether to try to mark entries as dirty as they are to be written */
>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>  					    MM_CP_UFFD_WP_RESOLVE)
>>  
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 996a97e213ad..34c2dfb68c42 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>  
>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>  
>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  				ptent = pte_clear_uffd_wp(ptent);
>>  			}
>>  
>> +			if (will_need)
>> +				ptent = pte_mkyoung(ptent);
> 
> For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
> internal flags used with or without the new feature bit set.  It means even
> with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
> that some kind of a light abi change?
> 
> I'd suggest we only set will_need if ACCESS_HINT is set.
> 
>> +
>>  			/*
>>  			 * In some writable, shared mappings, we might want
>>  			 * to catch actual write access -- see
>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>  			 */
>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>  			    !pte_write(ptent) &&
>> -			    can_change_pte_writable(vma, addr, ptent))
>> +			    can_change_pte_writable(vma, addr, ptent)) {
>>  				ptent = pte_mkwrite(ptent);
>> +				if (will_need)
>> +					ptent = pte_mkdirty(ptent);
> 
> Can we make this unconditional?  IOW to cover both:
> 
>   (1) When will_need is not set, or
>   (2) mprotect() too
> 
> David's patch is good in that we merged the unprotect and CoW.  However
> that's not complete because the dirty bit ops are missing.
> 
> Here IMHO we should have a standalone patch to just add the dirty bit into
> this logic when we'll grant write bit.  IMHO it'll make the write+dirty
> bits coherent again in all paths.

I'm not sure I follow.

We *surely* don't want to dirty random pages (especially once in the
pagecache/swapcache) simply because we change protection.

Just like we don't set all pages write+dirty in a writable VMA on a read
fault.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
  2022-07-19 20:47   ` Peter Xu
@ 2022-07-20  9:42   ` David Hildenbrand
  2022-07-20 17:36     ` Nadav Amit
  1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20  9:42 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Xu, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 18.07.22 14:01, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> When userfaultfd makes a PTE writable, it can now change the PTE
> directly, in some cases, without going triggering a page-fault first.
> Yet, doing so might leave the PTE that was write-unprotected as old and
> clean. At least on x86, this would cause a >500 cycles overhead when the
> PTE is first accessed.
> 
> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> gets a hint that the page is likely to be used. Avoid changing the PTE
> to young and dirty in other cases to avoid excessive writeback and
> messing with the page reclamation logic.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> ---
>  include/linux/mm.h | 2 ++
>  mm/mprotect.c      | 9 ++++++++-
>  mm/userfaultfd.c   | 8 ++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9cc02a7e503b..4afd75ce5875 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  /* Whether this change is for write protecting */
>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> +/* Whether to try to mark entries as dirty as they are to be written */
> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>  					    MM_CP_UFFD_WP_RESOLVE)
>  
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 996a97e213ad..34c2dfb68c42 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>  
>  	tlb_change_page_size(tlb, PAGE_SIZE);
>  
> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  				ptent = pte_clear_uffd_wp(ptent);
>  			}
>  
> +			if (will_need)
> +				ptent = pte_mkyoung(ptent);
> +
>  			/*
>  			 * In some writable, shared mappings, we might want
>  			 * to catch actual write access -- see
> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>  			 */
>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>  			    !pte_write(ptent) &&


Why would we want to check if we can set something writable if it
already *is* writable? That doesn't make sense to me.

> -			    can_change_pte_writable(vma, addr, ptent))
> +			    can_change_pte_writable(vma, addr, ptent)) {
>  				ptent = pte_mkwrite(ptent);
> +				if (will_need)
> +					ptent = pte_mkdirty(ptent);
> +			}

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20  9:39     ` David Hildenbrand
@ 2022-07-20 13:10       ` Peter Xu
  2022-07-20 15:10         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-20 13:10 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 11:39:23AM +0200, David Hildenbrand wrote:
> On 19.07.22 22:47, Peter Xu wrote:
> > On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >>
> >> When userfaultfd makes a PTE writable, it can now change the PTE
> >> directly, in some cases, without going triggering a page-fault first.
> >> Yet, doing so might leave the PTE that was write-unprotected as old and
> >> clean. At least on x86, this would cause a >500 cycles overhead when the
> >> PTE is first accessed.
> >>
> >> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
> >> gets a hint that the page is likely to be used. Avoid changing the PTE
> >> to young and dirty in other cases to avoid excessive writeback and
> >> messing with the page reclamation logic.
> >>
> >> Cc: Andrea Arcangeli <aarcange@redhat.com>
> >> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Andy Lutomirski <luto@kernel.org>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: David Hildenbrand <david@redhat.com>
> >> Cc: Peter Xu <peterx@redhat.com>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Cc: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Will Deacon <will@kernel.org>
> >> Cc: Yu Zhao <yuzhao@google.com>
> >> Cc: Nick Piggin <npiggin@gmail.com>
> >> ---
> >>  include/linux/mm.h | 2 ++
> >>  mm/mprotect.c      | 9 ++++++++-
> >>  mm/userfaultfd.c   | 8 ++++++--
> >>  3 files changed, 16 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 9cc02a7e503b..4afd75ce5875 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
> >>  /* Whether this change is for write protecting */
> >>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
> >>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
> >> +/* Whether to try to mark entries as dirty as they are to be written */
> >> +#define  MM_CP_WILL_NEED		   (1UL << 4)
> >>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
> >>  					    MM_CP_UFFD_WP_RESOLVE)
> >>  
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index 996a97e213ad..34c2dfb68c42 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
> >>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> >>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
> >> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
> >>  
> >>  	tlb_change_page_size(tlb, PAGE_SIZE);
> >>  
> >> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  				ptent = pte_clear_uffd_wp(ptent);
> >>  			}
> >>  
> >> +			if (will_need)
> >> +				ptent = pte_mkyoung(ptent);
> > 
> > For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
> > internal flags used with or without the new feature bit set.  It means even
> > with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
> > that some kind of a light abi change?
> > 
> > I'd suggest we only set will_need if ACCESS_HINT is set.
> > 
> >> +
> >>  			/*
> >>  			 * In some writable, shared mappings, we might want
> >>  			 * to catch actual write access -- see
> >> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> >>  			 */
> >>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
> >>  			    !pte_write(ptent) &&
> >> -			    can_change_pte_writable(vma, addr, ptent))
> >> +			    can_change_pte_writable(vma, addr, ptent)) {
> >>  				ptent = pte_mkwrite(ptent);
> >> +				if (will_need)
> >> +					ptent = pte_mkdirty(ptent);
> > 
> > Can we make this unconditional?  IOW to cover both:
> > 
> >   (1) When will_need is not set, or
> >   (2) mprotect() too
> > 
> > David's patch is good in that we merged the unprotect and CoW.  However
> > that's not complete because the dirty bit ops are missing.
> > 
> > Here IMHO we should have a standalone patch to just add the dirty bit into
> > this logic when we'll grant write bit.  IMHO it'll make the write+dirty
> > bits coherent again in all paths.
> 
> I'm not sure I follow.
> 
> We *surely* don't want to dirty random pages (especially once in the
> pagecache/swapcache) simply because we change protection.
> 
> Just like we don't set all pages write+dirty in a writable VMA on a read
> fault.

IMO unprotect (in generic mprotect form or uffd form) is a stronger sign
of the page being written than read faults are, as many of them happen
because the page is being written and a message was generated.

But yeah you have a point too, maybe we shouldn't assume such a condition.
Especially as long as we won't set write=1 without soft-dirty tracking
enabled, I think it should be safe.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 13:10       ` Peter Xu
@ 2022-07-20 15:10         ` David Hildenbrand
  2022-07-20 19:15           ` Peter Xu
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 15:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 15:10, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 11:39:23AM +0200, David Hildenbrand wrote:
>> On 19.07.22 22:47, Peter Xu wrote:
>>> On Mon, Jul 18, 2022 at 05:01:59AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>>
>>>> When userfaultfd makes a PTE writable, it can now change the PTE
>>>> directly, in some cases, without going triggering a page-fault first.
>>>> Yet, doing so might leave the PTE that was write-unprotected as old and
>>>> clean. At least on x86, this would cause a >500 cycles overhead when the
>>>> PTE is first accessed.
>>>>
>>>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>>>> gets a hint that the page is likely to be used. Avoid changing the PTE
>>>> to young and dirty in other cases to avoid excessive writeback and
>>>> messing with the page reclamation logic.
>>>>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Andy Lutomirski <luto@kernel.org>
>>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>>> Cc: Will Deacon <will@kernel.org>
>>>> Cc: Yu Zhao <yuzhao@google.com>
>>>> Cc: Nick Piggin <npiggin@gmail.com>
>>>> ---
>>>>  include/linux/mm.h | 2 ++
>>>>  mm/mprotect.c      | 9 ++++++++-
>>>>  mm/userfaultfd.c   | 8 ++++++--
>>>>  3 files changed, 16 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 9cc02a7e503b..4afd75ce5875 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>>>  /* Whether this change is for write protecting */
>>>>  #define  MM_CP_UFFD_WP                     (1UL << 2) /* do wp */
>>>>  #define  MM_CP_UFFD_WP_RESOLVE             (1UL << 3) /* Resolve wp */
>>>> +/* Whether to try to mark entries as dirty as they are to be written */
>>>> +#define  MM_CP_WILL_NEED		   (1UL << 4)
>>>>  #define  MM_CP_UFFD_WP_ALL                 (MM_CP_UFFD_WP | \
>>>>  					    MM_CP_UFFD_WP_RESOLVE)
>>>>  
>>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>>> index 996a97e213ad..34c2dfb68c42 100644
>>>> --- a/mm/mprotect.c
>>>> +++ b/mm/mprotect.c
>>>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>>>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>>>  	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>>> +	bool will_need = cp_flags & MM_CP_WILL_NEED;
>>>>  
>>>>  	tlb_change_page_size(tlb, PAGE_SIZE);
>>>>  
>>>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  				ptent = pte_clear_uffd_wp(ptent);
>>>>  			}
>>>>  
>>>> +			if (will_need)
>>>> +				ptent = pte_mkyoung(ptent);
>>>
>>> For uffd path, UFFD_FLAGS_ACCESS_LIKELY|UFFD_FLAGS_WRITE_LIKELY are new
>>> internal flags used with or without the new feature bit set.  It means even
>>> with !ACCESS_HINT we'll start to set young bit while we used not to?  Is
>>> that some kind of a light abi change?
>>>
>>> I'd suggest we only set will_need if ACCESS_HINT is set.
>>>
>>>> +
>>>>  			/*
>>>>  			 * In some writable, shared mappings, we might want
>>>>  			 * to catch actual write access -- see
>>>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>>>  			 */
>>>>  			if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>>>  			    !pte_write(ptent) &&
>>>> -			    can_change_pte_writable(vma, addr, ptent))
>>>> +			    can_change_pte_writable(vma, addr, ptent)) {
>>>>  				ptent = pte_mkwrite(ptent);
>>>> +				if (will_need)
>>>> +					ptent = pte_mkdirty(ptent);
>>>
>>> Can we make this unconditional?  IOW to cover both:
>>>
>>>   (1) When will_need is not set, or
>>>   (2) mprotect() too
>>>
>>> David's patch is good in that we merged the unprotect and CoW.  However
>>> that's not complete because the dirty bit ops are missing.
>>>
>>> Here IMHO we should have a standalone patch to just add the dirty bit into
>>> this logic when we'll grant write bit.  IMHO it'll make the write+dirty
>>> bits coherent again in all paths.
>>
>> I'm not sure I follow.
>>
>> We *surely* don't want to dirty random pages (especially once in the
>> pagecache/swapcache) simply because we change protection.
>>
>> Just like we don't set all pages write+dirty in a writable VMA on a read
>> fault.
> 
> IMO unmprotect (in generic mprotect form or uffd form) has a stronger sign
> of page being written, unlike read faults, as many of them happen because
> page being written and message generated.

I'm sorry, but I am very skeptical about such statements. I don't buy it.

> 
> But yeah you have a point too, maybe we shouldn't assume such a condition.
> Especially as long as we won't set write=1 without soft-dirty tracking
> enabled, I think it should be safe.

For pagecache pages it may as well be *plain wrong* to bypass the write
fault handler and simply mark pages dirty+map them writable.

Please, let's keep this protection change handler here as simple as
possible and *not* try to replicate the whole write fault handler logic
in here by relying on statements as above.

If we try optimizing for corner cases by making the implementation here
overly complicated, then we are clearly doing something wrong.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
@ 2022-07-20 15:19   ` David Hildenbrand
  2022-07-20 17:25     ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 15:19 UTC (permalink / raw)
  To: Nadav Amit, linux-mm
  Cc: linux-kernel, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Nadav Amit, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Xu, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 18.07.22 14:02, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Anonymous pages might have the dirty bit clear, but this should not
> prevent mprotect from making them writable if they are exclusive.
> Therefore, skip the test whether the page is dirty in this case.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  mm/mprotect.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 34c2dfb68c42..da5b9bf8204f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>  
>  	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>  
> -	if (pte_protnone(pte) || !pte_dirty(pte))
> +	if (pte_protnone(pte))
>  		return false;
>  
>  	/* Do we need write faults for softdirty tracking? */
> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>  		page = vm_normal_page(vma, addr, pte);
>  		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>  			return false;
> -	}
> +	} else if (!pte_dirty(pte))
> +		return false;
>  
>  	return true;
>  }

When I wrote that code, I was wondering how often that would actually
happen in practice -- and if we care about optimizing that. Do you have
a gut feeling in which scenarios this would happen and if we care?

If the page is in the swapcache and was swapped out, you'd be requiring
a writeback even though nobody modified the page and possibly nobody is
going to do so in the near future.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-20 15:19   ` David Hildenbrand
@ 2022-07-20 17:25     ` Nadav Amit
  2022-07-21  7:45       ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 17:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 14:02, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Anonymous pages might have the dirty bit clear, but this should not
>> prevent mprotect from making them writable if they are exclusive.
>> Therefore, skip the test whether the page is dirty in this case.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> mm/mprotect.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 34c2dfb68c42..da5b9bf8204f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> 
>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>> 
>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>> +	if (pte_protnone(pte))
>> 		return false;
>> 
>> 	/* Do we need write faults for softdirty tracking? */
>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> 		page = vm_normal_page(vma, addr, pte);
>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> 			return false;
>> -	}
>> +	} else if (!pte_dirty(pte))
>> +		return false;
>> 
>> 	return true;
>> }
> 
> When I wrote that code, I was wondering how often that would actually
> happen in practice -- and if we care about optimizing that. Do you have
> a gut feeling in which scenarios this would happen and if we care?
> 
> If the page is in the swapcache and was swapped out, you'd be requiring
> a writeback even though nobody modified the page and possibly isn't
> going to do so in the near future.

So here is my due diligence: I did not really encounter a scenario in which
it showed up. When I looked at your code, I assumed this was an oversight
and not a thoughtful decision. For me the issue is more of the discrepancy
between how a certain page is handled before and after it was paged out.

The way that I see it, there is a tradeoff in the way the dirty-bit should
be handled:
(1) Writable-clean PTEs introduce some non-negligible overhead.
(2) Marking a PTE dirty speculatively would require a write back.

… But this tradeoff should not affect whether a PTE is writable, i.e.,
mapping the PTE as writable-clean should not cause a writeback. In other
words, if you are concerned about unnecessary writebacks, which I think is a
fair concern, then do not set the dirty-bit. In a later patch I try to avoid
TLB flushes on clean-writable entries that are write-protected.

So I do not think that the writeback you mentioned should be a real issue.
Yet if you think that using the fact that the page is not dirty is a good
heuristic to avoid future TLB flushes (for P->NP; as I said there is a
solution for RW->RO), or if you are concerned about the cost of
vm_normal_page(), perhaps those are valid concerns (although I do not think
so).

--

[ Regarding (1): After some discussions with Peter and reading more code, I
thought at some point that perhaps avoiding writable-clean PTEs as much as
possible makes sense [*], since setting the dirty-bit costs ~550 cycles and
a page fault is not a lot more than 1000. But with all the mitigations (and
after adding IBRS for retbleed) page-fault entry is kind of expensive.

[*] At least on x86 ]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20  9:42   ` David Hildenbrand
@ 2022-07-20 17:36     ` Nadav Amit
  2022-07-20 18:00       ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 17:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 2:42 AM, David Hildenbrand <david@redhat.com> wrote:

> On 18.07.22 14:01, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> When userfaultfd makes a PTE writable, it can now change the PTE
>> directly, in some cases, without going triggering a page-fault first.
>> Yet, doing so might leave the PTE that was write-unprotected as old and
>> clean. At least on x86, this would cause a >500 cycles overhead when the
>> PTE is first accessed.
>> 
>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>> gets a hint that the page is likely to be used. Avoid changing the PTE
>> to young and dirty in other cases to avoid excessive writeback and
>> messing with the page reclamation logic.
>> 
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: Yu Zhao <yuzhao@google.com>
>> Cc: Nick Piggin <npiggin@gmail.com>
>> ---
>> include/linux/mm.h | 2 ++
>> mm/mprotect.c | 9 ++++++++-
>> mm/userfaultfd.c | 8 ++++++--
>> 3 files changed, 16 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 9cc02a7e503b..4afd75ce5875 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>> /* Whether this change is for write protecting */
>> #define MM_CP_UFFD_WP (1UL << 2) /* do wp */
>> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
>> +/* Whether to try to mark entries as dirty as they are to be written */
>> +#define MM_CP_WILL_NEED (1UL << 4)
>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
>> MM_CP_UFFD_WP_RESOLVE)
>> 
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 996a97e213ad..34c2dfb68c42 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>> + bool will_need = cp_flags & MM_CP_WILL_NEED;
>> 
>> tlb_change_page_size(tlb, PAGE_SIZE);
>> 
>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> ptent = pte_clear_uffd_wp(ptent);
>> }
>> 
>> + if (will_need)
>> + ptent = pte_mkyoung(ptent);
>> +
>> /*
>> * In some writable, shared mappings, we might want
>> * to catch actual write access -- see
>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> */
>> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>> !pte_write(ptent) &&
> 
> 
> Why would we want to check if we can set something writable if it
> already *is* writable? That doesn't make sense to me.

We check !pte_write(). What am I missing in your question?

Having said that, I do notice now that pte_mkdirty() should not be done
only when this condition is fulfilled. Instead we should just have
something like:

                       if (will_need) {
                               ptent = pte_mkyoung(ptent);
                               if (pte_write(ptent))
                                       ptent = pte_mkdirty(ptent);
                       }

But I do not think this answers your question, which I did not understand.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 17:36     ` Nadav Amit
@ 2022-07-20 18:00       ` David Hildenbrand
  2022-07-20 18:09         ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 18:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 19:36, Nadav Amit wrote:
> On Jul 20, 2022, at 2:42 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.07.22 14:01, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> When userfaultfd makes a PTE writable, it can now change the PTE
>>> directly, in some cases, without going triggering a page-fault first.
>>> Yet, doing so might leave the PTE that was write-unprotected as old and
>>> clean. At least on x86, this would cause a >500 cycles overhead when the
>>> PTE is first accessed.
>>>
>>> Use MM_CP_WILL_NEED to set the PTE as young and dirty when userfaultfd
>>> gets a hint that the page is likely to be used. Avoid changing the PTE
>>> to young and dirty in other cases to avoid excessive writeback and
>>> messing with the page reclamation logic.
>>>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Yu Zhao <yuzhao@google.com>
>>> Cc: Nick Piggin <npiggin@gmail.com>
>>> ---
>>> include/linux/mm.h | 2 ++
>>> mm/mprotect.c | 9 ++++++++-
>>> mm/userfaultfd.c | 8 ++++++--
>>> 3 files changed, 16 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 9cc02a7e503b..4afd75ce5875 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1988,6 +1988,8 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
>>> /* Whether this change is for write protecting */
>>> #define MM_CP_UFFD_WP (1UL << 2) /* do wp */
>>> #define MM_CP_UFFD_WP_RESOLVE (1UL << 3) /* Resolve wp */
>>> +/* Whether to try to mark entries as dirty as they are to be written */
>>> +#define MM_CP_WILL_NEED (1UL << 4)
>>> #define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
>>> MM_CP_UFFD_WP_RESOLVE)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 996a97e213ad..34c2dfb68c42 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -82,6 +82,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>> bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
>>> bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
>>> bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
>>> + bool will_need = cp_flags & MM_CP_WILL_NEED;
>>>
>>> tlb_change_page_size(tlb, PAGE_SIZE);
>>>
>>> @@ -172,6 +173,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>> ptent = pte_clear_uffd_wp(ptent);
>>> }
>>>
>>> + if (will_need)
>>> + ptent = pte_mkyoung(ptent);
>>> +
>>> /*
>>> * In some writable, shared mappings, we might want
>>> * to catch actual write access -- see
>>> @@ -187,8 +191,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>>> */
>>> if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
>>> !pte_write(ptent) &&
>>
>>
>> Why would we want to check if we can set something writable if it
>> already *is* writable? That doesn't make sense to me.
> 
> We check !pte_write(). What am I missing in your question?

My patch review skills have seen better days. I thought you'd be
removing the pte_write() check ... :( Tired eyes ...

> 
> Having said that, I do notice now that pte_mkdirty() should not be done
> only this condition is fulfilled. Instead we should just have
> something like:
> 
>                        if (will_need) {
>                                ptent = pte_mkyoung(ptent);
>                                if (pte_write(ptent))
>                                        ptent = pte_mkdirty(ptent);
>                        }

As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
page in current code ... so I assume you would have that code before the
actual pte_mkwrite() logic, correct?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 18:00       ` David Hildenbrand
@ 2022-07-20 18:09         ` Nadav Amit
  2022-07-20 18:11           ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 18:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On Jul 20, 2022, at 11:00 AM, David Hildenbrand <david@redhat.com> wrote:

> My patch review skills have seen better days. I thought you'd be
> removing the pte_write() check ... :( Tired eyes ...
> 
>> Having said that, I do notice now that pte_mkdirty() should not be done
>> only when this condition is fulfilled. Instead we should just have
>> something like:
>> 
>> if (will_need) {
>> ptent = pte_mkyoung(ptent);
>> if (pte_write(ptent))
>> ptent = pte_mkdirty(ptent);
>> }
> 
> As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
> page in current code ... so I assume you would have that code before the
> actual pte_mkwrite() logic, correct?

No, I thought this should go after for 2 reasons:

1. You want to allow the PTE to be made writable following the
can_change_pte_writable().

2. You do not want to set a non-writable PTE as dirty, especially since it
might then be used to determine that the PTE can become writable. Doing so
would circumvent cases in which set_page_dirty() needs to be called and
break things.
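
To make the ordering concrete, here is a rough sketch of what I have in
mind for change_pte_range() (names are the ones from the patches above;
the exact placement is hypothetical, not the final code):

	/* decide on the write-bit first ... */
	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) &&
	    !pte_write(ptent) &&
	    can_change_pte_writable(vma, addr, ptent))
		ptent = pte_mkwrite(ptent);

	/* ... and only then mark young/dirty, dirty only if writable */
	if (will_need) {
		ptent = pte_mkyoung(ptent);
		if (pte_write(ptent))
			ptent = pte_mkdirty(ptent);
	}

This way a PTE that stays read-only is never marked dirty behind the
back of whoever relies on the write-fault path.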


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 18:09         ` Nadav Amit
@ 2022-07-20 18:11           ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 18:11 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 20:09, Nadav Amit wrote:
> On Jul 20, 2022, at 11:00 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> My patch review skills have seen better days. I thought you'd be
>> removing the pte_write() check ... :( Tired eyes ...
>>
>>> Having said that, I do notice now that pte_mkdirty() should not be done
>>> only when this condition is fulfilled. Instead we should just have
>>> something like:
>>>
>>> if (will_need) {
>>> ptent = pte_mkyoung(ptent);
>>> if (pte_write(ptent))
>>> ptent = pte_mkdirty(ptent);
>>> }
>>
>> As can_change_pte_writable() will fail if it stumbles over a !pte_dirty
>> page in current code ... so I assume you would have that code before the
>> actual pte_mkwrite() logic, correct?
> 
> No, I thought this should go after for 2 reasons:
> 
> 1. You want to allow the PTE to be made writable following the
> can_change_pte_writable().
> 
> 2. You do not want to set a non-writable PTE as dirty, especially since it
> might then be used to determine that the PTE can become writable. Doing so
> would circumvent cases in which set_page_dirty() needs to be called and
> break things down.

Then I'm confused how can_change_pte_writable() would ever allow for
that. Best to show me the code :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 15:10         ` David Hildenbrand
@ 2022-07-20 19:15           ` Peter Xu
  2022-07-20 19:33             ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-20 19:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
> For pagecache pages it may as well be *plain wrong* to bypass the write
> fault handler and simply mark pages dirty+map them writable.

Could you elaborate?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:15           ` Peter Xu
@ 2022-07-20 19:33             ` David Hildenbrand
  2022-07-20 19:48               ` Peter Xu
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 19:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 21:15, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>> For pagecache pages it may as well be *plain wrong* to bypass the write
>> fault handler and simply mark pages dirty+map them writable.
> 
> Could you elaborate?

Write-fault handling for some filesystems (that even require this
"slow path") is a bit special.

For example, do_shared_fault() might have to call page_mkwrite().

AFAIK file systems use that for lazy allocation of disk blocks.
If you simply go ahead and map a !dirty pagecache page writable
and mark it dirty, it will not trigger page_mkwrite() and you might
end up corrupting data.
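
To illustrate, this is roughly the path that such a bypass skips. I am
sketching do_shared_fault() from memory, with locking and error handling
omitted, so don't take it as the exact mm/memory.c code:

	/* do_shared_fault(), roughly: */
	if (vma->vm_ops->page_mkwrite) {
		tmp = do_page_mkwrite(vmf);	/* FS may allocate blocks here */
		if (unlikely(!tmp || (tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))))
			return tmp;
	}
	ret |= finish_fault(vmf);		/* map the page writable */
	ret |= fault_dirty_shared_page(vmf);	/* set_page_dirty() + dirty throttling */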

That's why the old change_pte_range() code never touched
anything if the pte wasn't already dirty. Because as long as it's not writable,
the FS might have to be informed about the write-unprotect.

And we end up in this case for VM_SHARED with vma_wants_writenotify(),
where we, for example, check

/* The backer wishes to know when pages are first written to? */
if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
	return 1;


Long story short, we should be really careful with write-fault handler bypasses,
especially when deciding to set dirty bits. For pagecache pages, we have to be
especially careful.

For exclusive anon pages it's mostly ok, because wp_page_reuse()
doesn't really contain that much magic. The only thing to consider for ordinary
mprotect() is that there is -- IMHO -- absolutely no guarantee that someone will
write to a specific write-unprotected page soon. For uffd-wp-unprotect it might be
easier to guess, especially, if we un-protect only a single page.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:33             ` David Hildenbrand
@ 2022-07-20 19:48               ` Peter Xu
  2022-07-20 19:55                 ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Xu @ 2022-07-20 19:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
> On 20.07.22 21:15, Peter Xu wrote:
> > On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
> >> For pagecache pages it may as well be *plain wrong* to bypass the write
> >> fault handler and simply mark pages dirty+map them writable.
> > 
> > Could you elaborate?
> 
> Write-fault handling for some filesystems (that even require this
> "slow path") is a bit special.
> 
> For example, do_shared_fault() might have to call page_mkwrite().
> 
> AFAIK file systems use that for lazy allocation of disk blocks.
> If you simply go ahead and map a !dirty pagecache page writable
> and mark it dirty, it will not trigger page_mkwrite() and you might
> end up corrupting data.
> 
> That's why the old change_pte_range() code never touched
> anything if the pte wasn't already dirty.

I don't think that pte_dirty() check was for the pagecache code. For any fs
that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
return 1, so we'll never try to add write bit, hence we'll never even try
to check pte_dirty().

> Because as long as it's not writable,
> the FS might have to be informed about the write-unprotect.
> 
> And we end up in this case for VM_SHARED with vma_wants_writenotify(),
> where we, for example, check
> 
> /* The backer wishes to know when pages are first written to? */
> if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
> 	return 1;
> 
> 
> Long story short, we should be really careful with write-fault handler bypasses,
> especially when deciding to set dirty bits. For pagecache pages, we have to be
> especially careful.

Since you mentioned page_mkwrite, IMHO it's really the write bit, not the
dirty bit, that matters here (and IMHO that's why it's called page_mkwrite()
and not page_mkdirty()).  Here Nadav's patch added pte_mkdirty() only if
pte_mkwrite() happens.  So I'm a bit confused about what your worry is, and
what you're against doing.

Say, even with my original proposal to set dirty unconditionally, it'll
still be after the pte_mkwrite().  I don't see how that could affect
page_mkwrite, not to mention it'll not even reach there.

> 
> For exclusive anon pages it's mostly ok, because wp_page_reuse()
> doesn't really contain that much magic. The only thing to consider for ordinary
> mprotect() is that there is -- IMHO -- absolutely no guarantee that someone will
> write to a specific write-unprotected page soon. For uffd-wp-unprotect it might be
> easier to guess, especially, if we un-protect only a single page.

Yeh, as mentioned I think that's a valid point - looks good to me to attach
the dirty bit only when with a hint.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:48               ` Peter Xu
@ 2022-07-20 19:55                 ` David Hildenbrand
  2022-07-20 20:22                   ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 19:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Nadav Amit, linux-mm, linux-kernel, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Nadav Amit, Andrea Arcangeli, Andrew Cooper,
	Andy Lutomirski, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Will Deacon, Yu Zhao, Nick Piggin

On 20.07.22 21:48, Peter Xu wrote:
> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>> On 20.07.22 21:15, Peter Xu wrote:
>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>> fault handler and simply mark pages dirty+map them writable.
>>>
>>> Could you elaborate?
>>
>> Write-fault handling for some filesystems (that even require this
>> "slow path") is a bit special.
>>
>> For example, do_shared_fault() might have to call page_mkwrite().
>>
>> AFAIK file systems use that for lazy allocation of disk blocks.
>> If you simply go ahead and map a !dirty pagecache page writable
>> and mark it dirty, it will not trigger page_mkwrite() and you might
>> end up corrupting data.
>>
>> That's why the old change_pte_range() code never touched
>> anything if the pte wasn't already dirty.
> 
> I don't think that pte_dirty() check was for the pagecache code. For any fs
> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
> return 1, so we'll never try to add write bit, hence we'll never even try
> to check pte_dirty().
> 

I might be too tired, but the whole reason we had this magic before my
commit in place was only for the pagecache.

With vma_wants_writenotify()=0 you can directly map the pages writable
and don't have to do these advanced checks here. In a writable
MAP_SHARED VMA you'll already have pte_write().

We only get !pte_write() in case we have vma_wants_writenotify()=1 ...

  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);

and that's the code that checked the dirty bit after all to decide --
amongst other things -- if we can simply map it writable without going
via the write fault handler and triggering do_shared_fault() .
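
For reference, the relevant checks in vma_wants_writenotify() are roughly
the following (trimmed and quoted from memory, so the exact conditions may
differ):

	/* If it was private or non-writable, the write bit is already clear */
	if ((vm_flags & (VM_WRITE|VM_SHARED)) != (VM_WRITE|VM_SHARED))
		return 0;

	/* The backer wishes to know when pages are first written to? */
	if (vm_ops && (vm_ops->page_mkwrite || vm_ops->pfn_mkwrite))
		return 1;

	/* Do we need to track softdirty? */
	if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !(vm_flags & VM_SOFTDIRTY))
		return 1;

	/* Specialty mapping? */
	if (vm_flags & VM_PFNMAP)
		return 0;

	/* Can the mapping track the dirty pages? */
	return vma->vm_file && vma->vm_file->f_mapping &&
	       mapping_can_writeback(vma->vm_file->f_mapping);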

See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.

But yeah, it's all confusing so I might just be wrong regarding
pagecache pages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 19:55                 ` David Hildenbrand
@ 2022-07-20 20:22                   ` Nadav Amit
  2022-07-20 20:38                     ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 20:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:

> On 20.07.22 21:48, Peter Xu wrote:
>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>> On 20.07.22 21:15, Peter Xu wrote:
>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>> fault handler and simply mark pages dirty+map them writable.
>>>> 
>>>> Could you elaborate?
>>> 
>>> Write-fault handling for some filesystems (that even require this
>>> "slow path") is a bit special.
>>> 
>>> For example, do_shared_fault() might have to call page_mkwrite().
>>> 
>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>> If you simply go ahead and map a !dirty pagecache page writable
>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>> end up corrupting data.
>>> 
>>> That's why the old change_pte_range() code never touched
>>> anything if the pte wasn't already dirty.
>> 
>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>> return 1, so we'll never try to add write bit, hence we'll never even try
>> to check pte_dirty().
> 
> I might be too tired, but the whole reason we had this magic before my
> commit in place was only for the pagecache.
> 
> With vma_wants_writenotify()=0 you can directly map the pages writable
> and don't have to do these advanced checks here. In a writable
> MAP_SHARED VMA you'll already have pte_write().
> 
> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
> 
>  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
> 
> and that's the code that checked the dirty bit after all to decide --
> amongst other things -- if we can simply map it writable without going
> via the write fault handler and triggering do_shared_fault() .
> 
> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.

I thought you want to get rid of it at least for anonymous pages. No?

> 
> But yeah, it's all confusing so I might just be wrong regarding
> pagecache pages.

Just to note: I am not very courageous and I did not intend to change the
condition for when non-anonymous pages are set as writable. That’s the
reason I did not change the dirty bit for non-writable non-anonymous entries
(as Peter said). And that’s the reason that setting the dirty bit (at least
as I should have done it) is only performed after we made the decision on the
write-bit.

IOW, after you made your decision about the write-bit, then and only then
you may be able to set the dirty bit for writable entries. Since the entry
is already writeable (i.e., can be written without a fault later directly
from userspace), there should be no concern of correctness when you set it.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:22                   ` Nadav Amit
@ 2022-07-20 20:38                     ` David Hildenbrand
  2022-07-20 20:56                       ` Nadav Amit
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-20 20:38 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On 20.07.22 22:22, Nadav Amit wrote:
> On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 20.07.22 21:48, Peter Xu wrote:
>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>>> On 20.07.22 21:15, Peter Xu wrote:
>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>>> fault handler and simply mark pages dirty+map them writable.
>>>>>
>>>>> Could you elaborate?
>>>>
>>>> Write-fault handling for some filesystems (that even require this
>>>> "slow path") is a bit special.
>>>>
>>>> For example, do_shared_fault() might have to call page_mkwrite().
>>>>
>>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>>> If you simply go ahead and map a !dirty pagecache page writable
>>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>>> end up corrupting data.
>>>>
>>>> That's why the old change_pte_range() code never touched
>>>> anything if the pte wasn't already dirty.
>>>
>>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>>> return 1, so we'll never try to add write bit, hence we'll never even try
>>> to check pte_dirty().
>>
>> I might be too tired, but the whole reason we had this magic before my
>> commit in place was only for the pagecache.
>>
>> With vma_wants_writenotify()=0 you can directly map the pages writable
>> and don't have to do these advanced checks here. In a writable
>> MAP_SHARED VMA you'll already have pte_write().
>>
>> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
>>
>>  try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>
>> and that's the code that checked the dirty bit after all to decide --
>> amongst other things -- if we can simply map it writable without going
>> via the write fault handler and triggering do_shared_fault() .
>>
>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
> 
> I thought you want to get rid of it at least for anonymous pages. No?

Yes. Especially for any MAP_PRIVATE mappings.

If you want to write to something that's not mapped writable in a
MAP_PRIVATE mapping, it
a) Has to be an exclusive anonymous page
b) The pte has to be dirty

In any other case, you clearly missed doing the COW, or the modifications
might get lost if the PTE is not dirty.

MAP_SHARED is a bit more involved. But whether the pte is dirty might be
good enough ... but this needs a lot more care.
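
Just to spell out the MAP_PRIVATE rule as code -- a hypothetical helper
for illustration only, not an existing function:

	/* Can we force-write without going through the write-fault handler? */
	static bool can_reuse_pte_for_write(struct vm_area_struct *vma,
					    pte_t pte, struct page *page)
	{
		if (vma->vm_flags & VM_SHARED)
			return false;	/* MAP_SHARED needs a lot more care */
		/* a) exclusive anonymous page and b) pte already dirty */
		return page && PageAnon(page) && PageAnonExclusive(page) &&
		       pte_dirty(pte);
	}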

> 
>>
>> But yeah, it's all confusing so I might just be wrong regarding
>> pagecache pages.
> 
> Just to note: I am not very courageous and I did not intend to change
> condition for when non-anonymous pages are set as writable. That’s the
> reason I did not change the dirty for non-writable non-anonymous entries (as
> Peter said). And that’s the reason that setting the dirty bit (at least as I
> should have done it) is only performed after we made the decision on the
> write-bit.

Good. As long as we stick to anonymous pages I roughly know what we
can and cannot do at this point :)


The problem I see is that detecting whether we can write requires the
dirty bit ... and whether to set the dirty bit requires knowing
whether we can write.

Again, for anonymous pages we should be able to relax the "dirty"
requirement when detecting whether we can write.

> 
> IOW, after you made your decision about the write-bit, then and only then
> you may be able to set the dirty bit for writable entries. Since the entry
> is already writeable (i.e., can be written without a fault later directly
> from userspace), there should be no concern of correctness when you set it.

That makes sense to me. What keeps confusing me are architectures with
and without a hw-managed dirty bit ... :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:38                     ` David Hildenbrand
@ 2022-07-20 20:56                       ` Nadav Amit
  2022-07-21  7:52                         ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: Nadav Amit @ 2022-07-20 20:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On Jul 20, 2022, at 1:38 PM, David Hildenbrand <david@redhat.com> wrote:

> On 20.07.22 22:22, Nadav Amit wrote:
>> On Jul 20, 2022, at 12:55 PM, David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 20.07.22 21:48, Peter Xu wrote:
>>>> On Wed, Jul 20, 2022 at 09:33:35PM +0200, David Hildenbrand wrote:
>>>>> On 20.07.22 21:15, Peter Xu wrote:
>>>>>> On Wed, Jul 20, 2022 at 05:10:37PM +0200, David Hildenbrand wrote:
>>>>>>> For pagecache pages it may as well be *plain wrong* to bypass the write
>>>>>>> fault handler and simply mark pages dirty+map them writable.
>>>>>> 
>>>>>> Could you elaborate?
>>>>> 
>>>>> Write-fault handling for some filesystems (that even require this
>>>>> "slow path") is a bit special.
>>>>> 
>>>>> For example, do_shared_fault() might have to call page_mkwrite().
>>>>> 
>>>>> AFAIK file systems use that for lazy allocation of disk blocks.
>>>>> If you simply go ahead and map a !dirty pagecache page writable
>>>>> and mark it dirty, it will not trigger page_mkwrite() and you might
>>>>> end up corrupting data.
>>>>> 
>>>> That's why the old change_pte_range() code never touched
>>>> anything if the pte wasn't already dirty.
>>>> 
>>>> I don't think that pte_dirty() check was for the pagecache code. For any fs
>>>> that has page_mkwrite() defined, it'll already have vma_wants_writenotify()
>>>> return 1, so we'll never try to add write bit, hence we'll never even try
>>>> to check pte_dirty().
>>> 
>>> I might be too tired, but the whole reason we had this magic before my
>>> commit in place was only for the pagecache.
>>> 
>>> With vma_wants_writenotify()=0 you can directly map the pages writable
>>> and don't have to do these advanced checks here. In a writable
>>> MAP_SHARED VMA you'll already have pte_write().
>>> 
>>> We only get !pte_write() in case we have vma_wants_writenotify()=1 ...
>>> 
>>> try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot);
>>> 
>>> and that's the code that checked the dirty bit after all to decide --
>>> amongst other things -- if we can simply map it writable without going
>>> via the write fault handler and triggering do_shared_fault() .
>>> 
>>> See crazy/ugly FOLL_FORCE code in GUP that similarly checks the dirty bit.
>> 
>> I thought you want to get rid of it at least for anonymous pages. No?
> 
> Yes. Especially for any MAP_PRIVATE mappings.
> 
> If you want to write to something that's not mapped writable in a
> MAP_PRIVATE mapping it
> a) Has to be an exclusive anonymous page
> b) The pte has to be dirty

Do you need both conditions to be true? I thought (a) is sufficient (if
the soft-dirty and related checks succeed).

> 
> In any other case, you clearly missed to COW or the modifications might
> get lost if the PTE is not dirty.
> 
> MAP_SHARED is a bit more involved. But whether the pte is dirty might be
> good enough ... but this needs a lot more care.
> 
>>> But yeah, it's all confusing so I might just be wrong regarding
>>> pagecache pages.
>> 
>> Just to note: I am not very courageous and I did not intend to change
>> condition for when non-anonymous pages are set as writable. That’s the
>> reason I did not change the dirty for non-writable non-anonymous entries (as
>> Peter said). And that’s the reason that setting the dirty bit (at least as I
>> should have done it) is only performed after we made the decision on the
>> write-bit.
> 
> Good. As long as we stick to anonymous pages I roughly know what we
> can and cannot do at this point :)
> 
> 
> The problem I see is that detection whether we can write requires the
> dirty bit ... and whether to set the dirty bit requires the information
> whether we can write.
> 
> Again, for anonymous pages we should be able to relax the "dirty"
> requirement when detecting whether we can write.

That’s all I wanted to do there.

> 
>> IOW, after you made your decision about the write-bit, then and only then
>> you may be able to set the dirty bit for writable entries. Since the entry
>> is already writeable (i.e., can be written without a fault later directly
>> from userspace), there should be no concern of correctness when you set it.
> 
> That makes sense to me. What keeps confusing me are architectures with
> and without a hw-managed dirty bit ... :)

I don’t know which arch you have in your mind. But the moment a PTE is
writable, then marking it logically/architecturally as dirty should be
fine.

But… if the Exclusive check is not good enough for private+anon without
the “logical” dirty bit, then there would be a problem. 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable
  2022-07-20 17:25     ` Nadav Amit
@ 2022-07-21  7:45       ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-07-21  7:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux MM, LKML, Andrew Morton, Mike Rapoport, Axel Rasmussen,
	Andrea Arcangeli, Andrew Cooper, Andy Lutomirski, Dave Hansen,
	Peter Xu, Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Nick Piggin

On 20.07.22 19:25, Nadav Amit wrote:
> On Jul 20, 2022, at 8:19 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.07.22 14:02, Nadav Amit wrote:
>>> From: Nadav Amit <namit@vmware.com>
>>>
>>> Anonymous pages might have the dirty bit clear, but this should not
>>> prevent mprotect from making them writable if they are exclusive.
>>> Therefore, skip the test whether the page is dirty in this case.
>>>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Andy Lutomirski <luto@kernel.org>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Thomas Gleixner <tglx@linutronix.de>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: Yu Zhao <yuzhao@google.com>
>>> Cc: Nick Piggin <npiggin@gmail.com>
>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> ---
>>> mm/mprotect.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 34c2dfb68c42..da5b9bf8204f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>>
>>> 	VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>>>
>>> -	if (pte_protnone(pte) || !pte_dirty(pte))
>>> +	if (pte_protnone(pte))
>>> 		return false;
>>>
>>> 	/* Do we need write faults for softdirty tracking? */
>>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>> 		page = vm_normal_page(vma, addr, pte);
>>> 		if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>>> 			return false;
>>> -	}
>>> +	} else if (!pte_dirty(pte))
>>> +		return false;
>>>
>>> 	return true;
>>> }
>>
>> When I wrote that code, I was wondering how often that would actually
>> happen in practice -- and if we care about optimizing that. Do you have
>> a gut feeling in which scenarios this would happen and if we care?
>>
>> If the page is in the swapcache and was swapped out, you'd be requiring
>> a writeback even though nobody modified the page and possibly nobody is
>> going to do so in the near future.
> 
> So here is my due diligence: I did not really encounter a scenario in which
> it showed up. When I looked at your code, I assumed this was an oversight
> and not a thoughtful decision. For me the issue is more of the discrepancy
> between how a certain page is handled before and after it was pages out.
> 
> The way that I see it, there is a tradeoff in the way dirty-bit should
> be handled:
> (1) Writable-clean PTEs introduce some non-negligible overhead.
> (2) Marking a PTE dirty speculatively would require a write back.
> 
> … But this tradeoff should not affect whether a PTE is writable, i.e.,
> mapping the PTE as writable-clean should not cause a writeback. In other
> words, if you are concerned about unnecessary writebacks, which I think is a
> fair concern, then do not set the dirty-bit. In a later patch I try to avoid
> TLB flushes on clean-writable entries that are write-protected.
> 
> So I do not think that the writeback you mentioned should be a real issue.
> Yet if you think that using the fact that the page is not-dirty is a good
> heuristic to avoid future TLB flushes (for P->NP; as I said there is a
> solution for RW->RO), or if you are concerned about the cost of
> vm_normal_page(), perhaps those are valid concerns (although I do not think
> so).

I think I now understand what you mean. I somehow remembered that some
architectures set a PTE dirty when marking it writable, but I guess this
is not true -- and setting it writable will keep it !dirty until really
accessed (either by the HW or by a fault). [I'll do some more digging
just to confirm]

With that in mind, your patch makes sense, and I guess you'd want that
as patch #1, because otherwise I fail to see how current patch #1 would
even succeed in reaching the "pte_mkdirty" call -- if !pte_dirty()
protects the code from running.

> 
> --
> 
> [ Regarding (1): After some discussions with Peter and reading more code, I
> thought at some point that perhaps avoiding writable-clean PTEs as
> much as possible makes sense [*], since setting the dirty-bit costs ~550
> cycles and a page fault is not a lot more than 1000. But with all the
> mitigations (and after adding IBRS for retbleed) page-fault entry is kind of
> expensive. 

I understand the reasoning for anonymous memory, but not for page cache
pages. And for anonymous memory I think there are still cases where we
don't want to do that (swapcache and MADV_FREE come to mind) -- and
IMHO, in some of the scenarios where we have clean anonymous memory at all,
we just don't care about optimizing for setting the dirty bit faster on
x86 ;).

So my gut feeling is to keep it as simple as possible. To me, it
translates to setting the dirty bit only if we have clear indication
that write access is likely next.
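
On the caller side I'd assume that translates to only userfaultfd's
write-unprotect passing the hint -- presumably something along these
lines in mwriteprotect_range() (guessing at the patch, not quoting it):

	if (!enable_wp)
		cp_flags |= MM_CP_WILL_NEED;	/* caller expects a write soon */

with plain mprotect() never setting MM_CP_WILL_NEED.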

Thanks Nadav for refreshing my memory :) !

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-20 20:56                       ` Nadav Amit
@ 2022-07-21  7:52                         ` David Hildenbrand
  2022-07-21 14:10                           ` David Hildenbrand
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2022-07-21  7:52 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

>> Yes. Especially for any MAP_PRIVATE mappings.
>>
>> If you want to write to something that's not mapped writable in a
>> MAP_PRIVATE mapping it
>> a) Has to be an exclusive anonymous page
>> b) The pte has to be dirty
> 
> Do you need both conditions to be true? I thought (a) is sufficient (if
> the soft-dirty and related checks succeed).

If we force-write to a page, we need it to be dirty to tell reclaim code
that the content is stale. We can either mark the pte dirty manually, or
just let the write fault handler deal with it to simplify GUP code. This
needs some more thought, but that's my understanding.

> 
>>
>> In any other case, you clearly missed to COW or the modifications might
>> get lost if the PTE is not dirty.
>>
>> MAP_SHARED is a bit more involved. But whether the pte is dirty might be
>> good enough ... but this needs a lot more care.
>>
>>>> But yeah, it's all confusing so I might just be wrong regarding
>>>> pagecache pages.
>>>
>>> Just to note: I am not very courageous and I did not intend to change
>>> condition for when non-anonymous pages are set as writable. That’s the
>>> reason I did not change the dirty for non-writable non-anonymous entries (as
>>> Peter said). And that’s the reason that setting the dirty bit (at least as I
>>> should have done it) is only performed after we made the decision on the
>>> write-bit.
>>
>> Good. As long as we stick to anonymous pages I roughly know what we
>> can and cannot do at this point :)
>>
>>
>> The problem I see is that detection whether we can write requires the
>> dirty bit ... and whether to set the dirty bit requires the information
>> whether we can write.
>>
>> Again, for anonymous pages we should be able to relax the "dirty"
>> requirement when detecting whether we can write.
> 
> That’s all I wanted to do there.
> 
>>
>>> IOW, after you made your decision about the write-bit, then and only then
>>> you may be able to set the dirty bit for writable entries. Since the entry
>>> is already writeable (i.e., can be written without a fault later directly
>>> from userspace), there should be no concern of correctness when you set it.
>>
>> That makes sense to me. What keeps confusing me are architectures with
>> and without a hw-managed dirty bit ... :)
> 
> I don’t know which arch you have in your mind. But the moment a PTE is
> writable, then marking it logically/architecturally as dirty should be
> fine.
> 
> But… if the Exclusive check is not good enough for private+anon without
> the “logical” dirty bit, then there would be a problem. 

I think we are good for anonymous pages.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect
  2022-07-21  7:52                         ` David Hildenbrand
@ 2022-07-21 14:10                           ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2022-07-21 14:10 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Xu, Linux MM, LKML, Andrew Morton, Mike Rapoport,
	Axel Rasmussen, Andrea Arcangeli, Andrew Cooper, Andy Lutomirski,
	Dave Hansen, Peter Zijlstra, Thomas Gleixner, Will Deacon,
	Yu Zhao, Nick Piggin

On 21.07.22 09:52, David Hildenbrand wrote:
>>> Yes. Especially for any MAP_PRIVATE mappings.
>>>
>>> If you want to write to something that's not mapped writable in a
>>> MAP_PRIVATE mapping it
>>> a) Has to be an exclusive anonymous page
>>> b) The pte has to be dirty
>>
>> Do you need both conditions to be true? I thought (a) is sufficient (if
>> the soft-dirty and related checks succeed).
> 
> If we force-write to a page, we need it to be dirty to tell reclaim code
> that the content is stale. We can either mark the pte dirty manually, or
> just let the write fault handler deal with it to simplify GUP code. This
> needs some more thought, but that's my understanding.

Extending on my previous answer after staring at the code

a) I have to dig if the FOLL_FORCE special-retry-handling is required
for MAP_SHARED at all.

check_vma_flags() allows FOLL_FORCE only on MAP_PRIVATE VMAs that lack
VM_WRITE.
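
From memory, the relevant part of check_vma_flags() in mm/gup.c looks
roughly like this (trimmed):

	if (write) {
		if (!(vm_flags & VM_WRITE)) {
			if (!(gup_flags & FOLL_FORCE))
				return -EFAULT;
			/* FOLL_FORCE writes only make sense on COW mappings */
			if (!is_cow_mapping(vm_flags))
				return -EFAULT;
		}
	}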

Consequently, I would have assumed that the first write fault should be
sufficient on a MAP_SHARED VMA to actually map the PTE writable and not
require any of that special retry magic.


b) I wonder if we have to take care of uffd-wp and softdirty (just like
in mprotect code here) as well in case we stumble over an exclusive
anonymous page. Yes, the VMA is currently not writable, but I'd have
expected at least softdirty tracking to apply.

... I'll dig into the details.
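
For reference, the checks I have in mind are the ones from the mprotect
path here, roughly (from memory, modulo config gates and the exact
helpers):

	/* softdirty tracking: need a write fault unless already tracked */
	if (!(vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))
		return false;

	/* uffd-wp: don't map a wp-marked pte writable behind uffd's back */
	if (userfaultfd_pte_wp(vma, pte))
		return false;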

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-07-21 14:11 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-18 12:01 [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimi Nadav Amit
2022-07-18 12:01 ` [RFC PATCH 01/14] userfaultfd: set dirty and young on writeprotect Nadav Amit
2022-07-19 20:47   ` Peter Xu
2022-07-20  9:39     ` David Hildenbrand
2022-07-20 13:10       ` Peter Xu
2022-07-20 15:10         ` David Hildenbrand
2022-07-20 19:15           ` Peter Xu
2022-07-20 19:33             ` David Hildenbrand
2022-07-20 19:48               ` Peter Xu
2022-07-20 19:55                 ` David Hildenbrand
2022-07-20 20:22                   ` Nadav Amit
2022-07-20 20:38                     ` David Hildenbrand
2022-07-20 20:56                       ` Nadav Amit
2022-07-21  7:52                         ` David Hildenbrand
2022-07-21 14:10                           ` David Hildenbrand
2022-07-20  9:42   ` David Hildenbrand
2022-07-20 17:36     ` Nadav Amit
2022-07-20 18:00       ` David Hildenbrand
2022-07-20 18:09         ` Nadav Amit
2022-07-20 18:11           ` David Hildenbrand
2022-07-18 12:02 ` [RFC PATCH 02/14] userfaultfd: try to map write-unprotected pages Nadav Amit
2022-07-19 20:49   ` Peter Xu
2022-07-18 12:02 ` [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable Nadav Amit
2022-07-20 15:19   ` David Hildenbrand
2022-07-20 17:25     ` Nadav Amit
2022-07-21  7:45       ` David Hildenbrand
2022-07-18 12:02 ` [RFC PATCH 04/14] mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 05/14] x86/mm: check exec permissions on fault Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 07/14] mm: do fix spurious page-faults for instruction faults Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 09/14] mm: introduce relaxed TLB flushes Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 10/14] x86/mm: " Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict is needed Nadav Amit
2022-07-18 12:02 ` [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type Nadav Amit
