LKML Archive on lore.kernel.org
* [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async
@ 2019-05-31  6:36 Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead: each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, this patch-set performs remote TLB shootdowns of userspace
mappings asynchronously. It inlines the info that describes what to
flush with the SMP information (which also reduces cache coherency
traffic), and the initiator continues to run once the IPI has been
received, before the actual flush is performed. This is only possible if
page tables were not removed, as otherwise speculative page-walks might
lead to machine checks. Access to userspace is prevented from NMI
handlers and kprobes until the flush is actually carried out. Arguably,
this should be safe.

Finally, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on the
behavior of on_each_cpu() that functions were first executed remotely
and only then locally.

On my Haswell machine (bare-metal), running a TLB flush microbenchmark
(MADV_DONTNEED/touch for a single page on one thread in a loop, while the
others run a busy-wait loop), takes the following time (ns per iteration)
on the thread that invokes the TLB shootdown:

	n_threads	before		after
	---------	------		-----
	1		661		663
	2		1436		923 (-36%)
	4		1571		1070 (-32%)

Note that since the benchmark also causes page-faults, the speedup of
the TLB shootdowns themselves is actually greater. Also note the higher
performance improvement with 2 threads (a single remote TLB flush
target). This seems to be a side-effect of keeping the synchronization
data-structures (csd) off the stack, unlike the way it is currently done
(in smp_call_function_single()).

This microbenchmark only measures the time of the thread that invokes
the TLB shootdown. Running sysbench on dax w/emulated-pmem:

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Shows a performance improvement of 5% (607k vs 575k events).

Patch 1 does a small cleanup. Patches 2-5 implement the concurrent
execution of TLB flushes. Patches 6-8 deal with false-sharing and
unnecessary atomic operations. Patches 9-12 perform async TLB flushes
when possible and inline the data on IPIs.

RFC v1 -> RFC v2:
- Removed a patch which did not improve performance
- Patches 6-8: false-sharing and atomic operation optimizations
- Patches 9-12: asynchronous TLB flushes

Nadav Amit (12):
  smp: Remove smp_call_function() and on_each_cpu() return values
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Optimize local TLB flushes
  KVM: x86: Provide paravirtualized flush_tlb_multi()
  smp: Do not mark call_function_data as shared
  x86/tlb: Privatize cpu_tlbstate
  x86/apic: Use non-atomic operations when possible
  smp: Enable data inlining for inter-processor function call
  x86/mm/tlb: Use async and inline messages for flushing
  x86/mm/tlb: Reverting the removal of flush_tlb_info from stack

 arch/alpha/kernel/smp.c               |  19 +--
 arch/alpha/oprofile/common.c          |   6 +-
 arch/ia64/kernel/perfmon.c            |  12 +-
 arch/ia64/kernel/uncached.c           |   8 +-
 arch/x86/hyperv/mmu.c                 |   2 +
 arch/x86/include/asm/paravirt.h       |   8 +
 arch/x86/include/asm/paravirt_types.h |   6 +
 arch/x86/include/asm/tlbflush.h       |  58 ++++---
 arch/x86/kernel/apic/apic_flat_64.c   |   4 +-
 arch/x86/kernel/apic/x2apic_cluster.c |   2 +-
 arch/x86/kernel/kvm.c                 |  11 +-
 arch/x86/kernel/paravirt.c            |   3 +
 arch/x86/kernel/smp.c                 |   2 +-
 arch/x86/lib/cache-smp.c              |   3 +-
 arch/x86/mm/init.c                    |   2 +-
 arch/x86/mm/tlb.c                     | 204 +++++++++++++++++--------
 arch/x86/xen/mmu_pv.c                 |   2 +
 drivers/char/agp/generic.c            |   3 +-
 include/linux/smp.h                   |  45 ++++--
 kernel/smp.c                          | 210 ++++++++++++++++----------
 kernel/up.c                           |   3 +-
 21 files changed, 397 insertions(+), 216 deletions(-)

-- 
2.20.1



* [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, Tony Luck, Fenghua Yu, Andrew Morton

The return value is fixed. Remove it and amend the callers.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/alpha/kernel/smp.c      | 19 +++++--------------
 arch/alpha/oprofile/common.c |  6 +++---
 arch/ia64/kernel/perfmon.c   | 12 ++----------
 arch/ia64/kernel/uncached.c  |  8 ++++----
 arch/x86/lib/cache-smp.c     |  3 ++-
 drivers/char/agp/generic.c   |  3 +--
 include/linux/smp.h          |  7 +++----
 kernel/smp.c                 | 10 +++-------
 kernel/up.c                  |  3 +--
 9 files changed, 24 insertions(+), 47 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
index d0dccae53ba9..5f90df30be20 100644
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -614,8 +614,7 @@ void
 smp_imb(void)
 {
 	/* Must wait other processors to flush their icache before continue. */
-	if (on_each_cpu(ipi_imb, NULL, 1))
-		printk(KERN_CRIT "smp_imb: timed out\n");
+	on_each_cpu(ipi_imb, NULL, 1);
 }
 EXPORT_SYMBOL(smp_imb);
 
@@ -630,9 +629,7 @@ flush_tlb_all(void)
 {
 	/* Although we don't have any data to pass, we do want to
 	   synchronize with the other processors.  */
-	if (on_each_cpu(ipi_flush_tlb_all, NULL, 1)) {
-		printk(KERN_CRIT "flush_tlb_all: timed out\n");
-	}
+	on_each_cpu(ipi_flush_tlb_all, NULL, 1);
 }
 
 #define asn_locked() (cpu_data[smp_processor_id()].asn_lock)
@@ -667,9 +664,7 @@ flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 
-	if (smp_call_function(ipi_flush_tlb_mm, mm, 1)) {
-		printk(KERN_CRIT "flush_tlb_mm: timed out\n");
-	}
+	smp_call_function(ipi_flush_tlb_mm, mm, 1);
 
 	preempt_enable();
 }
@@ -720,9 +715,7 @@ flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
 	data.mm = mm;
 	data.addr = addr;
 
-	if (smp_call_function(ipi_flush_tlb_page, &data, 1)) {
-		printk(KERN_CRIT "flush_tlb_page: timed out\n");
-	}
+	smp_call_function(ipi_flush_tlb_page, &data, 1);
 
 	preempt_enable();
 }
@@ -772,9 +765,7 @@ flush_icache_user_range(struct vm_area_struct *vma, struct page *page,
 		}
 	}
 
-	if (smp_call_function(ipi_flush_icache_page, mm, 1)) {
-		printk(KERN_CRIT "flush_icache_page: timed out\n");
-	}
+	smp_call_function(ipi_flush_icache_page, mm, 1);
 
 	preempt_enable();
 }
diff --git a/arch/alpha/oprofile/common.c b/arch/alpha/oprofile/common.c
index 310a4ce1dccc..1b1259c7d7d1 100644
--- a/arch/alpha/oprofile/common.c
+++ b/arch/alpha/oprofile/common.c
@@ -65,7 +65,7 @@ op_axp_setup(void)
 	model->reg_setup(&reg, ctr, &sys);
 
 	/* Configure the registers on all cpus.  */
-	(void)smp_call_function(model->cpu_setup, &reg, 1);
+	smp_call_function(model->cpu_setup, &reg, 1);
 	model->cpu_setup(&reg);
 	return 0;
 }
@@ -86,7 +86,7 @@ op_axp_cpu_start(void *dummy)
 static int
 op_axp_start(void)
 {
-	(void)smp_call_function(op_axp_cpu_start, NULL, 1);
+	smp_call_function(op_axp_cpu_start, NULL, 1);
 	op_axp_cpu_start(NULL);
 	return 0;
 }
@@ -101,7 +101,7 @@ op_axp_cpu_stop(void *dummy)
 static void
 op_axp_stop(void)
 {
-	(void)smp_call_function(op_axp_cpu_stop, NULL, 1);
+	smp_call_function(op_axp_cpu_stop, NULL, 1);
 	op_axp_cpu_stop(NULL);
 }
 
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 7a969f4c3534..6cf7db1daba2 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -6389,11 +6389,7 @@ pfm_install_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 	}
 
 	/* save the current system wide pmu states */
-	ret = on_each_cpu(pfm_alt_save_pmu_state, NULL, 1);
-	if (ret) {
-		DPRINT(("on_each_cpu() failed: %d\n", ret));
-		goto cleanup_reserve;
-	}
+	on_each_cpu(pfm_alt_save_pmu_state, NULL, 1);
 
 	/* officially change to the alternate interrupt handler */
 	pfm_alt_intr_handler = hdl;
@@ -6420,7 +6416,6 @@ int
 pfm_remove_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 {
 	int i;
-	int ret;
 
 	if (hdl == NULL) return -EINVAL;
 
@@ -6434,10 +6429,7 @@ pfm_remove_alt_pmu_interrupt(pfm_intr_handler_desc_t *hdl)
 
 	pfm_alt_intr_handler = NULL;
 
-	ret = on_each_cpu(pfm_alt_restore_pmu_state, NULL, 1);
-	if (ret) {
-		DPRINT(("on_each_cpu() failed: %d\n", ret));
-	}
+	on_each_cpu(pfm_alt_restore_pmu_state, NULL, 1);
 
 	for_each_online_cpu(i) {
 		pfm_unreserve_session(NULL, 1, i);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 583f7ff6b589..c618d0745e22 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -124,8 +124,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
 	if (status == PAL_VISIBILITY_OK_REMOTE_NEEDED) {
 		atomic_set(&uc_pool->status, 0);
-		status = smp_call_function(uncached_ipi_visibility, uc_pool, 1);
-		if (status || atomic_read(&uc_pool->status))
+		smp_call_function(uncached_ipi_visibility, uc_pool, 1);
+		if (atomic_read(&uc_pool->status))
 			goto failed;
 	} else if (status != PAL_VISIBILITY_OK)
 		goto failed;
@@ -146,8 +146,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	if (status != PAL_STATUS_SUCCESS)
 		goto failed;
 	atomic_set(&uc_pool->status, 0);
-	status = smp_call_function(uncached_ipi_mc_drain, uc_pool, 1);
-	if (status || atomic_read(&uc_pool->status))
+	smp_call_function(uncached_ipi_mc_drain, uc_pool, 1);
+	if (atomic_read(&uc_pool->status))
 		goto failed;
 
 	/*
diff --git a/arch/x86/lib/cache-smp.c b/arch/x86/lib/cache-smp.c
index 1811fa4a1b1a..7c48ff4ae8d1 100644
--- a/arch/x86/lib/cache-smp.c
+++ b/arch/x86/lib/cache-smp.c
@@ -15,6 +15,7 @@ EXPORT_SYMBOL(wbinvd_on_cpu);
 
 int wbinvd_on_all_cpus(void)
 {
-	return on_each_cpu(__wbinvd, NULL, 1);
+	on_each_cpu(__wbinvd, NULL, 1);
+	return 0;
 }
 EXPORT_SYMBOL(wbinvd_on_all_cpus);
diff --git a/drivers/char/agp/generic.c b/drivers/char/agp/generic.c
index 658664a5a5aa..df1edb5ec0ad 100644
--- a/drivers/char/agp/generic.c
+++ b/drivers/char/agp/generic.c
@@ -1311,8 +1311,7 @@ static void ipi_handler(void *null)
 
 void global_cache_flush(void)
 {
-	if (on_each_cpu(ipi_handler, NULL, 1) != 0)
-		panic(PFX "timed out waiting for the other CPUs!\n");
+	on_each_cpu(ipi_handler, NULL, 1);
 }
 EXPORT_SYMBOL(global_cache_flush);
 
diff --git a/include/linux/smp.h b/include/linux/smp.h
index a56f08ff3097..bb8b451ab01f 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -35,7 +35,7 @@ int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 /*
  * Call a function on all processors
  */
-int on_each_cpu(smp_call_func_t func, void *info, int wait);
+void on_each_cpu(smp_call_func_t func, void *info, int wait);
 
 /*
  * Call a function on processors specified by mask, which might include
@@ -101,7 +101,7 @@ extern void smp_cpus_done(unsigned int max_cpus);
 /*
  * Call a function on all other processors
  */
-int smp_call_function(smp_call_func_t func, void *info, int wait);
+void smp_call_function(smp_call_func_t func, void *info, int wait);
 void smp_call_function_many(const struct cpumask *mask,
 			    smp_call_func_t func, void *info, bool wait);
 
@@ -144,9 +144,8 @@ static inline void smp_send_stop(void) { }
  *	These macros fold the SMP functionality into a single CPU system
  */
 #define raw_smp_processor_id()			0
-static inline int up_smp_call_function(smp_call_func_t func, void *info)
+static inline void up_smp_call_function(smp_call_func_t func, void *info)
 {
-	return 0;
 }
 #define smp_call_function(func, info, wait) \
 			(up_smp_call_function(func, info))
diff --git a/kernel/smp.c b/kernel/smp.c
index f4cf1b0bb3b8..e0b05ac53108 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -486,13 +486,11 @@ EXPORT_SYMBOL(smp_call_function_many);
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler.
  */
-int smp_call_function(smp_call_func_t func, void *info, int wait)
+void smp_call_function(smp_call_func_t func, void *info, int wait)
 {
 	preempt_disable();
 	smp_call_function_many(cpu_online_mask, func, info, wait);
 	preempt_enable();
-
-	return 0;
 }
 EXPORT_SYMBOL(smp_call_function);
 
@@ -593,18 +591,16 @@ void __init smp_init(void)
  * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
  * of local_irq_disable/enable().
  */
-int on_each_cpu(void (*func) (void *info), void *info, int wait)
+void on_each_cpu(void (*func) (void *info), void *info, int wait)
 {
 	unsigned long flags;
-	int ret = 0;
 
 	preempt_disable();
-	ret = smp_call_function(func, info, wait);
+	smp_call_function(func, info, wait);
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
 	preempt_enable();
-	return ret;
 }
 EXPORT_SYMBOL(on_each_cpu);
 
diff --git a/kernel/up.c b/kernel/up.c
index ff536f9cc8a2..378551e79894 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -34,14 +34,13 @@ int smp_call_function_single_async(int cpu, call_single_data_t *csd)
 }
 EXPORT_SYMBOL(smp_call_function_single_async);
 
-int on_each_cpu(smp_call_func_t func, void *info, int wait)
+void on_each_cpu(smp_call_func_t func, void *info, int wait)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
 	func(info);
 	local_irq_restore(flags);
-	return 0;
 }
 EXPORT_SYMBOL(on_each_cpu);
 
-- 
2.20.1



* [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many()
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen, Rik van Riel,
	Josh Poimboeuf

Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then is it executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, introduce __smp_call_function_many(), which allows the callers
to provide local and remote functions that should be executed, and runs
them concurrently. Keep the semantics of smp_call_function_many() as they
are today for backward compatibility: in this case the called function is
not executed locally.

__smp_call_function_many() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For a
synchronous function call, smp_call_function_single() keeps the
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.

Adding support to run the functions concurrently required removing a
micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
of saving/restoring them. The benefit of running the local and remote
code concurrently is expected to be greater.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/smp.h |  27 ++++++---
 kernel/smp.c        | 133 +++++++++++++++++++++-----------------------
 2 files changed, 83 insertions(+), 77 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index bb8b451ab01f..b69abc88259d 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 			     int wait);
 
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 		void *info, bool wait);
 
+/*
+ * Call a function on all processors.  May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+	on_each_cpu_mask(cpu_online_mask, func, info, wait);
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait);
+
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+				smp_call_func_t func, void *info, bool wait)
+{
+	__smp_call_function_many(mask, func, NULL, info, wait);
+}
 
 int smp_call_function_any(const struct cpumask *mask,
 			  smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index e0b05ac53108..6b411ee86ef6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -387,9 +387,13 @@ int smp_call_function_any(const struct cpumask *mask,
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
+ * @remote_func: The function to run on remote cores. This must be fast and
+ *		 non-blocking.
+ * @local_func: The function that should be run on this CPU. This must be
+ *		fast and non-blocking. If NULL is provided, no function will
+ *		be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
  * @wait: If true, wait (atomically) until function has completed
  *        on other CPUs.
@@ -400,11 +404,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.
  */
-void smp_call_function_many(const struct cpumask *mask,
-			    smp_call_func_t func, void *info, bool wait)
+void __smp_call_function_many(const struct cpumask *mask,
+			      smp_call_func_t remote_func,
+			      smp_call_func_t local_func,
+			      void *info, bool wait)
 {
+	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
-	int cpu, next_cpu, this_cpu = smp_processor_id();
+	bool run_remote = false;
+	bool run_local = false;
+	int nr_cpus = 0;
 
 	/*
 	 * Can deadlock when called with interrupts disabled.
@@ -412,55 +421,62 @@ void smp_call_function_many(const struct cpumask *mask,
 	 * send smp call function interrupt to this cpu and as such deadlocks
 	 * can't happen.
 	 */
-	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
-		     && !oops_in_progress && !early_boot_irqs_disabled);
+	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
+		lockdep_assert_irqs_enabled();
+
+	/* Check if we need local execution. */
+	if (local_func && cpumask_test_cpu(this_cpu, mask))
+		run_local = true;
 
-	/* Try to fastpath.  So, what's a CPU they want? Ignoring this one. */
+	/* Check if we need remote execution, i.e., any CPU excluding this one. */
 	cpu = cpumask_first_and(mask, cpu_online_mask);
 	if (cpu == this_cpu)
 		cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
+	if (cpu < nr_cpu_ids)
+		run_remote = true;
 
-	/* No online cpus?  We're done. */
-	if (cpu >= nr_cpu_ids)
-		return;
+	if (run_remote) {
+		cfd = this_cpu_ptr(&cfd_data);
 
-	/* Do we have another CPU which isn't us? */
-	next_cpu = cpumask_next_and(cpu, mask, cpu_online_mask);
-	if (next_cpu == this_cpu)
-		next_cpu = cpumask_next_and(next_cpu, mask, cpu_online_mask);
-
-	/* Fastpath: do that cpu by itself. */
-	if (next_cpu >= nr_cpu_ids) {
-		smp_call_function_single(cpu, func, info, wait);
-		return;
-	}
+		cpumask_and(cfd->cpumask, mask, cpu_online_mask);
+		__cpumask_clear_cpu(this_cpu, cfd->cpumask);
 
-	cfd = this_cpu_ptr(&cfd_data);
-
-	cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-	__cpumask_clear_cpu(this_cpu, cfd->cpumask);
+		cpumask_clear(cfd->cpumask_ipi);
+		for_each_cpu(cpu, cfd->cpumask) {
+			call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+
+			nr_cpus++;
+			last_cpu = cpu;
+
+			csd_lock(csd);
+			if (wait)
+				csd->flags |= CSD_FLAG_SYNCHRONOUS;
+			csd->func = remote_func;
+			csd->info = info;
+			if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
+				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		}
 
-	/* Some callers race with other cpus changing the passed mask */
-	if (unlikely(!cpumask_weight(cfd->cpumask)))
-		return;
+		/*
+		 * Choose the most efficient way to send an IPI. Note that the
+		 * number of CPUs might be zero due to concurrent changes to the
+		 * provided mask.
+		 */
+		if (nr_cpus == 1)
+			arch_send_call_function_single_ipi(last_cpu);
+		else if (likely(nr_cpus > 1))
+			arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
+	}
 
-	cpumask_clear(cfd->cpumask_ipi);
-	for_each_cpu(cpu, cfd->cpumask) {
-		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
+	if (run_local) {
+		unsigned long flags;
 
-		csd_lock(csd);
-		if (wait)
-			csd->flags |= CSD_FLAG_SYNCHRONOUS;
-		csd->func = func;
-		csd->info = info;
-		if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
-			__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
+		local_irq_save(flags);
+		local_func(info);
+		local_irq_restore(flags);
 	}
 
-	/* Send a message to all CPUs in the map */
-	arch_send_call_function_ipi_mask(cfd->cpumask_ipi);
-
-	if (wait) {
+	if (run_remote && wait) {
 		for_each_cpu(cpu, cfd->cpumask) {
 			call_single_data_t *csd;
 
@@ -469,7 +485,7 @@ void smp_call_function_many(const struct cpumask *mask,
 		}
 	}
 }
-EXPORT_SYMBOL(smp_call_function_many);
+EXPORT_SYMBOL(__smp_call_function_many);
 
 /**
  * smp_call_function(): Run a function on all other CPUs.
@@ -586,24 +602,6 @@ void __init smp_init(void)
 	smp_cpus_done(setup_max_cpus);
 }
 
-/*
- * Call a function on all processors.  May be used during early boot while
- * early_boot_irqs_disabled is set.  Use local_irq_save/restore() instead
- * of local_irq_disable/enable().
- */
-void on_each_cpu(void (*func) (void *info), void *info, int wait)
-{
-	unsigned long flags;
-
-	preempt_disable();
-	smp_call_function(func, info, wait);
-	local_irq_save(flags);
-	func(info);
-	local_irq_restore(flags);
-	preempt_enable();
-}
-EXPORT_SYMBOL(on_each_cpu);
-
 /**
  * on_each_cpu_mask(): Run a function on processors specified by
  * cpumask, which may include the local processor.
@@ -623,16 +621,11 @@ EXPORT_SYMBOL(on_each_cpu);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
 			void *info, bool wait)
 {
-	int cpu = get_cpu();
+	preempt_disable();
 
-	smp_call_function_many(mask, func, info, wait);
-	if (cpumask_test_cpu(cpu, mask)) {
-		unsigned long flags;
-		local_irq_save(flags);
-		func(info);
-		local_irq_restore(flags);
-	}
-	put_cpu();
+	__smp_call_function_many(mask, func, func, info, wait);
+
+	preempt_enable();
 }
 EXPORT_SYMBOL(on_each_cpu_mask);
 
-- 
2.20.1



* [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus()
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

arch_tlbbatch_flush() and flush_tlb_mm_range() have very similar code
that is effectively the same. Extract the common code into a new
function, flush_tlb_on_cpus().

There is one functional change, which should not affect correctness:
flush_tlb_mm_range() compared loaded_mm with the given mm to figure out
whether a local flush is needed. Instead, the common code looks at
mm_cpumask(), which should give the same result. Performance should not
be affected, since this cpumask should not change at such a frequency
that it would introduce cache contention.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 62 ++++++++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 7f61431c75fb..ac98ad76f695 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -733,7 +733,11 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 			unsigned int stride_shift, bool freed_tables,
 			u64 new_tlb_gen)
 {
-	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
+	struct flush_tlb_info *info;
+
+	preempt_disable();
+
+	info = this_cpu_ptr(&flush_tlb_info);
 
 #ifdef CONFIG_DEBUG_VM
 	/*
@@ -761,6 +765,23 @@ static inline void put_flush_tlb_info(void)
 	barrier();
 	this_cpu_dec(flush_tlb_info_idx);
 #endif
+	preempt_enable();
+}
+
+static void flush_tlb_on_cpus(const cpumask_t *cpumask,
+			      const struct flush_tlb_info *info)
+{
+	int this_cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(this_cpu, cpumask)) {
+		lockdep_assert_irqs_enabled();
+		local_irq_disable();
+		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		local_irq_enable();
+	}
+
+	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+		flush_tlb_others(cpumask, info);
 }
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
@@ -769,9 +790,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 {
 	struct flush_tlb_info *info;
 	u64 new_tlb_gen;
-	int cpu;
-
-	cpu = get_cpu();
 
 	/* Should we flush just the requested range? */
 	if ((end == TLB_FLUSH_ALL) ||
@@ -786,18 +804,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
 				  new_tlb_gen);
 
-	if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
-		lockdep_assert_irqs_enabled();
-		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
-		local_irq_enable();
-	}
+	/*
+	 * Assert that mm_cpumask() corresponds with the loaded mm. We got one
+	 * exception: for init_mm we do not need to flush anything, and the
+	 * cpumask does not correspond with loaded_mm.
+	 */
+	VM_WARN_ON_ONCE(cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm)) !=
+			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)) ||
+			mm == &init_mm);
 
-	if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
-		flush_tlb_others(mm_cpumask(mm), info);
+	flush_tlb_on_cpus(mm_cpumask(mm), info);
 
 	put_flush_tlb_info();
-	put_cpu();
 }
 
 
@@ -832,13 +850,11 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	} else {
 		struct flush_tlb_info *info;
 
-		preempt_disable();
 		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
 
 		on_each_cpu(do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
-		preempt_enable();
 	}
 }
 
@@ -856,21 +872,11 @@ static const struct flush_tlb_info full_flush_tlb_info = {
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	int cpu = get_cpu();
-
-	if (cpumask_test_cpu(cpu, &batch->cpumask)) {
-		lockdep_assert_irqs_enabled();
-		local_irq_disable();
-		flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
-		local_irq_enable();
-	}
-
-	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
-		flush_tlb_others(&batch->cpumask, &full_flush_tlb_info);
+	preempt_disable();
+	flush_tlb_on_cpus(&batch->cpumask, &full_flush_tlb_info);
+	preempt_enable();
 
 	cpumask_clear(&batch->cpumask);
-
-	put_cpu();
 }
 
 static ssize_t tlbflush_read_file(struct file *file, char __user *user_buf,
-- 
2.20.1



* [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (2 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 11:48   ` Juergen Gross
  2019-05-31  6:36 ` [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, K. Y. Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Sasha Levin, Juergen Gross, Paolo Bonzini,
	Dave Hansen, Boris Ostrovsky, linux-hyperv, virtualization, kvm,
	xen-devel

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are therefore not yet
performed concurrently.

Add a static key to tell whether this new interface is supported.
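
The resulting dispatch can be sketched in plain userspace C; a bool stands
in for the static key, and the flush functions are hypothetical stand-ins
that merely count invocations rather than touching any TLB:

```c
#include <assert.h>
#include <stdbool.h>

static int multi_flushes, local_flushes, remote_flushes;

/* Hypothetical stand-ins for the real flush routines. */
static void flush_tlb_multi(void)  { multi_flushes++; }	/* local + remote, concurrent */
static void flush_tlb_local(void)  { local_flushes++; }
static void flush_tlb_others(void) { remote_flushes++; }

/* Mirrors the shape of the dispatch: prefer the concurrent interface
 * when the (static-key) flag is set, otherwise fall back to the
 * sequential local-then-remote path.  Returns true when the concurrent
 * path was taken, for easy inspection. */
static bool flush_tlb_on_cpus(bool multi_enabled)
{
	if (multi_enabled) {
		flush_tlb_multi();
		return true;
	}
	flush_tlb_local();
	flush_tlb_others();
	return false;
}
```

Paravirtual backends that only provide flush_tlb_others() disable the key
at boot, selecting the sequential fallback.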

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: linux-hyperv@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: kvm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/hyperv/mmu.c                 |  2 +
 arch/x86/include/asm/paravirt.h       |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h       |  6 ++
 arch/x86/kernel/kvm.c                 |  1 +
 arch/x86/kernel/paravirt.c            |  3 +
 arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
 arch/x86/xen/mmu_pv.c                 |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
 	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+	static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
 	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+				   const struct flush_tlb_info *info)
+{
+	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
 				    const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
 	void (*flush_tlb_user)(void);
 	void (*flush_tlb_kernel)(void);
 	void (*flush_tlb_one_user)(unsigned long addr);
+	/*
+	 * flush_tlb_multi() is the preferred interface. When it is used,
+	 * flush_tlb_others() should return false.
+	 */
+	void (*flush_tlb_multi)(const struct cpumask *cpus,
+				const struct flush_tlb_info *info);
 	void (*flush_tlb_others)(const struct cpumask *cpus,
 				 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)	\
+	native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)	\
 	native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5492a669f658..1314f89304a8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -171,6 +171,8 @@ unsigned paravirt_patch_insns(void *insn_buff, unsigned len,
 	return insn_len;
 }
 
+DEFINE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static void native_flush_tlb(void)
 {
 	__native_flush_tlb();
@@ -375,6 +377,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_user	= native_flush_tlb,
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
+	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
 	.mmu.flush_tlb_others	= native_flush_tlb_others,
 	.mmu.tlb_remove_table	=
 			(void (*)(struct mmu_gather *, void *))tlb_remove_page,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ac98ad76f695..73d0d51b0f61 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -550,7 +550,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 		 * garbage into our TLB.  Since switching to init_mm is barely
 		 * slower than a minimal flush, just switch to init_mm.
 		 *
-		 * This should be rare, with native_flush_tlb_others skipping
+		 * This should be rare, with native_flush_tlb_multi skipping
 		 * IPIs to lazy TLB mode CPUs.
 		 */
 		switch_mm_irqs_off(NULL, &init_mm, NULL);
@@ -634,9 +634,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason)
+static void flush_tlb_func_local(void *info)
 {
 	const struct flush_tlb_info *f = info;
+	enum tlb_flush_reason reason;
+
+	reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
 
 	flush_tlb_func_common(f, true, reason);
 }
@@ -654,14 +657,30 @@ static void flush_tlb_func_remote(void *info)
 	flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static inline bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
-void native_flush_tlb_others(const struct cpumask *cpumask,
-			     const struct flush_tlb_info *info)
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+			    const struct flush_tlb_info *info)
 {
+	/*
+	 * native_flush_tlb_multi() can handle a single CPU, but it is
+	 * suboptimal if only the local TLB should be flushed, and therefore
+	 * should not be used in that case. Warn if it is used in such a
+	 * case, and rely on this assumption for the tracing and accounting
+	 * of remote TLB flushes.
+	 */
+	VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));
+
+	/*
+	 * Do accounting and tracing. Note that there are (and have always been)
+	 * cases in which a remote TLB flush will be traced, but eventually
+	 * would not happen.
+	 */
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
 	if (info->end == TLB_FLUSH_ALL)
 		trace_tlb_flush(TLB_REMOTE_SEND_IPI, TLB_FLUSH_ALL);
@@ -681,10 +700,14 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 		 * means that the percpu tlb_gen variables won't be updated
 		 * and we'll do pointless flushes on future context switches.
 		 *
-		 * Rather than hooking native_flush_tlb_others() here, I think
+		 * Rather than hooking native_flush_tlb_multi() here, I think
 		 * that UV should be updated so that smp_call_function_many(),
 		 * etc, are optimal on UV.
 		 */
+		local_irq_disable();
+		flush_tlb_func_local((__force void *)info);
+		local_irq_enable();
+
 		cpumask = uv_flush_tlb_others(cpumask, info);
 		if (cpumask)
 			smp_call_function_many(cpumask, flush_tlb_func_remote,
@@ -703,11 +726,39 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * doing a speculative memory access.
 	 */
 	if (info->freed_tables)
-		smp_call_function_many(cpumask, flush_tlb_func_remote,
-			       (void *)info, 1);
-	else
-		on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
-				(void *)info, 1, GFP_ATOMIC, cpumask);
+		__smp_call_function_many(cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	else {
+		/*
+		 * Although we could have used on_each_cpu_cond_mask(),
+		 * open-coding it has several performance advantages: (1) we can
+		 * use specialized functions for remote and local flushes; (2)
+		 * no need for indirect branch to test if TLB is lazy; (3) we
+		 * can use a designated cpumask for evaluating the condition
+		 * instead of allocating a new one.
+		 *
+		 * This works under the assumption that there are no nested TLB
+		 * flushes, an assumption that is already made in
+		 * flush_tlb_mm_range().
+		 */
+		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+		int cpu;
+
+		cpumask_clear(cond_cpumask);
+
+		for_each_cpu(cpu, cpumask) {
+			if (tlb_is_not_lazy(cpu))
+				__cpumask_set_cpu(cpu, cond_cpumask);
+		}
+		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+					 flush_tlb_func_local, (void *)info, 1);
+	}
+}
+
+void native_flush_tlb_others(const struct cpumask *cpumask,
+			     const struct flush_tlb_info *info)
+{
+	native_flush_tlb_multi(cpumask, info);
 }
 
 /*
@@ -773,10 +824,15 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 {
 	int this_cpu = smp_processor_id();
 
+	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+		flush_tlb_multi(cpumask, info);
+		return;
+	}
+
 	if (cpumask_test_cpu(this_cpu, cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+		flush_tlb_func_local((__force void *)info);
 		local_irq_enable();
 	}
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index beb44e22afdf..0cb277848cb4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2474,6 +2474,8 @@ void __init xen_init_mmu_ops(void)
 
 	pv_ops.mmu = xen_mmu_ops;
 
+	static_key_disable(&flush_tlb_multi_enabled.key);
+
 	memset(dummy_mapping, 0xff, PAGE_SIZE);
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (3 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi() Nadav Amit
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, making
local TLB flushes slower than they were before the recent changes.

Before calling the SMP infrastructure, check whether only a local TLB
flush is needed, to restore the lost performance in this common case.
This requires checking mm_cpumask() one more time, but unless this mask
is updated very frequently, it should not impact performance negatively.
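
The added check boils down to asking whether the cpumask contains any CPU
other than the current one, which is what cpumask_any_but() answers in
the kernel. A minimal sketch of that predicate, assuming the mask fits in
a single 64-bit word:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the cpumask_any_but() check: does the mask contain any
 * CPU besides this_cpu?  The mask is assumed to fit in 64 bits. */
static bool mask_has_other_cpu(uint64_t mask, int this_cpu)
{
	return (mask & ~(UINT64_C(1) << this_cpu)) != 0;
}
```

Only when the predicate is true does flush_tlb_on_cpus() need the
flush_tlb_multi()/flush_tlb_others() machinery; otherwise a local flush
with interrupts disabled suffices.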

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 73d0d51b0f61..b0c3065aad5d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -823,8 +823,12 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 			      const struct flush_tlb_info *info)
 {
 	int this_cpu = smp_processor_id();
+	bool flush_others = false;
 
-	if (static_branch_likely(&flush_tlb_multi_enabled)) {
+	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+		flush_others = true;
+
+	if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
 		flush_tlb_multi(cpumask, info);
 		return;
 	}
@@ -836,7 +840,7 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 		local_irq_enable();
 	}
 
-	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
+	if (flush_others)
 		flush_tlb_others(cpumask, info);
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi()
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (4 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Paolo Bonzini, kvm

Support the new flush_tlb_multi() interface, which also flushes the
local CPU's TLB, instead of flush_tlb_others(), which does not. The new
interface is more performant since it parallelizes remote and local TLB
flushes.

The actual implementation of flush_tlb_multi() is almost identical to
that of flush_tlb_others().
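
The pruning that the KVM flush performs can be sketched in userspace C
under stated assumptions: a 64-bit mask, and boolean stand-ins for the
steal-time "preempted" state and the deferred flush request:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VCPUS 8

/* Hypothetical per-vCPU state modelled on KVM's steal-time protocol;
 * for demonstration, vCPU 2 starts out preempted. */
static bool vcpu_preempted[NR_VCPUS] = { [2] = true };
static bool vcpu_flush_on_enter[NR_VCPUS];

/* Prune preempted vCPUs from the flush mask: they get a deferred
 * flush-on-re-entry request instead of a pointless IPI.  Returns the
 * pruned mask, which would then be handed to the native flush. */
static uint64_t prune_preempted(uint64_t flushmask)
{
	for (int cpu = 0; cpu < NR_VCPUS; cpu++) {
		if (!(flushmask & (UINT64_C(1) << cpu)))
			continue;
		if (vcpu_preempted[cpu]) {
			vcpu_flush_on_enter[cpu] = true;
			flushmask &= ~(UINT64_C(1) << cpu);
		}
	}
	return flushmask;
}
```

With flush_tlb_multi() the local vCPU stays in the mask: as the comment
in the patch notes, it cannot be preempted while executing this code, so
it is never pruned.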

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/kvm.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index c1c2b88ea3f1..86b8267166e6 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -593,7 +593,7 @@ static void __init kvm_apf_trap_init(void)
 
 static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
 
-static void kvm_flush_tlb_others(const struct cpumask *cpumask,
+static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 			const struct flush_tlb_info *info)
 {
 	u8 state;
@@ -607,6 +607,11 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 	 * queue flush_on_enter for pre-empted vCPUs
 	 */
 	for_each_cpu(cpu, flushmask) {
+		/*
+		 * The local vCPU is never preempted, so we do not explicitly
+		 * skip check for local vCPU - it will never be cleared from
+		 * flushmask.
+		 */
 		src = &per_cpu(steal_time, cpu);
 		state = READ_ONCE(src->preempted);
 		if ((state & KVM_VCPU_PREEMPTED)) {
@@ -616,7 +621,7 @@ static void kvm_flush_tlb_others(const struct cpumask *cpumask,
 		}
 	}
 
-	native_flush_tlb_others(flushmask, info);
+	native_flush_tlb_multi(flushmask, info);
 }
 
 static void __init kvm_guest_init(void)
@@ -641,9 +646,8 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
 	    !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
 	    kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-		pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
+		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
 		pv_ops.mmu.tlb_remove_table = tlb_remove_table;
-		static_key_disable(&flush_tlb_multi_enabled.key);
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (5 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi() Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 10:17   ` Peter Zijlstra
  2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Rik van Riel

cfd_data is marked as shared, but although it holds pointers to shared
data structures, the structure itself is private to each core.
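
The cost being removed can be illustrated with a userspace sketch; the
64-byte line size and the reduction to three pointers are assumptions
mirroring call_function_data:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed cache-line size */

/* Plain layout, like DEFINE_PER_CPU(): packed back to back with
 * whatever else the per-CPU allocator places around it. */
struct cfd_plain {
	void *csd;
	void *cpumask;
	void *cpumask_ipi;
};

/* The DEFINE_PER_CPU_SHARED_ALIGNED() analogue: padded out to a full
 * cache line, which only pays off if other CPUs write the structure. */
struct cfd_shared {
	alignas(CACHELINE) void *csd;
	void *cpumask;
	void *cpumask_ipi;
};
```

Since cfd_data is only ever written by its owning CPU, the padding buys
nothing here, which is what the patch removes.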

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 kernel/smp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 6b411ee86ef6..f1a358f9c34c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -33,7 +33,7 @@ struct call_function_data {
 	cpumask_var_t		cpumask_ipi;
 };
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
+static DEFINE_PER_CPU(struct call_function_data, cfd_data);
 
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (6 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 18:48   ` Andy Lutomirski
  2019-05-31  6:36 ` [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible Nadav Amit
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.
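
A minimal userspace sketch of the split, assuming a 64-byte cacheline and
reducing each structure to a representative field or two:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>

#define NR_CPUS   4
#define CACHELINE 64	/* assumed cache-line size */

/* Mostly CPU-private state (a representative subset): written and read
 * only by its owning CPU, so it needs no cache-line padding. */
struct tlb_state {
	unsigned short loaded_mm_asid;
	unsigned short next_asid;
};

/* The one field that remote CPUs read when deciding whom to IPI lives
 * in its own line-aligned structure, so TLB flushes do not false-share
 * with the private state. */
struct tlb_state_shared {
	alignas(CACHELINE) bool is_lazy;
};

/* Per-"CPU" instances; CPU 1 starts lazy for demonstration. */
static struct tlb_state_shared cpu_tlbstate_shared[NR_CPUS] = {
	[1] = { .is_lazy = true },
};

/* Mirrors tlb_is_not_lazy(): remote readers touch only the shared part. */
static bool tlb_is_not_lazy(int cpu)
{
	return !cpu_tlbstate_shared[cpu].is_lazy;
}
```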

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 40 ++++++++++++++++++---------------
 arch/x86/mm/init.c              |  2 +-
 arch/x86/mm/tlb.c               | 15 ++++++++-----
 3 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 79272938cf79..a1fea36d5292 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -178,23 +178,6 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
 
-	/*
-	 * We can be in one of several states:
-	 *
-	 *  - Actively using an mm.  Our CPU's bit will be set in
-	 *    mm_cpumask(loaded_mm) and is_lazy == false;
-	 *
-	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
-	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
-	 *
-	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
-	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
-	 *    We're heuristically guessing that the CR3 load we
-	 *    skipped more than makes up for the overhead added by
-	 *    lazy mode.
-	 */
-	bool is_lazy;
-
 	/*
 	 * If set we changed the page tables in such a way that we
 	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -240,7 +223,27 @@ struct tlb_state {
 	 */
 	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+	/*
+	 * We can be in one of several states:
+	 *
+	 *  - Actively using an mm.  Our CPU's bit will be set in
+	 *    mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+	 *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+	 *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
+	 *    We're heuristically guessing that the CR3 load we
+	 *    skipped more than makes up for the overhead added by
+	 *    lazy mode.
+	 */
+	bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 /*
  * Blindly accessing user memory from NMI context can be dangerous
@@ -439,6 +442,7 @@ static inline void __native_flush_tlb_one_user(unsigned long addr)
 {
 	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
 
+	//invpcid_flush_one(kern_pcid(loaded_mm_asid), addr);
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 
 	if (!static_cpu_has(X86_FEATURE_PTI))
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd10d91a6115..34027f36a944 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
 	free_area_init_nodes(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
 	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b0c3065aad5d..755b2bb3e5b6 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -144,7 +144,7 @@ void leave_mm(int cpu)
 		return;
 
 	/* Warn if we're not lazy. */
-	WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+	WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
 	switch_mm(NULL, &init_mm, NULL);
 }
@@ -276,7 +276,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 {
 	struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
 	u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-	bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+	bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
 	unsigned cpu = smp_processor_id();
 	u64 next_tlb_gen;
 	bool need_flush;
@@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate.is_lazy, false);
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
 	/*
 	 * The membarrier system call requires a full memory barrier and
@@ -462,7 +462,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
 		return;
 
-	this_cpu_write(cpu_tlbstate.is_lazy, true);
+	this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
 }
 
 /*
@@ -543,7 +543,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
 	VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
 		   loaded_mm->context.ctx_id);
 
-	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
+	if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
 		/*
 		 * We're in lazy mode.  We need to at least flush our
 		 * paging-structure cache to avoid speculatively reading
@@ -659,11 +659,14 @@ static void flush_tlb_func_remote(void *info)
 
 static inline bool tlb_is_not_lazy(int cpu)
 {
-	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
+	return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
 }
 
 static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
 
+DEFINE_PER_CPU_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
+EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
+
 void native_flush_tlb_multi(const struct cpumask *cpumask,
 			    const struct flush_tlb_info *info)
 {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (7 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call Nadav Amit
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

Using __clear_bit() and __cpumask_clear_cpu() is more efficient than
using their atomic counterparts. Use them when atomicity is not needed,
such as when manipulating bitmasks that are on the stack.
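
The distinction can be sketched with GCC/Clang __atomic builtins standing
in for the kernel's lock-prefixed instructions (an assumption of this
sketch; the kernel helpers are arch-specific):

```c
#include <assert.h>

/* Non-atomic analogue of __clear_bit(): a plain read-modify-write,
 * safe only when no other CPU can touch the word concurrently
 * (e.g. a mask living on the local stack). */
static void nonatomic_clear_bit(int nr, unsigned long *mask)
{
	*mask &= ~(1UL << nr);
}

/* Atomic analogue of clear_bit(): needed when the word is shared,
 * at the price of a (much costlier) atomic read-modify-write. */
static void atomic_clear_bit(int nr, unsigned long *mask)
{
	__atomic_fetch_and(mask, ~(1UL << nr), __ATOMIC_RELAXED);
}

/* Helper for demonstration: clear bit nr of mask with either variant
 * and return the result. */
static unsigned long demo_clear(unsigned long mask, int nr, int use_atomic)
{
	if (use_atomic)
		atomic_clear_bit(nr, &mask);
	else
		nonatomic_clear_bit(nr, &mask);
	return mask;
}
```

Both variants compute the same result; the non-atomic one simply avoids
the serializing instruction when exclusivity is already guaranteed.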

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/kernel/apic/apic_flat_64.c   | 4 ++--
 arch/x86/kernel/apic/x2apic_cluster.c | 2 +-
 arch/x86/kernel/smp.c                 | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index 0005c284a5c5..65072858f553 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -78,7 +78,7 @@ flat_send_IPI_mask_allbutself(const struct cpumask *cpumask, int vector)
 	int cpu = smp_processor_id();
 
 	if (cpu < BITS_PER_LONG)
-		clear_bit(cpu, &mask);
+		__clear_bit(cpu, &mask);
 
 	_flat_send_IPI_mask(mask, vector);
 }
@@ -92,7 +92,7 @@ static void flat_send_IPI_allbutself(int vector)
 			unsigned long mask = cpumask_bits(cpu_online_mask)[0];
 
 			if (cpu < BITS_PER_LONG)
-				clear_bit(cpu, &mask);
+				__clear_bit(cpu, &mask);
 
 			_flat_send_IPI_mask(mask, vector);
 		}
diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 7685444a106b..609e499387a1 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -50,7 +50,7 @@ __x2apic_send_IPI_mask(const struct cpumask *mask, int vector, int apic_dest)
 	cpumask_copy(tmpmsk, mask);
 	/* If IPI should not be sent to self, clear current CPU */
 	if (apic_dest != APIC_DEST_ALLINC)
-		cpumask_clear_cpu(smp_processor_id(), tmpmsk);
+		__cpumask_clear_cpu(smp_processor_id(), tmpmsk);
 
 	/* Collapse cpus in a cluster so a single IPI per cluster is sent */
 	for_each_cpu(cpu, tmpmsk) {
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 04adc8d60aed..acddd988602d 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -146,7 +146,7 @@ void native_send_call_func_ipi(const struct cpumask *mask)
 	}
 
 	cpumask_copy(allbutself, cpu_online_mask);
-	cpumask_clear_cpu(smp_processor_id(), allbutself);
+	__cpumask_clear_cpu(smp_processor_id(), allbutself);
 
 	if (cpumask_equal(mask, allbutself) &&
 	    cpumask_equal(cpu_online_mask, cpu_callout_mask))
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (8 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
  2019-05-31  6:36 ` [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack Nadav Amit
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

There are some opportunities to improve the performance of
inter-processor function calls.

First, call_single_data, which is used for communicating the data that
is consumed by the function call, is currently not cacheline aligned.
This might lead to false sharing, since structures that are used for
communication with two different CPUs can share a cacheline.

Second, the data that is used as an argument for the remotely executed
function resides at a different address (and in a different cacheline),
so it also needs to travel between caches (in addition to
call_single_data).

Third, some functions can be executed asynchronously, but their data
(i.e., info, which is used as an argument to the function) must stay
untouched until they are done. Such functions might be executed
synchronously right now due to lack of infrastructure.

Allow callers of __smp_call_function_many() and __on_each_cpu_mask() to
request function data to be inlined with the interprocessor message on
the same cache-line. Once the IPI is received, info is first copied to
the receiver stack to both avoid false sharing and support asynchronous
mode, in which the initiator of the remote-call does not wait for the
receiver.

While a remote function is executed, save the currently executed
function in an externally visible per-CPU variable. This information
will soon be needed for asynchronous remote TLB flushing.
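
The inlining scheme can be sketched in userspace C with a simplified
stand-in for __call_single_data, assuming a 64-byte cacheline:

```c
#include <assert.h>
#include <string.h>

#define SMP_CACHE_BYTES 64	/* assumed cache-line size */

/* Simplified call_single_data: func, flags and an info pointer. */
struct csd {
	void (*func)(void *info);
	unsigned int flags;
	void *info;
};

/* Inlined variant: the payload occupies the rest of the csd's cache
 * line instead of being reached through the info pointer, so a single
 * line transfer carries both the message and its argument. */
struct csd_inlined {
	struct csd csd;
	unsigned char inlined_info[SMP_CACHE_BYTES - sizeof(struct csd)];
};

static int received;

/* Receiver side: copy the payload to the local stack before consuming
 * it, so the sender's line need not stay hot during execution and the
 * sender may proceed without waiting (the asynchronous mode). */
static void recv_inlined(const struct csd_inlined *c, size_t info_size)
{
	unsigned char local[sizeof(c->inlined_info)];

	memcpy(local, c->inlined_info, info_size);
	memcpy(&received, local, sizeof(received));
}

/* Round-trip an int through the inlined slot. */
static int roundtrip(int value)
{
	struct csd_inlined c = { .csd = { .flags = 0 } };

	memcpy(c.inlined_info, &value, sizeof(value));
	recv_inlined(&c, sizeof(value));
	return received;
}
```

The MAX_FUNC_INFO_SIZE bound in the patch plays the role of the
inlined_info array size above: payloads larger than the remainder of the
line must fall back to the pointer-based path.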

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/smp.h | 21 ++++++++----
 kernel/smp.c        | 83 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 87 insertions(+), 17 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index b69abc88259d..eb449086feb3 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -18,8 +18,11 @@ typedef void (*smp_call_func_t)(void *info);
 struct __call_single_data {
 	struct llist_node llist;
 	smp_call_func_t func;
-	void *info;
 	unsigned int flags;
+	union {
+		void *info;
+		u8 inlined_info[0];
+	};
 };
 
 /* Use __aligned() to avoid to use 2 cache lines for 1 csd */
@@ -32,12 +35,18 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 			     int wait);
 
+void __on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+		void *info, size_t info_size, bool wait);
+
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-		void *info, bool wait);
+static inline void on_each_cpu_mask(const struct cpumask *mask,
+				smp_call_func_t func, void *info, bool wait)
+{
+	__on_each_cpu_mask(mask, func, info, 0, wait);
+}
 
 /*
  * Call a function on all processors.  May be used during early boot while
@@ -109,13 +118,13 @@ void smp_call_function(smp_call_func_t func, void *info, int wait);
 
 void __smp_call_function_many(const struct cpumask *mask,
 			      smp_call_func_t remote_func,
-			      smp_call_func_t local_func,
-			      void *info, bool wait);
+			      smp_call_func_t local_func, void *info,
+			      size_t info_size, bool wait);
 
 static inline void smp_call_function_many(const struct cpumask *mask,
 				smp_call_func_t func, void *info, bool wait)
 {
-	__smp_call_function_many(mask, func, NULL, info, wait);
+	__smp_call_function_many(mask, func, NULL, info, 0, wait);
 }
 
 int smp_call_function_any(const struct cpumask *mask,
diff --git a/kernel/smp.c b/kernel/smp.c
index f1a358f9c34c..46cbf611a38d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -25,6 +25,7 @@
 enum {
 	CSD_FLAG_LOCK		= 0x01,
 	CSD_FLAG_SYNCHRONOUS	= 0x02,
+	CSD_FLAG_INLINED	= 0x04,
 };
 
 struct call_function_data {
@@ -35,8 +36,16 @@ struct call_function_data {
 
 static DEFINE_PER_CPU(struct call_function_data, cfd_data);
 
+/* Function which is currently executed asynchronously. */
+DECLARE_PER_CPU(smp_call_func_t, async_func_in_progress);
+DEFINE_PER_CPU(smp_call_func_t, async_func_in_progress);
+EXPORT_PER_CPU_SYMBOL(async_func_in_progress);
+
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
 
+#define MAX_FUNC_INFO_SIZE						\
+	(SMP_CACHE_BYTES - offsetof(struct __call_single_data, inlined_info[0]))
+
 static void flush_smp_call_function_queue(bool warn_cpu_offline);
 
 int smpcfd_prepare_cpu(unsigned int cpu)
@@ -51,7 +60,15 @@ int smpcfd_prepare_cpu(unsigned int cpu)
 		free_cpumask_var(cfd->cpumask);
 		return -ENOMEM;
 	}
-	cfd->csd = alloc_percpu(call_single_data_t);
+
+	/*
+	 * Allocate a whole cache line to leave space for inlined data
+	 * adjacent to call_single_data_t. First ensure that the struct
+	 * indeed fits within a cache line.
+	 */
+	BUILD_BUG_ON(sizeof(call_single_data_t) > SMP_CACHE_BYTES);
+
+	cfd->csd = __alloc_percpu(SMP_CACHE_BYTES, SMP_CACHE_BYTES);
 	if (!cfd->csd) {
 		free_cpumask_var(cfd->cpumask);
 		free_cpumask_var(cfd->cpumask_ipi);
@@ -235,16 +252,36 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 	}
 
 	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
+		u8 inlined_info[MAX_FUNC_INFO_SIZE] __aligned(sizeof(long));
 		smp_call_func_t func = csd->func;
-		void *info = csd->info;
+		void *info;
+
+		/*
+		 * Check whether we were asked to copy the info to local
+		 * storage and use the local copy.
+		 */
+		if (csd->flags & CSD_FLAG_INLINED) {
+			memcpy(inlined_info, csd->inlined_info, MAX_FUNC_INFO_SIZE);
+			info = &inlined_info;
+		} else {
+			info = csd->info;
+		}
 
 		/* Do we wait until *after* callback? */
 		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
 			func(info);
 			csd_unlock(csd);
 		} else {
+			this_cpu_write(async_func_in_progress, csd->func);
 			csd_unlock(csd);
 			func(info);
+
+			/*
+			 * Ensure the write that indicates the func is completed
+			 * is only performed after the function is called.
+			 */
+			barrier();
+			this_cpu_write(async_func_in_progress, NULL);
 		}
 	}
 
@@ -395,10 +432,15 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  *		fast and non-blocking. If NULL is provided, no function will
  *		be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
+ * @info_size: Size of @info to copy into the inlined message that is sent to
+ *	       the target CPUs. If zero, @info is not copied and the pointer is
+ *	       used directly. Must be smaller than %MAX_FUNC_INFO_SIZE.
  * @wait: If true, wait (atomically) until function has completed
  *        on other CPUs.
  *
- * If @wait is true, then returns once @func has returned.
+ * If @wait is true, then returns once @func has returned. Otherwise, returns
+ * after the IPI is delivered. If @info_size > 0, the function returns only
+ * after @info has been copied by the remote CPU to local storage.
  *
  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler. Preemption
@@ -406,11 +448,12 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  */
 void __smp_call_function_many(const struct cpumask *mask,
 			      smp_call_func_t remote_func,
-			      smp_call_func_t local_func,
-			      void *info, bool wait)
+			      smp_call_func_t local_func, void *info,
+			      size_t info_size, bool wait)
 {
 	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
+	bool copy = (info_size != 0);
 	bool run_remote = false;
 	bool run_local = false;
 	int nr_cpus = 0;
@@ -424,6 +467,16 @@ void __smp_call_function_many(const struct cpumask *mask,
 	if (cpu_online(this_cpu) && !oops_in_progress && !early_boot_irqs_disabled)
 		lockdep_assert_irqs_enabled();
 
+	/*
+	 * If size is too big, issue a warning. To be safe, we have to wait for
+	 * the function to be done.
+	 */
+	if (unlikely(info_size > MAX_FUNC_INFO_SIZE && copy)) {
+		WARN_ONCE(1, "Inlined IPI info size exceeds maximum\n");
+		copy = false;
+		wait = 1;
+	}
+
 	/* Check if we need local execution. */
 	if (local_func && cpumask_test_cpu(this_cpu, mask))
 		run_local = true;
@@ -449,10 +502,18 @@ void __smp_call_function_many(const struct cpumask *mask,
 			last_cpu = cpu;
 
 			csd_lock(csd);
+
+			if (copy) {
+				csd->flags |= CSD_FLAG_INLINED;
+				memcpy(csd->inlined_info, info, info_size);
+			} else {
+				csd->info = info;
+			}
+
 			if (wait)
 				csd->flags |= CSD_FLAG_SYNCHRONOUS;
 			csd->func = remote_func;
-			csd->info = info;
+
 			if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
 				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
 		}
@@ -603,7 +664,7 @@ void __init smp_init(void)
 }
 
 /**
- * on_each_cpu_mask(): Run a function on processors specified by
+ * __on_each_cpu_mask(): Run a function on processors specified by
  * cpumask, which may include the local processor.
  * @mask: The set of cpus to run on (only runs on online subset).
  * @func: The function to run. This must be fast and non-blocking.
@@ -618,16 +679,16 @@ void __init smp_init(void)
  * exception is that it may be used during early boot while
  * early_boot_irqs_disabled is set.
  */
-void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
-			void *info, bool wait)
+void __on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
+			void *info, size_t info_size, bool wait)
 {
 	preempt_disable();
 
-	__smp_call_function_many(mask, func, func, info, wait);
+	__smp_call_function_many(mask, func, func, info, info_size, wait);
 
 	preempt_enable();
 }
-EXPORT_SYMBOL(on_each_cpu_mask);
+EXPORT_SYMBOL(__on_each_cpu_mask);
 
 /*
  * on_each_cpu_cond(): Call a function on each processor for which
-- 
2.20.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (9 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  2019-05-31 10:57   ` Peter Zijlstra
  2019-05-31 21:14   ` Andy Lutomirski
  2019-05-31  6:36 ` [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack Nadav Amit
  11 siblings, 2 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Dave Hansen

When we flush userspace mappings, we can defer the TLB flushes, as long as
the following conditions are met:

1. No tables are freed, since otherwise speculative page walks might
   cause machine-checks.

2. No one would access userspace before flush takes place. Specifically,
   NMI handlers and kprobes would avoid accessing userspace.

Use the new SMP support to execute remote function calls with inlined
data for this purpose. The remote TLB flushing function is executed
asynchronously: the local CPU continues execution as soon as the IPI is
delivered, before the function is actually executed. Since tlb_flush_info
is copied, there is no risk it will change before the TLB flush is
actually executed.

Change nmi_uaccess_okay() to check whether a remote TLB flush is
currently in progress on this CPU by checking whether the asynchronously
called function is the remote TLB flushing function. The current
implementation disallows access in such cases, but it is also possible
to flush the entire TLB in such case and allow access.

When page tables are freed or when kernel PTEs are removed, perform
synchronous TLB flushes, but still inline the data with the IPI data,
although the performance gains in this case are likely to be much
smaller.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 arch/x86/mm/tlb.c               | 15 +++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index a1fea36d5292..75e4c4263af6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -245,6 +245,10 @@ struct tlb_state_shared {
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
+DECLARE_PER_CPU(smp_call_func_t, async_func_in_progress);
+
+extern void flush_tlb_func_remote(void *info);
+
 /*
  * Blindly accessing user memory from NMI context can be dangerous
  * if we're in the middle of switching the current user task or
@@ -259,6 +263,14 @@ static inline bool nmi_uaccess_okay(void)
 
 	VM_WARN_ON_ONCE(!loaded_mm);
 
+	/*
+	 * If we are in the middle of a TLB flush, access is not allowed. We
+	 * could have just flushed the entire TLB and allow access, but it is
+	 * easier and safer just to disallow access.
+	 */
+	if (this_cpu_read(async_func_in_progress) == flush_tlb_func_remote)
+		return false;
+
 	/*
 	 * The condition we want to check is
 	 * current_mm->pgd == __va(read_cr3_pa()).  This may be slow, though,
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 755b2bb3e5b6..fd7e90adbe43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -644,7 +644,7 @@ static void flush_tlb_func_local(void *info)
 	flush_tlb_func_common(f, true, reason);
 }
 
-static void flush_tlb_func_remote(void *info)
+void flush_tlb_func_remote(void *info)
 {
 	const struct flush_tlb_info *f = info;
 
@@ -730,7 +730,7 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
 	 */
 	if (info->freed_tables)
 		__smp_call_function_many(cpumask, flush_tlb_func_remote,
-					 flush_tlb_func_local, (void *)info, 1);
+				flush_tlb_func_local, (void *)info, sizeof(*info), 1);
 	else {
 		/*
 		 * Although we could have used on_each_cpu_cond_mask(),
@@ -753,8 +753,10 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
 			if (tlb_is_not_lazy(cpu))
 				__cpumask_set_cpu(cpu, cond_cpumask);
 		}
+
 		__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
-					 flush_tlb_func_local, (void *)info, 1);
+					 flush_tlb_func_local, (void *)info,
+					 sizeof(*info), 0);
 	}
 }
 
@@ -915,7 +917,12 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 
 		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		/*
+		 * We have to wait for the remote shootdown to be done since it
+		 * is kernel space.
+		 */
+		__on_each_cpu_mask(cpu_online_mask, do_kernel_range_flush,
+				   info, sizeof(*info), 1);
 
 		put_flush_tlb_info();
 	}
-- 
2.20.1



* [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack
  2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
                   ` (10 preceding siblings ...)
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
@ 2019-05-31  6:36 ` Nadav Amit
  11 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31  6:36 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, Nadav Amit, Brian Gerst, H . Peter Anvin,
	Rik van Riel

Partially revert 3db6d5a5eca ("x86/mm/tlb: Remove 'struct
flush_tlb_info' from the stack").

Now that we copy flush_tlb_info and inline it with the IPI information,
we can put it back onto the stack. This simplifies the code and should be
slightly more robust. The stack is also a bit more likely to be cached
than a global variable.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/tlb.c | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fd7e90adbe43..81170fd6b1dd 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -827,15 +827,17 @@ static inline void put_flush_tlb_info(void)
 static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 			      const struct flush_tlb_info *info)
 {
-	int this_cpu = smp_processor_id();
 	bool flush_others = false;
+	int this_cpu;
+
+	this_cpu = get_cpu();
 
 	if (cpumask_any_but(cpumask, this_cpu) < nr_cpu_ids)
 		flush_others = true;
 
 	if (static_branch_likely(&flush_tlb_multi_enabled) && flush_others) {
 		flush_tlb_multi(cpumask, info);
-		return;
+		goto out;
 	}
 
 	if (cpumask_test_cpu(this_cpu, cpumask)) {
@@ -847,27 +849,33 @@ static void flush_tlb_on_cpus(const cpumask_t *cpumask,
 
 	if (flush_others)
 		flush_tlb_others(cpumask, info);
+
+out:
+	put_cpu();
 }
 
 void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables)
 {
-	struct flush_tlb_info *info;
-	u64 new_tlb_gen;
+	struct flush_tlb_info info = {
+		.mm		= mm,
+		.stride_shift	= stride_shift,
+		.freed_tables	= freed_tables,
+	};
 
 	/* Should we flush just the requested range? */
 	if ((end == TLB_FLUSH_ALL) ||
 	    ((end - start) >> stride_shift) > tlb_single_page_flush_ceiling) {
-		start = 0;
-		end = TLB_FLUSH_ALL;
+		info.start = 0;
+		info.end = TLB_FLUSH_ALL;
+	} else {
+		info.start = start;
+		info.end = end;
 	}
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
-	new_tlb_gen = inc_mm_tlb_gen(mm);
-
-	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-				  new_tlb_gen);
+	info.new_tlb_gen = inc_mm_tlb_gen(mm);
 
 	/*
 	 * Assert that mm_cpumask() corresponds with the loaded mm. We got one
@@ -878,9 +886,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 			(mm == this_cpu_read(cpu_tlbstate.loaded_mm)) ||
 			mm == &init_mm);
 
-	flush_tlb_on_cpus(mm_cpumask(mm), info);
-
-	put_flush_tlb_info();
+	flush_tlb_on_cpus(mm_cpumask(mm), &info);
 }
 
 
@@ -913,18 +919,18 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
 		on_each_cpu(do_flush_tlb_all, NULL, 1);
 	} else {
-		struct flush_tlb_info *info;
-
-		info = get_flush_tlb_info(NULL, start, end, 0, false, 0);
+		struct flush_tlb_info info = {
+			.mm		= NULL,
+			.start		= start,
+			.end		= end,
+		};
 
 		/*
 		 * We have to wait for the remote shootdown to be done since it
 		 * is kernel space.
 		 */
 		__on_each_cpu_mask(cpu_online_mask, do_kernel_range_flush,
-				   info, sizeof(*info), 1);
-
-		put_flush_tlb_info();
+				   &info, sizeof(info), 1);
 	}
 }
 
@@ -942,9 +948,7 @@ static const struct flush_tlb_info full_flush_tlb_info = {
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	preempt_disable();
 	flush_tlb_on_cpus(&batch->cpumask, &full_flush_tlb_info);
-	preempt_enable();
 
 	cpumask_clear(&batch->cpumask);
 }
-- 
2.20.1



* Re: [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared
  2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
@ 2019-05-31 10:17   ` Peter Zijlstra
  2019-05-31 17:50     ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Peter Zijlstra @ 2019-05-31 10:17 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Rik van Riel

On Thu, May 30, 2019 at 11:36:40PM -0700, Nadav Amit wrote:
> cfd_data is marked as shared, but although it holds pointers to shared
> data structures, it is private per core.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  kernel/smp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 6b411ee86ef6..f1a358f9c34c 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -33,7 +33,7 @@ struct call_function_data {
>  	cpumask_var_t		cpumask_ipi;
>  };
>  
> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
> +static DEFINE_PER_CPU(struct call_function_data, cfd_data);

Should that not be DEFINE_PER_CPU_ALIGNED() then?

>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct llist_head, call_single_queue);
>  
> -- 
> 2.20.1
> 


* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
@ 2019-05-31 10:57   ` Peter Zijlstra
  2019-05-31 18:29     ` Nadav Amit
  2019-05-31 18:44     ` Andy Lutomirski
  2019-05-31 21:14   ` Andy Lutomirski
  1 sibling, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2019-05-31 10:57 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> When we flush userspace mappings, we can defer the TLB flushes, as long
> the following conditions are met:
> 
> 1. No tables are freed, since otherwise speculative page walks might
>    cause machine-checks.
> 
> 2. No one would access userspace before flush takes place. Specifically,
>    NMI handlers and kprobes would avoid accessing userspace.
> 
> Use the new SMP support to execute remote function calls with inlined
> data for the matter. The function remote TLB flushing function would be
> executed asynchronously and the local CPU would continue execution as
> soon as the IPI was delivered, before the function was actually
> executed. Since tlb_flush_info is copied, there is no risk it would
> change before the TLB flush is actually executed.
> 
> Change nmi_uaccess_okay() to check whether a remote TLB flush is
> currently in progress on this CPU by checking whether the asynchronously
> called function is the remote TLB flushing function. The current
> implementation disallows access in such cases, but it is also possible
> to flush the entire TLB in such case and allow access.

ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
different from today, where the NMI can hit in the middle of the TLB
invalidation?

Also; since we're not waiting on the IPI, what prevents us from freeing
the user pages before the remote CPU is 'done' with them? Currently the
synchronous IPI is like a sync point where we *know* the remote CPU is
completely done accessing the page.

Where getting an IPI stops speculation, speculation again restarts
inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
speculation can happen on that TLB entry, even though we've already
freed and re-used the user-page.

Also, what happens if the TLB invalidation IPI is stuck behind another
smp_function_call IPI that is doing user-access?

As said,.. brain hurts.


* Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
@ 2019-05-31 11:48   ` Juergen Gross
  2019-05-31 19:44     ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Juergen Gross @ 2019-05-31 11:48 UTC (permalink / raw)
  To: Nadav Amit, Peter Zijlstra, Andy Lutomirski
  Cc: Borislav Petkov, Dave Hansen, Ingo Molnar, Thomas Gleixner, x86,
	linux-kernel, K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger,
	Sasha Levin, Paolo Bonzini, Dave Hansen, Boris Ostrovsky,
	linux-hyperv, virtualization, kvm, xen-devel

On 31/05/2019 08:36, Nadav Amit wrote:
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
> 
> Add a static key to tell whether this new interface is supported.
> 
> Cc: "K. Y. Srinivasan" <kys@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Stephen Hemminger <sthemmin@microsoft.com>
> Cc: Sasha Levin <sashal@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: x86@kernel.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: linux-hyperv@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: virtualization@lists.linux-foundation.org
> Cc: kvm@vger.kernel.org
> Cc: xen-devel@lists.xenproject.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/hyperv/mmu.c                 |  2 +
>  arch/x86/include/asm/paravirt.h       |  8 +++
>  arch/x86/include/asm/paravirt_types.h |  6 ++
>  arch/x86/include/asm/tlbflush.h       |  6 ++
>  arch/x86/kernel/kvm.c                 |  1 +
>  arch/x86/kernel/paravirt.c            |  3 +
>  arch/x86/mm/tlb.c                     | 80 +++++++++++++++++++++++----
>  arch/x86/xen/mmu_pv.c                 |  2 +
>  8 files changed, 96 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index e65d7fe6489f..ca28b400c87c 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
>  	pr_info("Using hypercall for remote TLB flush\n");
>  	pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
>  	pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> +
> +	static_key_disable(&flush_tlb_multi_enabled.key);
>  }
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25c38a05c1c..192be7254457 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
>  #endif
>  }
>  
> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
> +
>  static inline void __flush_tlb(void)
>  {
>  	PVOP_VCALL0(mmu.flush_tlb_user);
> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
>  	PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
>  }
>  
> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
> +				   const struct flush_tlb_info *info)
> +{
> +	PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
> +}
> +
>  static inline void flush_tlb_others(const struct cpumask *cpumask,
>  				    const struct flush_tlb_info *info)
>  {
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 946f8f1f1efc..3a156e63c57d 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>  	void (*flush_tlb_user)(void);
>  	void (*flush_tlb_kernel)(void);
>  	void (*flush_tlb_one_user)(unsigned long addr);
> +	/*
> +	 * flush_tlb_multi() is the preferred interface. When it is used,
> +	 * flush_tlb_others() should return false.

Didn't you want to remove/change this comment?


Juergen


* Re: [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared
  2019-05-31 10:17   ` Peter Zijlstra
@ 2019-05-31 17:50     ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers, LKML, Rik van Riel

> On May 31, 2019, at 3:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 30, 2019 at 11:36:40PM -0700, Nadav Amit wrote:
>> cfd_data is marked as shared, but although it holds pointers to shared
>> data structures, it is private per core.
>> 
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Rik van Riel <riel@surriel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Nadav Amit <namit@vmware.com>
>> ---
>> kernel/smp.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 6b411ee86ef6..f1a358f9c34c 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -33,7 +33,7 @@ struct call_function_data {
>> 	cpumask_var_t		cpumask_ipi;
>> };
>> 
>> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_function_data, cfd_data);
>> +static DEFINE_PER_CPU(struct call_function_data, cfd_data);
> 
> Should that not be DEFINE_PER_CPU_ALIGNED() then?

Yes. I don’t know what I was thinking. I’ll change it.



* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 10:57   ` Peter Zijlstra
@ 2019-05-31 18:29     ` Nadav Amit
  2019-05-31 19:20       ` Jann Horn
  2019-05-31 18:44     ` Andy Lutomirski
  1 sibling, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, the arch/x86 maintainers, LKML, Dave Hansen,
	Jann Horn

[ +Jann Horn ]

> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
>> 
>> Use the new SMP support to execute remote function calls with inlined
>> data for the matter. The function remote TLB flushing function would be
>> executed asynchronously and the local CPU would continue execution as
>> soon as the IPI was delivered, before the function was actually
>> executed. Since tlb_flush_info is copied, there is no risk it would
>> change before the TLB flush is actually executed.
>> 
>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>> currently in progress on this CPU by checking whether the asynchronously
>> called function is the remote TLB flushing function. The current
>> implementation disallows access in such cases, but it is also possible
>> to flush the entire TLB in such case and allow access.
> 
> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> different from today, where the NMI can hit in the middle of the TLB
> invalidation?

Let’s assume that a page mapping is removed so a PTE is changed from present
to non-present. Today, the page would only be recycled after all the cores
flushed their TLB. So NMI handlers accessing the page are safe - they might
either access the previously mapped page (which still has not been reused)
or fault, but they cannot access the wrong data.

Following this change it might happen that one core would still have stale
mappings for a page that has already been recycled for a new use (e.g., it is
allocated to a different process). The core that has the stale mapping would
find itself in the interrupt handler. But we must ensure the NMI handler
does not access this page.

> Also; since we're not waiting on the IPI, what prevents us from freeing
> the user pages before the remote CPU is 'done' with them? Currently the
> synchronous IPI is like a sync point where we *know* the remote CPU is
> completely done accessing the page.

We might free them before the remote TLB flush is done, but only after the
IPI has been acknowledged and right before the actual TLB flush is about to
start. See flush_smp_call_function_queue().

                if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
                        func(info);
                        csd_unlock(csd);
                } else {
                        this_cpu_write(async_func_in_progress, csd->func);
                        csd_unlock(csd); (*)
                        func(info);

    
We do wait for the IPI to be received and the entry of the specific TLB
invalidation to be processed. Following the instruction that is marked with
(*), the initiator of the TLB shootdown would be able to continue, but not
before. We just do not wait for the actual flush (called by func()) to take
place.

> Where getting an IPI stops speculation, speculation again restarts
> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> speculation can happen on that TLB entry, even though we've already
> freed and re-used the user-page.

Let’s consider what the impact of such speculation might be. Virtually
indexed caches might have the wrong data. But this data should not be used
before the flush, since we are already in the interrupt handler, and we
should be able to prevent accesses to userspace.

A new vector for Spectre-type attacks might be opened, but I don’t see how.

A #MC might be caused. I tried to avoid it by not allowing freeing of
page-tables in such a way. Did I miss something else? Some interaction with
MTRR changes? I’ll think about it some more, but I don’t see how.


> Also, what happens if the TLB invalidation IPI is stuck behind another
> smp_function_call IPI that is doing user-access?

It’s not exactly the IPI but the call_single_data. In such a case, as shown
above, csd_unlock() is only called once you are about to start handling
the specific TLB flush.

> 
> As said,.. brain hurts.

I understand. Mine too. The textbook says that we need to flush all the TLBs
before we can recycle the page, be guaranteed that a write-protected page will
not change, and so on. I think that since we should be able to guarantee
that the userspace addresses that are invalidated would never be used by
the kernel before the flush, we should be safe. After the flush, all the
artifacts of speculative caching should disappear.

I might be wrong, and I do question myself, but I don’t see how.



* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 10:57   ` Peter Zijlstra
  2019-05-31 18:29     ` Nadav Amit
@ 2019-05-31 18:44     ` Andy Lutomirski
  2019-05-31 19:31       ` Nadav Amit
  1 sibling, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 18:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nadav Amit, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen



> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long
>> the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
>> 
>> Use the new SMP support to execute remote function calls with inlined
>> data for the matter. The function remote TLB flushing function would be
>> executed asynchronously and the local CPU would continue execution as
>> soon as the IPI was delivered, before the function was actually
>> executed. Since tlb_flush_info is copied, there is no risk it would
>> change before the TLB flush is actually executed.
>> 
>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>> currently in progress on this CPU by checking whether the asynchronously
>> called function is the remote TLB flushing function. The current
>> implementation disallows access in such cases, but it is also possible
>> to flush the entire TLB in such case and allow access.
> 
> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> different from today, where the NMI can hit in the middle of the TLB
> invalidation?
> 
> Also; since we're not waiting on the IPI, what prevents us from freeing
> the user pages before the remote CPU is 'done' with them? Currently the
> synchronous IPI is like a sync point where we *know* the remote CPU is
> completely done accessing the page.
> 
> Where getting an IPI stops speculation, speculation again restarts
> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> speculation can happen on that TLB entry, even though we've already
> freed and re-used the user-page.
> 
> Also, what happens if the TLB invalidation IPI is stuck behind another
> smp_function_call IPI that is doing user-access?
> 
> As said,.. brain hurts.

Speculation aside, any code doing dirty tracking needs the flush to happen for real before it reads the dirty bit.

How does this patch guarantee that the flush is really done before someone depends on it?

* Re: [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate
  2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
@ 2019-05-31 18:48   ` Andy Lutomirski
  2019-05-31 19:42     ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 18:48 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen


> On May 30, 2019, at 11:36 PM, Nadav Amit <namit@vmware.com> wrote:
> 
> cpu_tlbstate is mostly private and only the variable is_lazy is shared.
> This causes some false-sharing when TLB flushes are performed.
> 
> Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
> each one accordingly.
> 

At some point we’ll want to do a better job with our PV flush code, and I suspect we’ll end up with more of this shared again.

> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
> arch/x86/include/asm/tlbflush.h | 40 ++++++++++++++++++---------------
> arch/x86/mm/init.c              |  2 +-
> arch/x86/mm/tlb.c               | 15 ++++++++-----
> 3 files changed, 32 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 79272938cf79..a1fea36d5292 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -178,23 +178,6 @@ struct tlb_state {
>    u16 loaded_mm_asid;
>    u16 next_asid;
> 
> -    /*
> -     * We can be in one of several states:
> -     *
> -     *  - Actively using an mm.  Our CPU's bit will be set in
> -     *    mm_cpumask(loaded_mm) and is_lazy == false;
> -     *
> -     *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
> -     *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
> -     *
> -     *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
> -     *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
> -     *    We're heuristically guessing that the CR3 load we
> -     *    skipped more than makes up for the overhead added by
> -     *    lazy mode.
> -     */
> -    bool is_lazy;
> -
>    /*
>     * If set we changed the page tables in such a way that we
>     * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
> @@ -240,7 +223,27 @@ struct tlb_state {
>     */
>    struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
> };
> -DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
> +DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
> +
> +struct tlb_state_shared {
> +    /*
> +     * We can be in one of several states:
> +     *
> +     *  - Actively using an mm.  Our CPU's bit will be set in
> +     *    mm_cpumask(loaded_mm) and is_lazy == false;
> +     *
> +     *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
> +     *    will not be set in mm_cpumask(&init_mm) and is_lazy == false.
> +     *
> +     *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
> +     *    is set in mm_cpumask(loaded_mm), but is_lazy == true.
> +     *    We're heuristically guessing that the CR3 load we
> +     *    skipped more than makes up for the overhead added by
> +     *    lazy mode.
> +     */
> +    bool is_lazy;
> +};
> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
> 
> /*
>  * Blindly accessing user memory from NMI context can be dangerous
> @@ -439,6 +442,7 @@ static inline void __native_flush_tlb_one_user(unsigned long addr)
> {
>    u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> 
> +    //invpcid_flush_one(kern_pcid(loaded_mm_asid), addr);
>    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> 
>    if (!static_cpu_has(X86_FEATURE_PTI))
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index fd10d91a6115..34027f36a944 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
>    free_area_init_nodes(max_zone_pfns);
> }
> 
> -__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
> +__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
>    .loaded_mm = &init_mm,
>    .next_asid = 1,
>    .cr4 = ~0UL,    /* fail hard if we screw up cr4 shadow initialization */
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index b0c3065aad5d..755b2bb3e5b6 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -144,7 +144,7 @@ void leave_mm(int cpu)
>        return;
> 
>    /* Warn if we're not lazy. */
> -    WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
> +    WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
> 
>    switch_mm(NULL, &init_mm, NULL);
> }
> @@ -276,7 +276,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
> {
>    struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
>    u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
> -    bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
> +    bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
>    unsigned cpu = smp_processor_id();
>    u64 next_tlb_gen;
>    bool need_flush;
> @@ -321,7 +321,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>        __flush_tlb_all();
>    }
> #endif
> -    this_cpu_write(cpu_tlbstate.is_lazy, false);
> +    this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
> 
>    /*
>     * The membarrier system call requires a full memory barrier and
> @@ -462,7 +462,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>    if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
>        return;
> 
> -    this_cpu_write(cpu_tlbstate.is_lazy, true);
> +    this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
> }
> 
> /*
> @@ -543,7 +543,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f,
>    VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
>           loaded_mm->context.ctx_id);
> 
> -    if (this_cpu_read(cpu_tlbstate.is_lazy)) {
> +    if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) {
>        /*
>         * We're in lazy mode.  We need to at least flush our
>         * paging-structure cache to avoid speculatively reading
> @@ -659,11 +659,14 @@ static void flush_tlb_func_remote(void *info)
> 
> static inline bool tlb_is_not_lazy(int cpu)
> {
> -    return !per_cpu(cpu_tlbstate.is_lazy, cpu);
> +    return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);
> }
> 
> static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
> 
> +DEFINE_PER_CPU_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
> +EXPORT_PER_CPU_SYMBOL(cpu_tlbstate_shared);
> +
> void native_flush_tlb_multi(const struct cpumask *cpumask,
>                const struct flush_tlb_info *info)
> {
> -- 
> 2.20.1
> 

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 18:29     ` Nadav Amit
@ 2019-05-31 19:20       ` Jann Horn
  2019-05-31 20:04         ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Jann Horn @ 2019-05-31 19:20 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

On Fri, May 31, 2019 at 8:29 PM Nadav Amit <namit@vmware.com> wrote:
>
> [ +Jann Horn ]
>
> > On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> >> When we flush userspace mappings, we can defer the TLB flushes, as long
> >> the following conditions are met:
> >>
> >> 1. No tables are freed, since otherwise speculative page walks might
> >>   cause machine-checks.
> >>
> >> 2. No one would access userspace before flush takes place. Specifically,
> >>   NMI handlers and kprobes would avoid accessing userspace.
[...]
> A #MC might be caused. I tried to avoid it by not allowing freeing of
> page-tables in such way. Did I miss something else? Some interaction with
> MTRR changes? I’ll think about it some more, but I don’t see how.

I don't really know much about this topic, but here's a random comment
since you cc'ed me: If the physical memory range was freed and
reallocated, could you end up with speculatively executed cached
memory reads from I/O memory? (And if so, would that be bad?)

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 18:44     ` Andy Lutomirski
@ 2019-05-31 19:31       ` Nadav Amit
  2019-05-31 20:13         ` Dave Hansen
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 19:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen

> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>> the following conditions are met:
>>> 
>>> 1. No tables are freed, since otherwise speculative page walks might
>>>  cause machine-checks.
>>> 
>>> 2. No one would access userspace before flush takes place. Specifically,
>>>  NMI handlers and kprobes would avoid accessing userspace.
>>> 
>>> Use the new SMP support to execute remote function calls with inlined
>>> data for the matter. The function remote TLB flushing function would be
>>> executed asynchronously and the local CPU would continue execution as
>>> soon as the IPI was delivered, before the function was actually
>>> executed. Since tlb_flush_info is copied, there is no risk it would
>>> change before the TLB flush is actually executed.
>>> 
>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>>> currently in progress on this CPU by checking whether the asynchronously
>>> called function is the remote TLB flushing function. The current
>>> implementation disallows access in such cases, but it is also possible
>>> to flush the entire TLB in such case and allow access.
>> 
>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
>> different from today, where the NMI can hit in the middle of the TLB
>> invalidation?
>> 
>> Also; since we're not waiting on the IPI, what prevents us from freeing
>> the user pages before the remote CPU is 'done' with them? Currently the
>> synchronous IPI is like a sync point where we *know* the remote CPU is
>> completely done accessing the page.
>> 
>> Where getting an IPI stops speculation, speculation again restarts
>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
>> speculation can happen on that TLB entry, even though we've already
>> freed and re-used the user-page.
>> 
>> Also, what happens if the TLB invalidation IPI is stuck behind another
>> smp_function_call IPI that is doing user-access?
>> 
>> As said,.. brain hurts.
> 
> Speculation aside, any code doing dirty tracking needs the flush to happen
> for real before it reads the dirty bit.
> 
> How does this patch guarantee that the flush is really done before someone
> depends on it?

I was always under the impression that the dirty-bit is pass-through: the
A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
what happens when you invalidate the PTE, and have already marked the PTE as
non-present? Would the CPU set the dirty-bit at this point?

In this regard, I remember this thread of Dave Hansen [1], which also seems
to support the notion that the dirty-bit is set on write and not on
INVLPG.

The SDM discussion of the "Virtual TLB Scheme” also seems to support this
claim: the guest dirty bit is set in section 32.3.5.2, "Response to Page
Faults”, and not in section 32.3.5, "Response to Uses of INVLPG”.

Am I wrong?


[1] https://groups.google.com/forum/#!topic/linux.kernel/HBgh0uT24K8


* Re: [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate
  2019-05-31 18:48   ` Andy Lutomirski
@ 2019-05-31 19:42     ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 19:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

> On May 31, 2019, at 11:48 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
>> On May 30, 2019, at 11:36 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>> cpu_tlbstate is mostly private and only the variable is_lazy is shared.
>> This causes some false-sharing when TLB flushes are performed.
>> 
>> Break cpu_tlbstate intro cpu_tlbstate and cpu_tlbstate_shared, and mark
>> each one accordingly.
> 
> At some point we’ll want to do a better job with our PV flush code, and I
> suspect we’ll end up with more of this shared again.

In the usual use-case, when you flush the TLB, will you write something to
cpu_tlbstate that should be visible to other cores? I don’t see why, even if
PV is used.

Anyhow, I was always under the impression that PV should not punish
bare-metal.

The other option is to take cpu_tlbstate and rearrange it so the shared
stuff will not be next to the private. I just don’t know how to do it
without making an assumption about the cacheline size.

* Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently
  2019-05-31 11:48   ` Juergen Gross
@ 2019-05-31 19:44     ` Nadav Amit
  0 siblings, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 19:44 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Sasha Levin,
	Paolo Bonzini, Dave Hansen, Boris Ostrovsky, linux-hyperv,
	virtualization, kvm, xen-devel

> On May 31, 2019, at 4:48 AM, Juergen Gross <jgross@suse.com> wrote:
> 
> On 31/05/2019 08:36, Nadav Amit wrote:
>> 
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>> 	void (*flush_tlb_user)(void);
>> 	void (*flush_tlb_kernel)(void);
>> 	void (*flush_tlb_one_user)(unsigned long addr);
>> +	/*
>> +	 * flush_tlb_multi() is the preferred interface. When it is used,
>> +	 * flush_tlb_others() should return false.
> 
> Didn't you want to remove/change this comment?

Yes! Sorry for that. Fixed now.

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 19:20       ` Jann Horn
@ 2019-05-31 20:04         ` Nadav Amit
  2019-05-31 20:37           ` Jann Horn
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 20:04 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

> On May 31, 2019, at 12:20 PM, Jann Horn <jannh@google.com> wrote:
> 
> On Fri, May 31, 2019 at 8:29 PM Nadav Amit <namit@vmware.com> wrote:
>> [ +Jann Horn ]
>> 
>>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> the following conditions are met:
>>>> 
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>  cause machine-checks.
>>>> 
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>  NMI handlers and kprobes would avoid accessing userspace.
> [...]
>> A #MC might be caused. I tried to avoid it by not allowing freeing of
>> page-tables in such way. Did I miss something else? Some interaction with
>> MTRR changes? I’ll think about it some more, but I don’t see how.
> 
> I don't really know much about this topic, but here's a random comment
> since you cc'ed me: If the physical memory range was freed and
> reallocated, could you end up with speculatively executed cached
> memory reads from I/O memory? (And if so, would that be bad?)

Thanks. I thought that your experience with TLB page-freeing bugs may
be valuable, and you frequently find my mistakes. ;-)

Yes, speculatively executed cached reads from the I/O memory are a concern.
IIRC they caused #MC on AMD. If page-tables are not changed, but only PTEs
are changed, I don’t see how it can be a problem. I also looked at the MTRR
setting code, but I don’t see a concrete problem.


* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 19:31       ` Nadav Amit
@ 2019-05-31 20:13         ` Dave Hansen
  2019-05-31 20:37           ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Hansen @ 2019-05-31 20:13 UTC (permalink / raw)
  To: Nadav Amit, Andy Lutomirski
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

On 5/31/19 12:31 PM, Nadav Amit wrote:
>> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>>
>>
>>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>> the following conditions are met:
>>>>
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>  cause machine-checks.
>>>>
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>  NMI handlers and kprobes would avoid accessing userspace.
>>>>
>>>> Use the new SMP support to execute remote function calls with inlined
>>>> data for the matter. The function remote TLB flushing function would be
>>>> executed asynchronously and the local CPU would continue execution as
>>>> soon as the IPI was delivered, before the function was actually
>>>> executed. Since tlb_flush_info is copied, there is no risk it would
>>>> change before the TLB flush is actually executed.
>>>>
>>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>>>> currently in progress on this CPU by checking whether the asynchronously
>>>> called function is the remote TLB flushing function. The current
>>>> implementation disallows access in such cases, but it is also possible
>>>> to flush the entire TLB in such case and allow access.
>>>
>>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
>>> different from today, where the NMI can hit in the middle of the TLB
>>> invalidation?
>>>
>>> Also; since we're not waiting on the IPI, what prevents us from freeing
>>> the user pages before the remote CPU is 'done' with them? Currently the
>>> synchronous IPI is like a sync point where we *know* the remote CPU is
>>> completely done accessing the page.
>>>
>>> Where getting an IPI stops speculation, speculation again restarts
>>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
>>> speculation can happen on that TLB entry, even though we've already
>>> freed and re-used the user-page.
>>>
>>> Also, what happens if the TLB invalidation IPI is stuck behind another
>>> smp_function_call IPI that is doing user-access?
>>>
>>> As said,.. brain hurts.
>>
>> Speculation aside, any code doing dirty tracking needs the flush to happen
>> for real before it reads the dirty bit.
>>
>> How does this patch guarantee that the flush is really done before someone
>> depends on it?
> 
> I was always under the impression that the dirty-bit is pass-through - the
> A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
> what happens when you invalidate the PTE, and have already marked the PTE as
> non-present? Would the CPU set the dirty-bit at this point?

Modulo bugs^Werrata...  No.  What actually happens is that a
try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
contents of the TLB are discarded and only the in-memory contents matter
for forward progress.  If Present=0 when the PTE is reached, you'll get
a normal Present=0 page fault.

> In this regard, I remember this thread of Dave Hansen [1], which also seems
> to me as supporting the notion the dirty-bit is set on write and not on
> INVLPG.

... and that's the erratum I was hoping you wouldn't mention. :)

But, yeah, I don't think it's possible to set the Dirty bit on INVLPG.
The bits are set on establishing TLB entries, not on evicting or
flushing them.

I hope that clears it up.

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:13         ` Dave Hansen
@ 2019-05-31 20:37           ` Andy Lutomirski
  2019-05-31 20:42             ` Nadav Amit
  2019-05-31 21:06             ` Dave Hansen
  0 siblings, 2 replies; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 20:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Nadav Amit, Peter Zijlstra, Andy Lutomirski, Borislav Petkov,
	Ingo Molnar, Thomas Gleixner, x86, linux-kernel, Dave Hansen

On Fri, May 31, 2019 at 1:13 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/31/19 12:31 PM, Nadav Amit wrote:
> >> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >>
> >>
> >>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> >>>> When we flush userspace mappings, we can defer the TLB flushes, as long
> >>>> the following conditions are met:
> >>>>
> >>>> 1. No tables are freed, since otherwise speculative page walks might
> >>>>  cause machine-checks.
> >>>>
> >>>> 2. No one would access userspace before flush takes place. Specifically,
> >>>>  NMI handlers and kprobes would avoid accessing userspace.
> >>>>
> >>>> Use the new SMP support to execute remote function calls with inlined
> >>>> data for the matter. The function remote TLB flushing function would be
> >>>> executed asynchronously and the local CPU would continue execution as
> >>>> soon as the IPI was delivered, before the function was actually
> >>>> executed. Since tlb_flush_info is copied, there is no risk it would
> >>>> change before the TLB flush is actually executed.
> >>>>
> >>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
> >>>> currently in progress on this CPU by checking whether the asynchronously
> >>>> called function is the remote TLB flushing function. The current
> >>>> implementation disallows access in such cases, but it is also possible
> >>>> to flush the entire TLB in such case and allow access.
> >>>
> >>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
> >>> different from today, where the NMI can hit in the middle of the TLB
> >>> invalidation?
> >>>
> >>> Also; since we're not waiting on the IPI, what prevents us from freeing
> >>> the user pages before the remote CPU is 'done' with them? Currently the
> >>> synchronous IPI is like a sync point where we *know* the remote CPU is
> >>> completely done accessing the page.
> >>>
> >>> Where getting an IPI stops speculation, speculation again restarts
> >>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
> >>> speculation can happen on that TLB entry, even though we've already
> >>> freed and re-used the user-page.
> >>>
> >>> Also, what happens if the TLB invalidation IPI is stuck behind another
> >>> smp_function_call IPI that is doing user-access?
> >>>
> >>> As said,.. brain hurts.
> >>
> >> Speculation aside, any code doing dirty tracking needs the flush to happen
> >> for real before it reads the dirty bit.
> >>
> >> How does this patch guarantee that the flush is really done before someone
> >> depends on it?
> >
> > I was always under the impression that the dirty-bit is pass-through - the
> > A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
> > what happens when you invalidate the PTE, and have already marked the PTE as
> > non-present? Would the CPU set the dirty-bit at this point?
>
> Modulo bugs^Werrata...  No.  What actually happens is that a
> try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
> contents of the TLB are discarded and only the in-memory contents matter
> for forward progress.  If Present=0 when the PTE is reached, you'll get
> a normal Present=0 page fault.

Wait, does that mean that you can do a lock cmpxchg or similar to
clear the dirty and writable bits together and, if the dirty bit was
clear, skip the TLB flush?  If so, nifty!  Modulo errata, of course.
And I seem to remember some exceptions relating to CET shadow stack
involving the dirty bit being set on not-present pages.

>
> > In this regard, I remember this thread of Dave Hansen [1], which also seems
> > to me as supporting the notion the dirty-bit is set on write and not on
> > INVLPG.
>
> ... and that's the erratum I was hoping you wouldn't mention. :)
>
> But, yeah, I don't think it's possible to set the Dirty bit on INVLPG.
> The bits are set on establishing TLB entries, not on evicting or
> flushing them.
>
> I hope that clears it up.

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:04         ` Nadav Amit
@ 2019-05-31 20:37           ` Jann Horn
  0 siblings, 0 replies; 36+ messages in thread
From: Jann Horn @ 2019-05-31 20:37 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, the arch/x86 maintainers, LKML,
	Dave Hansen

On Fri, May 31, 2019 at 10:04 PM Nadav Amit <namit@vmware.com> wrote:
> > On May 31, 2019, at 12:20 PM, Jann Horn <jannh@google.com> wrote:
> > On Fri, May 31, 2019 at 8:29 PM Nadav Amit <namit@vmware.com> wrote:
> >> [ +Jann Horn ]
> >>
> >>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
> >>>> When we flush userspace mappings, we can defer the TLB flushes, as long
> >>>> the following conditions are met:
> >>>>
> >>>> 1. No tables are freed, since otherwise speculative page walks might
> >>>>  cause machine-checks.
> >>>>
> >>>> 2. No one would access userspace before flush takes place. Specifically,
> >>>>  NMI handlers and kprobes would avoid accessing userspace.
> > [...]
> >> A #MC might be caused. I tried to avoid it by not allowing freeing of
> >> page-tables in such way. Did I miss something else? Some interaction with
> >> MTRR changes? I’ll think about it some more, but I don’t see how.
> >
> > I don't really know much about this topic, but here's a random comment
> > since you cc'ed me: If the physical memory range was freed and
> > reallocated, could you end up with speculatively executed cached
> > memory reads from I/O memory? (And if so, would that be bad?)
>
> Thanks. I thought that your experience with TLB page-freeing bugs may
> be valuable, and you frequently find my mistakes. ;-)
>
> Yes, speculatively executed cached reads from the I/O memory are a concern.
> IIRC they caused #MC on AMD. If page-tables are not changes, but only PTEs
> are changed, I don’t see how it can be a problem. I also looked at the MTRR
> setting code, but I don’t see a concrete problem.

Can the *physical memory range* not be freed and assigned to another
device? Like, when you mess around with memory hotplug and PCI hotplug?

* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:37           ` Andy Lutomirski
@ 2019-05-31 20:42             ` Nadav Amit
  2019-05-31 21:06             ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 20:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Peter Zijlstra, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

> On May 31, 2019, at 1:37 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Fri, May 31, 2019 at 1:13 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 5/31/19 12:31 PM, Nadav Amit wrote:
>>>> On May 31, 2019, at 11:44 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On May 31, 2019, at 3:57 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>> 
>>>>>> On Thu, May 30, 2019 at 11:36:44PM -0700, Nadav Amit wrote:
>>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long
>>>>>> the following conditions are met:
>>>>>> 
>>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>>> cause machine-checks.
>>>>>> 
>>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>>>> 
>>>>>> Use the new SMP support to execute remote function calls with inlined
>>>>>> data for this purpose. The remote TLB flushing function would be
>>>>>> executed asynchronously and the local CPU would continue execution as
>>>>>> soon as the IPI was delivered, before the function was actually
>>>>>> executed. Since tlb_flush_info is copied, there is no risk it would
>>>>>> change before the TLB flush is actually executed.
>>>>>> 
>>>>>> Change nmi_uaccess_okay() to check whether a remote TLB flush is
>>>>>> currently in progress on this CPU by checking whether the asynchronously
>>>>>> called function is the remote TLB flushing function. The current
>>>>>> implementation disallows access in such cases, but it is also possible
>>>>>> to flush the entire TLB in such case and allow access.
>>>>> 
>>>>> ARGGH, brain hurt. I'm not sure I fully understand this one. How is it
>>>>> different from today, where the NMI can hit in the middle of the TLB
>>>>> invalidation?
>>>>> 
>>>>> Also; since we're not waiting on the IPI, what prevents us from freeing
>>>>> the user pages before the remote CPU is 'done' with them? Currently the
>>>>> synchronous IPI is like a sync point where we *know* the remote CPU is
>>>>> completely done accessing the page.
>>>>> 
>>>>> Where getting an IPI stops speculation, speculation again restarts
>>>>> inside the interrupt handler, and until we've passed the INVLPG/MOV CR3,
>>>>> speculation can happen on that TLB entry, even though we've already
>>>>> freed and re-used the user-page.
>>>>> 
>>>>> Also, what happens if the TLB invalidation IPI is stuck behind another
>>>>> smp_function_call IPI that is doing user-access?
>>>>> 
>>>>> As said,.. brain hurts.
>>>> 
>>>> Speculation aside, any code doing dirty tracking needs the flush to happen
>>>> for real before it reads the dirty bit.
>>>> 
>>>> How does this patch guarantee that the flush is really done before someone
>>>> depends on it?
>>> 
>>> I was always under the impression that the dirty-bit is pass-through - the
>>> A/D-assist walks the tables and sets the dirty bit upon access. Otherwise,
>>> what happens when you invalidate the PTE, and have already marked the PTE as
>>> non-present? Would the CPU set the dirty-bit at this point?
>> 
>> Modulo bugs^Werrata...  No.  What actually happens is that a
>> try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
>> contents of the TLB are discarded and only the in-memory contents matter
>> for forward progress.  If Present=0 when the PTE is reached, you'll get
>> a normal Present=0 page fault.
> 
> Wait, does that mean that you can do a lock cmpxchg or similar to
> clear the dirty and writable bits together and, if the dirty bit was
> clear, skip the TLB flush?  If so, nifty!  Modulo errata, of course.
> And I seem to remember some exceptions relating to CET shadow stack
> involving the dirty bit being set on not-present pages.

I did something similar with the access-bit in the past.

Anyhow, I have a bug here - the code does not wait for the indication that
the IPI was received. I need to rerun the performance measurements once I
fix it.



* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 20:37           ` Andy Lutomirski
  2019-05-31 20:42             ` Nadav Amit
@ 2019-05-31 21:06             ` Dave Hansen
  1 sibling, 0 replies; 36+ messages in thread
From: Dave Hansen @ 2019-05-31 21:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Nadav Amit, Peter Zijlstra, Borislav Petkov, Ingo Molnar,
	Thomas Gleixner, x86, linux-kernel, Dave Hansen

On 5/31/19 1:37 PM, Andy Lutomirski wrote:
>> Modulo bugs^Werrata...  No.  What actually happens is that a
>> try-to-set-dirty-bit page table walk acts just like a TLB miss.  The old
>> contents of the TLB are discarded and only the in-memory contents matter
>> for forward progress.  If Present=0 when the PTE is reached, you'll get
>> a normal Present=0 page fault.
> Wait, does that mean that you can do a lock cmpxchg or similar to
> clear the dirty and writable bits together and, if the dirty bit was
> clear, skip the TLB flush?

Yeah, in the case that you're going from R/W->R/O, you can be assured
that no writable TLB entries were established if D=0.

Is it totally safe against other things?  Hell if I know. :)

I'd want to go look very closely at software things like GUP-for-write
before we went and actually did this.  But I can't think of any hardware
reasons off the top of my head why it wouldn't work.

A quick perusal of the SDM didn't turn up any slam dunks to support doing
this.  It's a bit cagey about exactly what can be cached and when.  The
supporting reasoning might have escaped my quick scan, though.

> If so, nifty!  Modulo errata, of course. And I seem to remember some
> exceptions relating to CET shadow stack involving the dirty bit being
> set on not-present pages.

Yeah: "no processor that supports CET will ever set the dirty flag in a
paging-structure entry in which the R/W flag is 0"

> https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

Which probably means that the things we were saying above are
technically only architectural on CET-enabled processors.  I think the
behavior is actually much more widespread than that, though.


* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
  2019-05-31 10:57   ` Peter Zijlstra
@ 2019-05-31 21:14   ` Andy Lutomirski
  2019-05-31 21:33     ` Nadav Amit
  1 sibling, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 21:14 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen

On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>
> When we flush userspace mappings, we can defer the TLB flushes, as long as
> the following conditions are met:
>
> 1. No tables are freed, since otherwise speculative page walks might
>    cause machine-checks.
>
> 2. No one would access userspace before flush takes place. Specifically,
>    NMI handlers and kprobes would avoid accessing userspace.
>

I think I need to ask the big picture question.  When someone calls
flush_tlb_mm_range() (or the other entry points), if no page tables
were freed, they want the guarantee that future accesses (initiated
observably after the flush returns) will not use paging entries that
were replaced by stores ordered before flush_tlb_mm_range().  We also
need the guarantee that any effects from any memory access using the
old paging entries will become globally visible before
flush_tlb_mm_range().

I'm wondering if receipt of an IPI is enough to guarantee any of this.
If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
dirty bit?  An interrupt entry today is fully serializing by the time
it finishes, but interrupt entries are epicly slow, and I don't know
if the APIC waits long enough.  Heck, what if IRQs are off on the
remote CPU?  There are a handful of places where we touch user memory
with IRQs off, and it's (sadly) possible for user code to turn off
IRQs with iopl().

I *think* that Intel has stated recently that SMT siblings are
guaranteed to stop speculating when you write to the APIC ICR to poke
them, but SMT is very special.

My general conclusion is that I think the code needs to document what
is guaranteed and why.

--Andy


* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 21:14   ` Andy Lutomirski
@ 2019-05-31 21:33     ` Nadav Amit
  2019-05-31 21:47       ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 21:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, X86 ML, LKML, Dave Hansen

> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>> When we flush userspace mappings, we can defer the TLB flushes, as long as
>> the following conditions are met:
>> 
>> 1. No tables are freed, since otherwise speculative page walks might
>>   cause machine-checks.
>> 
>> 2. No one would access userspace before flush takes place. Specifically,
>>   NMI handlers and kprobes would avoid accessing userspace.
> 
> I think I need to ask the big picture question.  When someone calls
> flush_tlb_mm_range() (or the other entry points), if no page tables
> were freed, they want the guarantee that future accesses (initiated
> observably after the flush returns) will not use paging entries that
> were replaced by stores ordered before flush_tlb_mm_range().  We also
> need the guarantee that any effects from any memory access using the
> old paging entries will become globally visible before
> flush_tlb_mm_range().
> 
> I'm wondering if receipt of an IPI is enough to guarantee any of this.
> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
> dirty bit?  An interrupt entry today is fully serializing by the time
> it finishes, but interrupt entries are epicly slow, and I don't know
> if the APIC waits long enough.  Heck, what if IRQs are off on the
> remote CPU?  There are a handful of places where we touch user memory
> with IRQs off, and it's (sadly) possible for user code to turn off
> IRQs with iopl().
> 
> I *think* that Intel has stated recently that SMT siblings are
> guaranteed to stop speculating when you write to the APIC ICR to poke
> them, but SMT is very special.
> 
> My general conclusion is that I think the code needs to document what
> is guaranteed and why.

I think I might have managed to confuse you with a bug I introduced (a
last-minute bug while doing some cleanup). This bug does not affect the
performance much, but it might have led you to think that I use the APIC
sending as synchronization.

The idea is not for us to rely on write to ICR as something serializing. The
flow should be as follows:


	CPU0					CPU1

flush_tlb_mm_range()
 __smp_call_function_many()
  [ prepare call_single_data (csd) ]
  [ lock csd ] 
  [ send IPI ]
	(*)
  [ wait for csd to be unlocked ]
					[ interrupt ]
					[ copy csd info to stack ]
					[ csd unlock ]
  [ find csd is unlocked ]
  [ continue (**) ]
					[ flush TLB ]


At (**) the pages might be recycled, written-back to disk, etc. Note that
during (*), CPU0 might do some local TLB flushes, making it very likely that
CSD will be unlocked by the time it gets there.

As you can see, I don’t rely on any special micro-architectural behavior.
The synchronization is done purely in software.

Does it make more sense now?



* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 21:33     ` Nadav Amit
@ 2019-05-31 21:47       ` Andy Lutomirski
  2019-05-31 22:07         ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Andy Lutomirski @ 2019-05-31 21:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andy Lutomirski, Peter Zijlstra, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen


On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:

>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> 
>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>> When we flush userspace mappings, we can defer the TLB flushes, as long as
>>> the following conditions are met:
>>> 
>>> 1. No tables are freed, since otherwise speculative page walks might
>>>  cause machine-checks.
>>> 
>>> 2. No one would access userspace before flush takes place. Specifically,
>>>  NMI handlers and kprobes would avoid accessing userspace.
>> 
>> I think I need to ask the big picture question.  When someone calls
>> flush_tlb_mm_range() (or the other entry points), if no page tables
>> were freed, they want the guarantee that future accesses (initiated
>> observably after the flush returns) will not use paging entries that
>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>> need the guarantee that any effects from any memory access using the
>> old paging entries will become globally visible before
>> flush_tlb_mm_range().
>> 
>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>> dirty bit?  An interrupt entry today is fully serializing by the time
>> it finishes, but interrupt entries are epicly slow, and I don't know
>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>> remote CPU?  There are a handful of places where we touch user memory
>> with IRQs off, and it's (sadly) possible for user code to turn off
>> IRQs with iopl().
>> 
>> I *think* that Intel has stated recently that SMT siblings are
>> guaranteed to stop speculating when you write to the APIC ICR to poke
>> them, but SMT is very special.
>> 
>> My general conclusion is that I think the code needs to document what
>> is guaranteed and why.
> 
> I think I might have managed to confuse you with a bug I introduced (a
> last-minute bug while doing some cleanup). This bug does not affect the
> performance much, but it might have led you to think that I use the APIC
> sending as synchronization.
> 
> The idea is not for us to rely on write to ICR as something serializing. The
> flow should be as follows:
> 
> 
>    CPU0                    CPU1
> 
> flush_tlb_mm_range()
> __smp_call_function_many()
>  [ prepare call_single_data (csd) ]
>  [ lock csd ] 
>  [ send IPI ]
>    (*)
>  [ wait for csd to be unlocked ]
>                    [ interrupt ]
>                    [ copy csd info to stack ]
>                    [ csd unlock ]
>  [ find csd is unlocked ]
>  [ continue (**) ]
>                    [ flush TLB ]
> 
> 
> At (**) the pages might be recycled, written-back to disk, etc. Note that
> during (*), CPU0 might do some local TLB flushes, making it very likely that
> CSD will be unlocked by the time it gets there.
> 
> As you can see, I don’t rely on any special micro-architectural behavior.
> The synchronization is done purely in software.
> 
> Does it make more sense now?
> 

Yes.  Have you benchmarked this?


* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 21:47       ` Andy Lutomirski
@ 2019-05-31 22:07         ` Nadav Amit
  2019-06-07  5:28           ` Nadav Amit
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-05-31 22:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Peter Zijlstra, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen

> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> 
>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>>> When we flush userspace mappings, we can defer the TLB flushes, as long as
>>>> the following conditions are met:
>>>> 
>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>> cause machine-checks.
>>>> 
>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>> NMI handlers and kprobes would avoid accessing userspace.
>>> 
>>> I think I need to ask the big picture question.  When someone calls
>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>> were freed, they want the guarantee that future accesses (initiated
>>> observably after the flush returns) will not use paging entries that
>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>> need the guarantee that any effects from any memory access using the
>>> old paging entries will become globally visible before
>>> flush_tlb_mm_range().
>>> 
>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>> remote CPU?  There are a handful of places where we touch user memory
>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>> IRQs with iopl().
>>> 
>>> I *think* that Intel has stated recently that SMT siblings are
>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>> them, but SMT is very special.
>>> 
>>> My general conclusion is that I think the code needs to document what
>>> is guaranteed and why.
>> 
>> I think I might have managed to confuse you with a bug I introduced (a
>> last-minute bug while doing some cleanup). This bug does not affect the
>> performance much, but it might have led you to think that I use the APIC
>> sending as synchronization.
>> 
>> The idea is not for us to rely on write to ICR as something serializing. The
>> flow should be as follows:
>> 
>> 
>>   CPU0                    CPU1
>> 
>> flush_tlb_mm_range()
>> __smp_call_function_many()
>> [ prepare call_single_data (csd) ]
>> [ lock csd ] 
>> [ send IPI ]
>>   (*)
>> [ wait for csd to be unlocked ]
>>                   [ interrupt ]
>>                   [ copy csd info to stack ]
>>                   [ csd unlock ]
>> [ find csd is unlocked ]
>> [ continue (**) ]
>>                   [ flush TLB ]
>> 
>> 
>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>> CSD will be unlocked by the time it gets there.
>> 
>> As you can see, I don’t rely on any special micro-architectural behavior.
>> The synchronization is done purely in software.
>> 
>> Does it make more sense now?
> 
> Yes.  Have you benchmarked this?

Partially. Numbers are indeed worse. Here are preliminary results, comparing
to v1 (concurrent):

	n_threads	before		concurrent	+async
	---------	------		---------- 	------
	1		661		663		663
	2		1436		1225 (-14%)	1115 (-22%)
	4		1571		1421 (-10%)	1289 (-18%)

Note that the benefit of "async" would be greater if the initiator does not
flush the TLB at all. This might happen in the case of kswapd, for example.
Let me try some micro-optimizations first, run more benchmarks and get back
to you.



* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-05-31 22:07         ` Nadav Amit
@ 2019-06-07  5:28           ` Nadav Amit
  2019-06-07 16:42             ` Andy Lutomirski
  0 siblings, 1 reply; 36+ messages in thread
From: Nadav Amit @ 2019-06-07  5:28 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Andy Lutomirski, Borislav Petkov, Dave Hansen, Ingo Molnar,
	Thomas Gleixner, X86 ML, LKML, Dave Hansen

> On May 31, 2019, at 3:07 PM, Nadav Amit <namit@vmware.com> wrote:
> 
>> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> 
>> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> 
>>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long as
>>>>> the following conditions are met:
>>>>> 
>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>> cause machine-checks.
>>>>> 
>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>> 
>>>> I think I need to ask the big picture question.  When someone calls
>>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>>> were freed, they want the guarantee that future accesses (initiated
>>>> observably after the flush returns) will not use paging entries that
>>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>>> need the guarantee that any effects from any memory access using the
>>>> old paging entries will become globally visible before
>>>> flush_tlb_mm_range().
>>>> 
>>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>>> remote CPU?  There are a handful of places where we touch user memory
>>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>>> IRQs with iopl().
>>>> 
>>>> I *think* that Intel has stated recently that SMT siblings are
>>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>>> them, but SMT is very special.
>>>> 
>>>> My general conclusion is that I think the code needs to document what
>>>> is guaranteed and why.
>>> 
>>> I think I might have managed to confuse you with a bug I introduced (a
>>> last-minute bug while doing some cleanup). This bug does not affect the
>>> performance much, but it might have led you to think that I use the APIC
>>> sending as synchronization.
>>> 
>>> The idea is not for us to rely on write to ICR as something serializing. The
>>> flow should be as follows:
>>> 
>>> 
>>>  CPU0                    CPU1
>>> 
>>> flush_tlb_mm_range()
>>> __smp_call_function_many()
>>> [ prepare call_single_data (csd) ]
>>> [ lock csd ] 
>>> [ send IPI ]
>>>  (*)
>>> [ wait for csd to be unlocked ]
>>>                  [ interrupt ]
>>>                  [ copy csd info to stack ]
>>>                  [ csd unlock ]
>>> [ find csd is unlocked ]
>>> [ continue (**) ]
>>>                  [ flush TLB ]
>>> 
>>> 
>>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>>> CSD will be unlocked by the time it gets there.
>>> 
>>> As you can see, I don’t rely on any special micro-architectural behavior.
>>> The synchronization is done purely in software.
>>> 
>>> Does it make more sense now?
>> 
>> Yes.  Have you benchmarked this?
> 
> Partially. Numbers are indeed worse. Here are preliminary results, comparing
> to v1 (concurrent):
> 
> 	n_threads	before		concurrent	+async
> 	---------	------		---------- 	------
> 	1		661		663		663
> 	2		1436		1225 (-14%)	1115 (-22%)
> 	4		1571		1421 (-10%)	1289 (-18%)
> 
> Note that the benefit of "async" would be greater if the initiator does not
> flush the TLB at all. This might happen in the case of kswapd, for example.
> Let me try some micro-optimizations first, run more benchmarks and get back
> to you.

So I ran some more benchmarking (my benchmark is not very suitable), and tried
more things that did not help (checking for more work before returning from the
IPI handler, and avoiding redundant IPIs in such cases).

Anyhow, with a fixed version, I ran a more standard benchmark on DAX:

$ mkfs.ext4 /dev/pmem0
$ mount -o dax /dev/pmem0 /mnt/mem
$ cd /mnt/mem
$ bash -c 'echo 0 > /sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/block/pmem0/dax/write_cache'
$ sysbench fileio --file-total-size=3G --file-test-mode=rndwr 	\
	--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync prepare
$ sysbench fileio --file-total-size=3G --file-test-mode=rndwr 	\
	--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

( as you can see, I disabled the write-cache, since my machine does not have
  clwb/clflushopt and clflush appears to become a bottleneck otherwise )


The results are:
				events (avg/stddev)
				-------------------
base				1263689.0000/11481.10
concurrent			1310123.5000/19205.79	(+3.6%)
concurrent + async		1326750.2500/24563.61	(+4.9%)

So which version do you want me to submit? With or without the async part?



* Re: [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing
  2019-06-07  5:28           ` Nadav Amit
@ 2019-06-07 16:42             ` Andy Lutomirski
  0 siblings, 0 replies; 36+ messages in thread
From: Andy Lutomirski @ 2019-06-07 16:42 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Peter Zijlstra, Andy Lutomirski, Borislav Petkov, Dave Hansen,
	Ingo Molnar, Thomas Gleixner, X86 ML, LKML, Dave Hansen


On Jun 6, 2019, at 10:28 PM, Nadav Amit <namit@vmware.com> wrote:

>> On May 31, 2019, at 3:07 PM, Nadav Amit <namit@vmware.com> wrote:
>> 
>>> On May 31, 2019, at 2:47 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> 
>>> 
>>> On May 31, 2019, at 2:33 PM, Nadav Amit <namit@vmware.com> wrote:
>>> 
>>>>> On May 31, 2019, at 2:14 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> 
>>>>>> On Thu, May 30, 2019 at 11:37 PM Nadav Amit <namit@vmware.com> wrote:
>>>>>> When we flush userspace mappings, we can defer the TLB flushes, as long as
>>>>>> the following conditions are met:
>>>>>> 
>>>>>> 1. No tables are freed, since otherwise speculative page walks might
>>>>>> cause machine-checks.
>>>>>> 
>>>>>> 2. No one would access userspace before flush takes place. Specifically,
>>>>>> NMI handlers and kprobes would avoid accessing userspace.
>>>>> 
>>>>> I think I need to ask the big picture question.  When someone calls
>>>>> flush_tlb_mm_range() (or the other entry points), if no page tables
>>>>> were freed, they want the guarantee that future accesses (initiated
>>>>> observably after the flush returns) will not use paging entries that
>>>>> were replaced by stores ordered before flush_tlb_mm_range().  We also
>>>>> need the guarantee that any effects from any memory access using the
>>>>> old paging entries will become globally visible before
>>>>> flush_tlb_mm_range().
>>>>> 
>>>>> I'm wondering if receipt of an IPI is enough to guarantee any of this.
>>>>> If CPU 1 sets a dirty bit and CPU 2 writes to the APIC to send an IPI
>>>>> to CPU 1, at what point is CPU 2 guaranteed to be able to observe the
>>>>> dirty bit?  An interrupt entry today is fully serializing by the time
>>>>> it finishes, but interrupt entries are epicly slow, and I don't know
>>>>> if the APIC waits long enough.  Heck, what if IRQs are off on the
>>>>> remote CPU?  There are a handful of places where we touch user memory
>>>>> with IRQs off, and it's (sadly) possible for user code to turn off
>>>>> IRQs with iopl().
>>>>> 
>>>>> I *think* that Intel has stated recently that SMT siblings are
>>>>> guaranteed to stop speculating when you write to the APIC ICR to poke
>>>>> them, but SMT is very special.
>>>>> 
>>>>> My general conclusion is that I think the code needs to document what
>>>>> is guaranteed and why.
>>>> 
>>>> I think I might have managed to confuse you with a bug I introduced (a
>>>> last-minute bug while doing some cleanup). This bug does not affect the
>>>> performance much, but it might have led you to think that I use the APIC
>>>> sending as synchronization.
>>>> 
>>>> The idea is not for us to rely on write to ICR as something serializing. The
>>>> flow should be as follows:
>>>> 
>>>> 
>>>> CPU0                    CPU1
>>>> 
>>>> flush_tlb_mm_range()
>>>> __smp_call_function_many()
>>>> [ prepare call_single_data (csd) ]
>>>> [ lock csd ] 
>>>> [ send IPI ]
>>>> (*)
>>>> [ wait for csd to be unlocked ]
>>>>                 [ interrupt ]
>>>>                 [ copy csd info to stack ]
>>>>                 [ csd unlock ]
>>>> [ find csd is unlocked ]
>>>> [ continue (**) ]
>>>>                 [ flush TLB ]
>>>> 
>>>> 
>>>> At (**) the pages might be recycled, written-back to disk, etc. Note that
>>>> during (*), CPU0 might do some local TLB flushes, making it very likely that
>>>> CSD will be unlocked by the time it gets there.
>>>> 
>>>> As you can see, I don’t rely on any special micro-architectural behavior.
>>>> The synchronization is done purely in software.
>>>> 
>>>> Does it make more sense now?
>>> 
>>> Yes.  Have you benchmarked this?
>> 
>> Partially. Numbers are indeed worse. Here are preliminary results, comparing
>> to v1 (concurrent):
>> 
>>    n_threads    before        concurrent    +async
>>    ---------    ------        ----------    ------
>>    1        661        663        663
>>    2        1436        1225 (-14%)    1115 (-22%)
>>    4        1571        1421 (-10%)    1289 (-18%)
>> 
>> Note that the benefit of "async" would be greater if the initiator does not
>> flush the TLB at all. This might happen in the case of kswapd, for example.
>> Let me try some micro-optimizations first, run more benchmarks and get back
>> to you.
> 
> So I ran some more benchmarking (my benchmark is not very suitable), and tried
> more stuff that did not help (checking for more work before returning from the
> IPI handler, and avoid redundant IPIs in such case).
> 
> Anyhow, with a fixed version, I ran a more standard benchmark on DAX:
> 
> $ mkfs.ext4 /dev/pmem0
> $ mount -o dax /dev/pmem0 /mnt/mem
> $ cd /mnt/mem
> $ bash -c 'echo 0 > /sys/devices/platform/e820_pmem/ndbus0/region0/namespace0.0/block/pmem0/dax/write_cache'
> $ sysbench fileio --file-total-size=3G --file-test-mode=rndwr    \
>    --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync prepare
> $ sysbench fileio --file-total-size=3G --file-test-mode=rndwr    \
>    --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run
> 
> ( as you can see, I disabled the write-cache, since my machine does not have
>  clwb/clflushopt and clflush appears to become a bottleneck otherwise )
> 
> 
> The results are:
>                events (avg/stddev)
>                -------------------
> base                1263689.0000/11481.10
> concurrent            1310123.5000/19205.79    (+3.6%)
> concurrent + async        1326750.2500/24563.61    (+4.9%)
> 
> So which version do you want me to submit? With or without the async part?

I think it would be best to submit it without the async part. You can always submit that later.


Thread overview: 36+ messages
2019-05-31  6:36 [RFC PATCH v2 00/12] x86: Flush remote TLBs concurrently and async Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 01/12] smp: Remove smp_call_function() and on_each_cpu() return values Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 02/12] smp: Run functions concurrently in smp_call_function_many() Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 03/12] x86/mm/tlb: Refactor common code into flush_tlb_on_cpus() Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently Nadav Amit
2019-05-31 11:48   ` Juergen Gross
2019-05-31 19:44     ` Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 05/12] x86/mm/tlb: Optimize local TLB flushes Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 06/12] KVM: x86: Provide paravirtualized flush_tlb_multi() Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 07/12] smp: Do not mark call_function_data as shared Nadav Amit
2019-05-31 10:17   ` Peter Zijlstra
2019-05-31 17:50     ` Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 08/12] x86/tlb: Privatize cpu_tlbstate Nadav Amit
2019-05-31 18:48   ` Andy Lutomirski
2019-05-31 19:42     ` Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 09/12] x86/apic: Use non-atomic operations when possible Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 10/12] smp: Enable data inlining for inter-processor function call Nadav Amit
2019-05-31  6:36 ` [RFC PATCH v2 11/12] x86/mm/tlb: Use async and inline messages for flushing Nadav Amit
2019-05-31 10:57   ` Peter Zijlstra
2019-05-31 18:29     ` Nadav Amit
2019-05-31 19:20       ` Jann Horn
2019-05-31 20:04         ` Nadav Amit
2019-05-31 20:37           ` Jann Horn
2019-05-31 18:44     ` Andy Lutomirski
2019-05-31 19:31       ` Nadav Amit
2019-05-31 20:13         ` Dave Hansen
2019-05-31 20:37           ` Andy Lutomirski
2019-05-31 20:42             ` Nadav Amit
2019-05-31 21:06             ` Dave Hansen
2019-05-31 21:14   ` Andy Lutomirski
2019-05-31 21:33     ` Nadav Amit
2019-05-31 21:47       ` Andy Lutomirski
2019-05-31 22:07         ` Nadav Amit
2019-06-07  5:28           ` Nadav Amit
2019-06-07 16:42             ` Andy Lutomirski
2019-05-31  6:36 ` [RFC PATCH v2 12/12] x86/mm/tlb: Reverting the removal of flush_tlb_info from stack Nadav Amit
