* [PATCH 0/3] TLB flush multiple pages per IPI v5
@ 2015-06-08 12:50 Mel Gorman
  2015-06-08 12:50 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-08 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML, Mel Gorman

Changelog since V4
o Rebase to 4.1-rc6

Changelog since V3
o Drop batching of TLB flush from migration
o Redo how larger batching is managed
o Batch TLB flushes when writable entries exist

When unmapping pages it is necessary to flush the TLB. If that page was
accessed by another CPU then an IPI is used to flush the remote CPU. That
is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.

There already is a window between when a page is unmapped and when it is
TLB flushed. This series simply increases the window so multiple pages
can be flushed using a single IPI. This *should* be safe or the kernel is
hosed already but I've cc'd the x86 maintainers and some of the Intel folk
for comment.

Patch 1 simply made the rest of the series easier to write as ftrace
	could identify all the senders of TLB flush IPIs.

Patch 2 collects a list of PFNs and sends one IPI to flush them all

Patch 3 tracks when there potentially are writable TLB entries that
	need to be batched differently

The performance impact is documented in the changelogs but in the optimistic
case on a 4-socket machine the full series reduces interrupts from 900K
interrupts/second to 60K interrupts/second.

 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 arch/x86/mm/tlb.c               |   1 +
 include/linux/init_task.h       |   8 +++
 include/linux/mm_types.h        |   1 +
 include/linux/rmap.h            |   3 +
 include/linux/sched.h           |  15 +++++
 include/trace/events/tlb.h      |   3 +-
 init/Kconfig                    |   8 +++
 kernel/fork.c                   |   5 ++
 kernel/sched/core.c             |   3 +
 mm/internal.h                   |  15 +++++
 mm/rmap.c                       | 119 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     |  30 +++++++++-
 14 files changed, 210 insertions(+), 4 deletions(-)

-- 
2.3.5



* [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent
  2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
@ 2015-06-08 12:50 ` Mel Gorman
  2015-06-08 12:50 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-08 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML, Mel Gorman

It is easy to trace when an IPI is received to flush a TLB but harder to
detect what event sent it. This patch makes it easy to identify the source
of IPIs being transmitted for TLB flushes on x86.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
---
 arch/x86/mm/tlb.c          | 1 +
 include/linux/mm_types.h   | 1 +
 include/trace/events/tlb.h | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3250f2371aea..2da824c1c140 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -140,6 +140,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	info.flush_end = end;
 
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
+	trace_tlb_flush(TLB_REMOTE_SEND_IPI, end - start);
 	if (is_uv_system()) {
 		unsigned int cpu;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8d37e26a1007..86ad9f902042 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -534,6 +534,7 @@ enum tlb_flush_reason {
 	TLB_REMOTE_SHOOTDOWN,
 	TLB_LOCAL_SHOOTDOWN,
 	TLB_LOCAL_MM_SHOOTDOWN,
+	TLB_REMOTE_SEND_IPI,
 	NR_TLB_FLUSH_REASONS,
 };
 
diff --git a/include/trace/events/tlb.h b/include/trace/events/tlb.h
index 4250f364a6ca..bc8815f45f3b 100644
--- a/include/trace/events/tlb.h
+++ b/include/trace/events/tlb.h
@@ -11,7 +11,8 @@
 	EM(  TLB_FLUSH_ON_TASK_SWITCH,	"flush on task switch" )	\
 	EM(  TLB_REMOTE_SHOOTDOWN,	"remote shootdown" )		\
 	EM(  TLB_LOCAL_SHOOTDOWN,	"local shootdown" )		\
-	EMe( TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )
+	EM(  TLB_LOCAL_MM_SHOOTDOWN,	"local mm shootdown" )		\
+	EMe( TLB_REMOTE_SEND_IPI,	"remote ipi send" )
 
 /*
  * First define the enums in TLB_FLUSH_REASON to be exported to userspace
-- 
2.3.5



* [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
  2015-06-08 12:50 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
@ 2015-06-08 12:50 ` Mel Gorman
  2015-06-08 22:38   ` Andrew Morton
  2015-06-08 12:50 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
  2015-06-08 17:45 ` [PATCH 0/3] TLB flush multiple pages per IPI v5 Ingo Molnar
  3 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-08 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML, Mel Gorman

An IPI is sent to flush remote TLBs when a page is unmapped that was
recently accessed by other CPUs. There are many circumstances where this
happens but the obvious one is kswapd reclaiming pages belonging to a
running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines get
larger with more cores and more memory, the cost of these IPIs can
be high. This patch uses a structure similar in principle to a pagevec
to collect a list of PFNs and CPUs that require flushing. It then sends
one IPI per CPU that was mapping any of those pages to flush the list of
PFNs. A new TLB flush helper is required for this and one is added for
x86. Other architectures will need to decide if batching like this is both
safe and worth the memory overhead. Specifically, the requirement is:

	If a clean page is unmapped and not immediately flushed, the
	architecture must guarantee that a write to that page from a CPU
	with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is
much larger with this patch applied and is worth highlighting.
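
For clarity, the caller-side pattern this implies looks roughly like the
following sketch (illustrative only; the helper name reclaim_unmap_batch is
made up, error handling is omitted and the real change is the mm/vmscan.c
hunk below):

static void reclaim_unmap_batch(struct list_head *page_list,
				enum ttu_flags ttu_flags)
{
	struct page *page, *next;

	/* Unmap with the flush deferred; only the CPU mask and PFN are recorded */
	list_for_each_entry_safe(page, next, page_list, lru)
		if (page_mapped(page))
			try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH);

	/* The deferred PFNs must be flushed before the pages are freed */
	try_to_unmap_flush();
}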

The impact of this patch depends on the workload as measuring any benefit
requires both mapped pages co-located on the LRU and memory pressure. The
case with the biggest impact is multiple processes reading mapped pages
taken from the vm-scalability test suite. The test case uses NR_CPU readers
of mapped files that consume 10*RAM.

vmscale on a 4-node machine with 64G RAM and 48 CPUs
           4.1.0-rc6     4.1.0-rc6
             vanilla batchunmap-v5
User          577.35        618.60
System       5927.06       4195.03
Elapsed       162.21        121.31

The workload completed 25% faster with 29% less CPU time.

From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
times per second during the steady phase of the test and the patched kernel
was interrupted 180K times per second.

The impact is much lower on a small machine

vmscale on a 1-node machine with 8G RAM and 1 CPU
           4.1.0-rc6     4.1.0-rc6
             vanilla batchunmap-v5
User           59.14         58.86
System        109.15         83.78
Elapsed        27.32         23.14

It's still a noticeable improvement with vmstat showing interrupts went
from roughly 500K per second to 45K per second.

The patch will have no impact on workloads that have no memory pressure or
relatively few mapped pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/x86/Kconfig                |   1 +
 arch/x86/include/asm/tlbflush.h |   2 +
 include/linux/init_task.h       |   8 +++
 include/linux/rmap.h            |   3 ++
 include/linux/sched.h           |  14 ++++++
 init/Kconfig                    |   8 +++
 kernel/fork.c                   |   5 ++
 kernel/sched/core.c             |   3 ++
 mm/internal.h                   |  11 +++++
 mm/rmap.c                       | 105 +++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c                     |  23 ++++++++-
 11 files changed, 181 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 226d5696e1d1..4e8bd86735af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -31,6 +31,7 @@ config X86
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 	select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
 	select ARCH_SUPPORTS_INT128 if X86_64
 	select HAVE_IDE
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cd791948b286..10c197a649f5 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -152,6 +152,8 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 
+#define flush_local_tlb_addr(addr) __flush_tlb_single(addr)
+
 #ifndef CONFIG_SMP
 
 /* "_up" is for UniProcessor.
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 696d22312b31..0771937b47e1 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -175,6 +175,13 @@ extern struct task_group root_task_group;
 # define INIT_NUMA_BALANCING(tsk)
 #endif
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
+	.tlb_ubc = NULL,
+#else
+# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
+#endif
+
 #ifdef CONFIG_KASAN
 # define INIT_KASAN(tsk)						\
 	.kasan_depth = 1,
@@ -257,6 +264,7 @@ extern struct task_group root_task_group;
 	INIT_RT_MUTEXES(tsk)						\
 	INIT_VTIME(tsk)							\
 	INIT_NUMA_BALANCING(tsk)					\
+	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
 	INIT_KASAN(tsk)							\
 }
 
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c89c53a113a8..29446aeef36e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -89,6 +89,9 @@ enum ttu_flags {
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
 	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+					 * and caller guarantees they will
+					 * do a final flush if necessary */
 };
 
 #ifdef CONFIG_MMU
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26a2e6122734..57ff61b16565 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1289,6 +1289,16 @@ enum perf_event_task_context {
 	perf_nr_task_contexts,
 };
 
+/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
+#define BATCH_TLBFLUSH_SIZE 32UL
+
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+	struct cpumask cpumask;
+	unsigned long nr_pages;
+	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	void *stack;
@@ -1648,6 +1658,10 @@ struct task_struct {
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	struct tlbflush_unmap_batch *tlb_ubc;
+#endif
+
 	struct rcu_head rcu;
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec60232..1ff5d17e518a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -904,6 +904,14 @@ config ARCH_SUPPORTS_NUMA_BALANCING
 	bool
 
 #
+# For architectures that have a local TLB flush for a PFN without knowledge
+# of the VMA. The architecture must provide guarantees on what happens if
+# a clean TLB cache entry is written after the unmap. Details are in mm/rmap.c
+# near the check for should_defer_flush.
+config ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	bool
+
+#
 # For architectures that know their GCC __int128 support is sound
 #
 config ARCH_SUPPORTS_INT128
diff --git a/kernel/fork.c b/kernel/fork.c
index 03c1eaaa6ef5..8ea9e8730ada 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -257,6 +257,11 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	kfree(tsk->tlb_ubc);
+	tsk->tlb_ubc = NULL;
+#endif
+
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 123673291ffb..936ce13aeb97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,6 +1843,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+	p->tlb_ubc = NULL;
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/internal.h b/mm/internal.h
index a25e359a4039..8cbb68ccc731 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,4 +433,15 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_CMA		0x80 /* allow allocations from CMA areas */
 #define ALLOC_FAIR		0x100 /* fair zone allocation */
 
+enum ttu_flags;
+struct tlbflush_unmap_batch;
+
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+void try_to_unmap_flush(void);
+#else
+static inline void try_to_unmap_flush(void)
+{
+}
+
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9fee27..a8dbba62398a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -60,6 +60,8 @@
 
 #include <asm/tlbflush.h>
 
+#include <trace/events/tlb.h>
+
 #include "internal.h"
 
 static struct kmem_cache *anon_vma_cachep;
@@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+static void percpu_flush_tlb_batch_pages(void *data)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = data;
+	int i;
+
+	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+	for (i = 0; i < tlb_ubc->nr_pages; i++)
+		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
+}
+
+/*
+ * Flush TLB entries for recently unmapped pages from remote CPUs. It is
+ * important that if a PTE was dirty when it was unmapped that it's flushed
+ * before any IO is initiated on the page to prevent lost writes. Similarly,
+ * it must be flushed before freeing to prevent data leakage.
+ */
+void try_to_unmap_flush(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+	int cpu;
+
+	if (!tlb_ubc || !tlb_ubc->nr_pages)
+		return;
+
+	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
+		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
+
+	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
+		smp_call_function_many(&tlb_ubc->cpumask,
+			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
+	}
+	cpumask_clear(&tlb_ubc->cpumask);
+	tlb_ubc->nr_pages = 0;
+	preempt_enable();
+}
+
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));
+	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
+	tlb_ubc->nr_pages++;
+
+	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
+		try_to_unmap_flush();
+}
+
+/*
+ * Returns true if the TLB flush should be deferred to the end of a batch of
+ * unmap operations to reduce IPIs.
+ */
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	bool should_defer = false;
+
+	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
+		return false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+#else
+static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
+		struct page *page)
+{
+}
+
+static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * At what user virtual address is page expected in vma?
  * Caller should check the page is actually part of the vma.
@@ -1213,7 +1299,24 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush(vma, address, pte);
+	if (should_defer_flush(mm, flags)) {
+		/*
+		 * We clear the PTE but do not flush so potentially a remote
+		 * CPU could still be writing to the page. If the entry was
+		 * previously clean then the architecture must guarantee that
+		 * a clear->dirty transition on a cached TLB entry is written
+		 * through and traps if the PTE is unmapped.
+		 */
+		pteval = ptep_get_and_clear(mm, address, pte);
+
+		/* Potentially writable TLBs must be flushed before IO */
+		if (pte_dirty(pteval))
+			flush_tlb_page(vma, address);
+		else
+			set_tlb_ubc_flush_pending(mm, page);
+	} else {
+		pteval = ptep_clear_flush(vma, address, pte);
+	}
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd71bac..5121742ccb87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1024,7 +1024,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+			switch (try_to_unmap(page,
+					ttu_flags|TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1175,6 +1176,7 @@ keep:
 	}
 
 	mem_cgroup_uncharge_list(&free_pages);
+	try_to_unmap_flush();
 	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
@@ -2118,6 +2120,23 @@ out:
 	}
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
+/*
+ * Allocate the control structure for batch TLB flushing. An allocation
+ * failure is harmless as the reclaimer will send IPIs where necessary.
+ */
+static void alloc_tlb_ubc(void)
+{
+	if (!current->tlb_ubc)
+		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
+						GFP_KERNEL | __GFP_NOWARN);
+}
+#else
+static inline void alloc_tlb_ubc(void)
+{
+}
+#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
+
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -2152,6 +2171,8 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
 	scan_adjusted = (global_reclaim(sc) && !current_is_kswapd() &&
 			 sc->priority == DEF_PRIORITY);
 
+	alloc_tlb_ubc();
+
 	blk_start_plug(&plug);
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
-- 
2.3.5



* [PATCH 3/3] mm: Defer flush of writable TLB entries
  2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
  2015-06-08 12:50 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
  2015-06-08 12:50 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
@ 2015-06-08 12:50 ` Mel Gorman
  2015-06-08 17:45 ` [PATCH 0/3] TLB flush multiple pages per IPI v5 Ingo Molnar
  3 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-08 12:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML, Mel Gorman

If a PTE is unmapped and it's dirty then it was writable recently. Due
to deferred TLB flushing, it's best to assume a writable TLB cache entry
exists. With that assumption, the TLB must be flushed before any IO can
start or the page is freed to avoid lost writes or data corruption. This
patch defers flushing of potentially writable TLBs as long as possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/sched.h |  1 +
 mm/internal.h         |  4 ++++
 mm/rmap.c             | 28 +++++++++++++++++++++-------
 mm/vmscan.c           |  7 ++++++-
 4 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 57ff61b16565..827d9b123bd5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1296,6 +1296,7 @@ enum perf_event_task_context {
 struct tlbflush_unmap_batch {
 	struct cpumask cpumask;
 	unsigned long nr_pages;
+	bool writable;
 	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
 };
 
diff --git a/mm/internal.h b/mm/internal.h
index 8cbb68ccc731..ecf47a01420d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -438,10 +438,14 @@ struct tlbflush_unmap_batch;
 
 #ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
 void try_to_unmap_flush(void);
+void try_to_unmap_flush_dirty(void);
 #else
 static inline void try_to_unmap_flush(void)
 {
 }
+static inline void try_to_unmap_flush_dirty(void)
+{
+}
 
 #endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/rmap.c b/mm/rmap.c
index a8dbba62398a..3c8ebacfe6ef 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -621,11 +621,21 @@ void try_to_unmap_flush(void)
 	}
 	cpumask_clear(&tlb_ubc->cpumask);
 	tlb_ubc->nr_pages = 0;
+	tlb_ubc->writable = false;
 	preempt_enable();
 }
 
+/* Flush iff there are potentially writable TLB entries that can race with IO */
+void try_to_unmap_flush_dirty(void)
+{
+	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
+
+	if (tlb_ubc && tlb_ubc->writable)
+		try_to_unmap_flush();
+}
+
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
-		struct page *page)
+		struct page *page, bool writable)
 {
 	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
 
@@ -633,6 +643,14 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
 	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
 	tlb_ubc->nr_pages++;
 
+	/*
+	 * If the PTE was dirty then it's best to assume it's writable. The
+	 * caller must use try_to_unmap_flush_dirty() or try_to_unmap_flush()
+	 * before any IO is initiated on the page.
+	 */
+	if (writable)
+		tlb_ubc->writable = true;
+
 	if (tlb_ubc->nr_pages == BATCH_TLBFLUSH_SIZE)
 		try_to_unmap_flush();
 }
@@ -657,7 +675,7 @@ static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 }
 #else
 static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
-		struct page *page)
+		struct page *page, bool writable)
 {
 }
 
@@ -1309,11 +1327,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 */
 		pteval = ptep_get_and_clear(mm, address, pte);
 
-		/* Potentially writable TLBs must be flushed before IO */
-		if (pte_dirty(pteval))
-			flush_tlb_page(vma, address);
-		else
-			set_tlb_ubc_flush_pending(mm, page);
+		set_tlb_ubc_flush_pending(mm, page, pte_dirty(pteval));
 	} else {
 		pteval = ptep_clear_flush(vma, address, pte);
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5121742ccb87..0055224c52d4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1065,7 +1065,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			if (!sc->may_writepage)
 				goto keep_locked;
 
-			/* Page is dirty, try to write it out here */
+			/*
+			 * Page is dirty. Flush the TLB if a writable entry
+			 * potentially exists to avoid CPU writes after IO
+			 * starts and then write it out here
+			 */
+			try_to_unmap_flush_dirty();
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
-- 
2.3.5



* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
                   ` (2 preceding siblings ...)
  2015-06-08 12:50 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
@ 2015-06-08 17:45 ` Ingo Molnar
  2015-06-08 18:21   ` Dave Hansen
  2015-06-09  8:47   ` Mel Gorman
  3 siblings, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-08 17:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Mel Gorman <mgorman@suse.de> wrote:

> Changelog since V4
> o Rebase to 4.1-rc6
> 
> Changelog since V3
> o Drop batching of TLB flush from migration
> o Redo how larger batching is managed
> o Batch TLB flushes when writable entries exist
> 
> When unmapping pages it is necessary to flush the TLB. If that page was
> accessed by another CPU then an IPI is used to flush the remote CPU. That
> is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.
> 
> There already is a window between when a page is unmapped and when it is
> TLB flushed. This series simply increases the window so multiple pages
> can be flushed using a single IPI. This *should* be safe or the kernel is
> hosed already but I've cc'd the x86 maintainers and some of the Intel folk
> for comment.
> 
> Patch 1 simply made the rest of the series easier to write as ftrace
> 	could identify all the senders of TLB flush IPIs.
> 
> Patch 2 collects a list of PFNs and sends one IPI to flush them all
> 
> Patch 3 tracks when there potentially are writable TLB entries that
> 	need to be batched differently
> 
> The performance impact is documented in the changelogs but in the optimistic
> case on a 4-socket machine the full series reduces interrupts from 900K
> interrupts/second to 60K interrupts/second.

Yeah, so I think batching writable flushes is useful I think, but I disagree with 
one aspect of it: with the need to gather _pfns_ and batch them over to the remote 
CPU.

As per my measurements the __flush_tlb_single() primitive (which you use in patch
#2) is very expensive on most Intel and AMD CPUs. It barely makes sense for 2
pages and gets exponentially worse. It's probably done in microcode and its 
performance is horrible.

So have you explored the possibility to significantly simplify your patch-set by 
only deferring the flushing, and doing a simple TLB flush on the remote CPU? As 
per your measurements there must be tons and tons of flushes of lots of pages, so the
pfn tracking simply does not make sense.

That way there's no memory overhead and no complex tracking of pfns - we'd 
basically track a simple deferred-flush bit instead. We'd still have the benefits 
of batching the IPIs, which is the main win.
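
To illustrate the shape of what I mean, a rough sketch (the names are made up,
it reuses the tlb_ubc pointer from your patch and uses x86's local_flush_tlb(),
so it's not a drop-in replacement): the batch carries only the cpumask and a
pending bit, and the IPI handler does a plain local flush:

struct tlbflush_unmap_batch {
	struct cpumask cpumask;
	bool flush_required;
};

/* IPI handler: drop all non-global TLB entries on this CPU */
static void flush_tlb_batch_ipi(void *unused)
{
	local_flush_tlb();
}

void try_to_unmap_flush(void)
{
	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;

	if (!tlb_ubc || !tlb_ubc->flush_required)
		return;

	/* Still one IPI per CPU in the mask, but no pfn list to build or walk */
	on_each_cpu_mask(&tlb_ubc->cpumask, flush_tlb_batch_ipi, NULL, true);
	cpumask_clear(&tlb_ubc->cpumask);
	tlb_ubc->flush_required = false;
}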

I strongly suspect that your numbers will become even better with such a variant.

Thanks,

	Ingo


* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 17:45 ` [PATCH 0/3] TLB flush multiple pages per IPI v5 Ingo Molnar
@ 2015-06-08 18:21   ` Dave Hansen
  2015-06-08 19:52     ` Ingo Molnar
  2015-06-09  8:47   ` Mel Gorman
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2015-06-08 18:21 UTC (permalink / raw)
  To: Ingo Molnar, Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Andi Kleen, H Peter Anvin, Linux-MM, LKML, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner

On 06/08/2015 10:45 AM, Ingo Molnar wrote:
> As per my measurements the __flush_tlb_single() primitive (which you use in patch
> #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for 2
> pages and gets exponentially worse. It's probably done in microcode and its 
> performance is horrible.

I discussed this a bit in commit a5102476a2.  I'd be curious what
numbers you came up with.

But, don't we have to take in to account the cost of refilling the TLB
in addition to the cost of emptying it?  The TLB size is historically
increasing on a per-core basis, so isn't this refill cost only going to
get worse?


* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 18:21   ` Dave Hansen
@ 2015-06-08 19:52     ` Ingo Molnar
  2015-06-08 20:03       ` Ingo Molnar
  2015-06-08 21:07       ` Dave Hansen
  0 siblings, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-08 19:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins,
	Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 06/08/2015 10:45 AM, Ingo Molnar wrote:
> > As per my measurements the __flush_tlb_single() primitive (which you use in patch
> > #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for 2
> > pages and gets exponentially worse. It's probably done in microcode and its 
> > performance is horrible.
> 
> I discussed this a bit in commit a5102476a2.  I'd be curious what
> numbers you came up with.

... which for those of us who don't have sha1's cached in their brain is:

  a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)")

;-)

So what I measured agrees generally with the comment you added in the commit:

 + * Each single flush is about 100 ns, so this caps the maximum overhead at
 + * _about_ 3,000 ns.

Let that sink through: 3,000 nsecs = 3 usecs, that's like eternity!

A CR3 driven TLB flush takes less time than a single INVLPG (!):

   [    0.389028] x86/fpu: Cost of: __flush_tlb()               fn            :    96 cycles
   [    0.405885] x86/fpu: Cost of: __flush_tlb_one()           fn            :   260 cycles
   [    0.414302] x86/fpu: Cost of: __flush_tlb_range()         fn            :   404 cycles

it's true that a full flush has hidden costs not measured above, because it has 
knock-on effects (because it drops non-global TLB entries), but it's not _that_ 
bad due to:

  - there almost always being a L1 or L2 cache miss when a TLB miss occurs,
    which latency can be overlaid

  - global bit being held for kernel entries

  - user-space with high memory pressure thrashing through TLBs typically

... and especially with caches and Intel's historically phenomenally low TLB 
refill latency it's difficult to measure the effects of local TLB refills, let 
alone measure it in any macro benchmark.

Cross-CPU flushes are expensive, absolutely no argument about that - my suggestion 
here is to keep the batching but simplify it: because I strongly suspect that the 
biggest win is the batching, not the pfn queueing.

We might even win a bit more performance due to the simplification.

> But, don't we have to take in to account the cost of refilling the TLB in 
> addition to the cost of emptying it?  The TLB size is historically increasing on 
> a per-core basis, so isn't this refill cost only going to get worse?

Only if TLB refill latency sucks - but Intel's is very good and AMD's is pretty 
good as well.

Also, usually if you miss the TLB you miss the cache line as well (you definitely 
miss the L1 cache, and TLB caches are sized to hold a fair chunk of your L2 
cache), and the CPU can overlap the two latencies.

So while it might sound counter-intuitive, a full TLB flush might be faster than 
trying to do software based TLB cache management ...

INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as 
clear-cut as it might look.

Thanks,

	Ingo


* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 19:52     ` Ingo Molnar
@ 2015-06-08 20:03       ` Ingo Molnar
  2015-06-08 21:07       ` Dave Hansen
  1 sibling, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-08 20:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins,
	Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Ingo Molnar <mingo@kernel.org> wrote:

> So what I measured agrees generally with the comment you added in the commit:
> 
>  + * Each single flush is about 100 ns, so this caps the maximum overhead at
>  + * _about_ 3,000 ns.
> 
> Let that sink through: 3,000 nsecs = 3 usecs, that's like eternity!
> 
> A CR3 driven TLB flush takes less time than a single INVLPG (!):
> 
>    [    0.389028] x86/fpu: Cost of: __flush_tlb()               fn            :    96 cycles
>    [    0.405885] x86/fpu: Cost of: __flush_tlb_one()           fn            :   260 cycles
>    [    0.414302] x86/fpu: Cost of: __flush_tlb_range()         fn            :   404 cycles
> 
> it's true that a full flush has hidden costs not measured above, because it has 
> knock-on effects (because it drops non-global TLB entries), but it's not _that_ 
> bad due to:
> 
>   - there almost always being a L1 or L2 cache miss when a TLB miss occurs,
>     which latency can be overlaid
> 
>   - global bit being held for kernel entries
> 
>   - user-space with high memory pressure thrashing through TLBs typically

I also have cache-cold numbers from another (Intel) system:

[    0.176473] x86/bench:##########################################################################
[    0.185656] x86/bench: Running x86 benchmarks:                     cache-    hot /   cold cycles
[    1.234448] x86/bench: Cost of: null                                    :     35 /     73 cycles
[    ........]
[   27.930451] x86/bench:########  MM instructions:          ######################################
[   28.979251] x86/bench: Cost of: __flush_tlb()             fn            :    251 /    366 cycles
[   30.028795] x86/bench: Cost of: __flush_tlb_global()      fn            :    746 /   1795 cycles
[   31.077862] x86/bench: Cost of: __flush_tlb_one()         fn            :    237 /    883 cycles
[   32.127371] x86/bench: Cost of: __flush_tlb_range()       fn            :    312 /   1603 cycles
[   35.254202] x86/bench: Cost of: wbinvd()                  insn          : 2491761 / 2491922 cycles

Note how the numbers are even worse in the cache-cold case: the algorithmic 
complexity of __flush_tlb_range() versus __flush_tlb() makes it run slower 
(because we miss the I$), while the TLB cache-preservation argument is probably 
weaker, because when we are cache cold then TLB refill latency probably matters 
less (as it can be overlapped).

So __flush_tlb_range() is software trying to beat hardware, and that's almost 
always a bad idea on x86.

Thanks,

	Ingo


* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 19:52     ` Ingo Molnar
  2015-06-08 20:03       ` Ingo Molnar
@ 2015-06-08 21:07       ` Dave Hansen
  2015-06-08 21:50         ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2015-06-08 21:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins,
	Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On 06/08/2015 12:52 PM, Ingo Molnar wrote:
> A CR3 driven TLB flush takes less time than a single INVLPG (!):
> 
>    [    0.389028] x86/fpu: Cost of: __flush_tlb()               fn            :    96 cycles
>    [    0.405885] x86/fpu: Cost of: __flush_tlb_one()           fn            :   260 cycles
>    [    0.414302] x86/fpu: Cost of: __flush_tlb_range()         fn            :   404 cycles

How was that measured, btw?  Are these instructions running in a loop?
Does __flush_tlb_one() include the tracepoint?

(From the commit I referenced) This was (probably) using a different
method than you did, but "FULL" below is __flush_tlb() while "1" is
__flush_tlb_one().  The "cycles" includes some overhead from the tracing:

>       FULL:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
>          1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895

So it looks like we've got some discrepancy, either from the test
methodology or the CPU.  All of the code and my methodology are in the
commit.  Could you share yours?

> it's true that a full flush has hidden costs not measured above, because it has 
> knock-on effects (because it drops non-global TLB entries), but it's not _that_ 
> bad due to:
> 
>   - there almost always being a L1 or L2 cache miss when a TLB miss occurs,
>     which latency can be overlaid
> 
>   - global bit being held for kernel entries
> 
>   - user-space with high memory pressure thrashing through TLBs typically
> 
> ... and especially with caches and Intel's historically phenomenally low TLB 
> refill latency it's difficult to measure the effects of local TLB refills, let 
> alone measure it in any macro benchmark.

All that you're saying there is that you need to consider how TLB misses
act in _practice_ and not just measure worst-case or theoretical TLB
miss cost.  I completely agree with that.

...
> INVLPG really sucks. I can be convinced by numbers, but this isn't nearly as 
> clear-cut as it might look.

It's clear as mud!

I'd be very interested to see any numbers for how this affects real
workloads.  I've been unable to find anything that was measurably
affected by invlpg vs a full flush.


* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 21:07       ` Dave Hansen
@ 2015-06-08 21:50         ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-08 21:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins,
	Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 06/08/2015 12:52 PM, Ingo Molnar wrote:
> > A CR3 driven TLB flush takes less time than a single INVLPG (!):
> > 
> >    [    0.389028] x86/fpu: Cost of: __flush_tlb()               fn            :    96 cycles
> >    [    0.405885] x86/fpu: Cost of: __flush_tlb_one()           fn            :   260 cycles
> >    [    0.414302] x86/fpu: Cost of: __flush_tlb_range()         fn            :   404 cycles
> 
> How was that measured, btw?  Are these instructions running in a loop?

Yes - see the x86 benchmarking patch in the big FPU submission for an earlier 
version.
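
Conceptually it boils down to a tight loop around the primitive, something like
this simplified sketch (illustrative only, not the code from that patch; it
leaves out the IRQ disabling, warm-up and averaging the real measurement does):

#include <linux/kernel.h>
#include <asm/timex.h>		/* get_cycles() */
#include <asm/tlbflush.h>	/* __flush_tlb_one() */

static void measure_flush_tlb_one(unsigned long addr)
{
	cycles_t t0, t1;
	int i;

	t0 = get_cycles();
	for (i = 0; i < 1000; i++)
		__flush_tlb_one(addr);
	t1 = get_cycles();

	pr_info("__flush_tlb_one() fn: %llu cycles\n",
		(unsigned long long)(t1 - t0) / 1000);
}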

> Does __flush_tlb_one() include the tracepoint?

No tracing overhead.

> (From the commit I referenced) This was (probably) using a different method than 
> you did, but "FULL" below is __flush_tlb() while "1" is __flush_tlb_one().  The 
> "cycles" includes some overhead from the tracing:
> 
> >       FULL:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
> >          1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
> 
> So it looks like we've got some discrepancy, either from the test methodology or 
> the CPU.  All of the code and my methodology are in the commit.  Could you share 
> yours?

Yes, you can reproduce it by applying this patch from the FPU series:

  Subject: [PATCH 207/208] x86/fpu: Add FPU performance measurement subsystem

(you were Cc:-ed to it, so it should be in your inbox.)

I've got a more advanced version meanwhile, will post it in the next couple of 
days or so.

> > it's true that a full flush has hidden costs not measured above, because it has 
> > knock-on effects (because it drops non-global TLB entries), but it's not _that_ 
> > bad due to:
> > 
> >   - there almost always being a L1 or L2 cache miss when a TLB miss occurs,
> >     which latency can be overlaid
> > 
> >   - global bit being held for kernel entries
> > 
> >   - user-space with high memory pressure thrashing through TLBs typically
> > 
> > ... and especially with caches and Intel's historically phenomenally low TLB 
> > refill latency it's difficult to measure the effects of local TLB refills, let 
> > alone measure it in any macro benchmark.
> 
> All that you're saying there is that you need to consider how TLB misses act in 
> _practice_ and not just measure worst-case or theoretical TLB miss cost.  I 
> completely agree with that.

So I'm saying considerably more than that: I consider it likely that a full TLB 
flush is not nearly as costly as assumed, for the three reasons outlined above.

It might even be a performance win in Mel's benchmark - although possibly not 
measurable within measurement noise levels.

Thanks,

	Ingo


* Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-06-08 12:50 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
@ 2015-06-08 22:38   ` Andrew Morton
  2015-06-09 11:07     ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2015-06-08 22:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML

On Mon,  8 Jun 2015 13:50:53 +0100 Mel Gorman <mgorman@suse.de> wrote:

> An IPI is sent to flush remote TLBs when a page is unmapped that was
> recently accessed by other CPUs. There are many circumstances where this
> happens but the obvious one is kswapd reclaiming pages belonging to a
> running process as kswapd and the task are likely running on separate CPUs.
> 
> On small machines, this is not a significant problem but as machines get
> larger with more cores and more memory, the cost of these IPIs can
> be high. This patch uses a structure similar in principle to a pagevec
> to collect a list of PFNs and CPUs that require flushing. It then sends
> one IPI per CPU that was mapping any of those pages to flush the list of
> PFNs. A new TLB flush helper is required for this and one is added for
> x86. Other architectures will need to decide if batching like this is both
> safe and worth the memory overhead. Specifically, the requirement is:
> 
> 	If a clean page is unmapped and not immediately flushed, the
> 	architecture must guarantee that a write to that page from a CPU
> 	with a cached TLB entry will trap a page fault.
> 
> This is essentially what the kernel already depends on but the window is
> much larger with this patch applied and is worth highlighting.
> 
> The impact of this patch depends on the workload as measuring any benefit
> requires both mapped pages co-located on the LRU and memory pressure. The
> case with the biggest impact is multiple processes reading mapped pages
> taken from the vm-scalability test suite. The test case uses NR_CPU readers
> of mapped files that consume 10*RAM.
> 
> vmscale on a 4-node machine with 64G RAM and 48 CPUs
>            4.1.0-rc6     4.1.0-rc6
>              vanilla batchunmap-v5
> User          577.35        618.60
> System       5927.06       4195.03
> Elapsed       162.21        121.31
> 
> The workload completed 25% faster with 29% less CPU time.
> 
> From vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> times per second during the steady phase of the test and the patched kernel
> was interrupted 180K times per second.
> 
> The impact is much lower on a small machine
> 
> vmscale on a 1-node machine with 8G RAM and 1 CPU
>            4.1.0-rc6     4.1.0-rc6
>              vanilla batchunmap-v5
> User           59.14         58.86
> System        109.15         83.78
> Elapsed        27.32         23.14
> 
> It's still a noticeable improvement with vmstat showing interrupts went
> from roughly 500K per second to 45K per second.

Looks nice.

> The patch will have no impact on workloads that have no memory pressure or
> relatively few mapped pages.

What benefit can we expect to see to any real-world userspace?

> --- a/include/linux/init_task.h
> +++ b/include/linux/init_task.h
> @@ -175,6 +175,13 @@ extern struct task_group root_task_group;
>  # define INIT_NUMA_BALANCING(tsk)
>  #endif
>  
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
> +	.tlb_ubc = NULL,
> +#else
> +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
> +#endif
> +
>  #ifdef CONFIG_KASAN
>  # define INIT_KASAN(tsk)						\
>  	.kasan_depth = 1,
> @@ -257,6 +264,7 @@ extern struct task_group root_task_group;
>  	INIT_RT_MUTEXES(tsk)						\
>  	INIT_VTIME(tsk)							\
>  	INIT_NUMA_BALANCING(tsk)					\
> +	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
>  	INIT_KASAN(tsk)							\
>  }

We don't really need any of this - init_task starts up all-zero anyway.
Maybe it's useful for documentation reasons (dubious), but I bet we've
already missed fields.

>
> ...
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1289,6 +1289,16 @@ enum perf_event_task_context {
>  	perf_nr_task_contexts,
>  };
>  
> +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
> +#define BATCH_TLBFLUSH_SIZE 32UL
> +
> +/* Track pages that require TLB flushes */
> +struct tlbflush_unmap_batch {
> +	struct cpumask cpumask;
> +	unsigned long nr_pages;
> +	unsigned long pfns[BATCH_TLBFLUSH_SIZE];

Why are we storing pfn's rather than page*'s?

I'm trying to get my head around what's actually in this structure.

Each thread has one of these, lazily allocated <under circumstances>. 
The cpumask field contains a mask of all the CPUs which have done
<something>.  The careful reader will find mm_struct.cpu_vm_mask_var
and will wonder why it didn't need documenting, sigh.

Wanna fill in the blanks?  As usual, understanding the data structure
is the key to understanding the design, so it's worth a couple of
paragraphs.  With this knowledge, the reader may understand why
try_to_unmap_flush() has preempt_disable() protection but
set_tlb_ubc_flush_pending() doesn't need it!

> +};
> +
>  struct task_struct {
>  	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
>  	void *stack;
>
> ...
>
> @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
>  	return address;
>  }
>  
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +static void percpu_flush_tlb_batch_pages(void *data)
> +{
> +	struct tlbflush_unmap_batch *tlb_ubc = data;
> +	int i;

Anally speaking, this should be unsigned long (in which case it
shouldn't be called `i'!).  Or make tlbflush_unmap_batch.nr_pages int. 
But int is signed, which is silly.  Whatever ;)


> +	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> +	for (i = 0; i < tlb_ubc->nr_pages; i++)
> +		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
> +}
> +
> +/*
> + * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> + * important that if a PTE was dirty when it was unmapped that it's flushed

s/important that/important /

> + * before any IO is initiated on the page to prevent lost writes. Similarly,
> + * it must be flushed before freeing to prevent data leakage.
> + */
> +void try_to_unmap_flush(void)
> +{
> +	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
> +	int cpu;
> +
> +	if (!tlb_ubc || !tlb_ubc->nr_pages)
> +		return;
> +
> +	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
> +
> +	preempt_disable();
> +	cpu = smp_processor_id();

get_cpu()

> +	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
> +		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
> +
> +	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
> +		smp_call_function_many(&tlb_ubc->cpumask,
> +			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
> +	}
> +	cpumask_clear(&tlb_ubc->cpumask);
> +	tlb_ubc->nr_pages = 0;
> +	preempt_enable();

put_cpu()

> +}
>
> ...
>
> +/*
> + * Returns true if the TLB flush should be deferred to the end of a batch of
> + * unmap operations to reduce IPIs.
> + */
> +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> +{
> +	bool should_defer = false;
> +
> +	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
> +		return false;
> +
> +	/* If remote CPUs need to be flushed then defer batch the flush */
> +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> +		should_defer = true;
> +	put_cpu();

What did the get_cpu() protect?

> +	return should_defer;
> +}
> +#else
> +static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
> +		struct page *page)
> +{
> +}
> +
>
> ...
>
> +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> +/*
> + * Allocate the control structure for batch TLB flushing. An allocation
> + * failure is harmless as the reclaimer will send IPIs where necessary.
> + */
> +static void alloc_tlb_ubc(void)
> +{
> +	if (!current->tlb_ubc)
> +		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
> +						GFP_KERNEL | __GFP_NOWARN);

A GFP_KERNEL allocation from deep within page reclaim?  Seems imprudent
if only from a stack-usage POV.

> +}
> +#else
> +static inline void alloc_tlb_ubc(void)
> +{
> +}
> +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
>
> ...
>



* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-08 17:45 ` [PATCH 0/3] TLB flush multiple pages per IPI v5 Ingo Molnar
  2015-06-08 18:21   ` Dave Hansen
@ 2015-06-09  8:47   ` Mel Gorman
  2015-06-09 10:32     ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-09  8:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On Mon, Jun 08, 2015 at 07:45:51PM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > Changelog since V4
> > o Rebase to 4.1-rc6
> > 
> > Changelog since V3
> > o Drop batching of TLB flush from migration
> > o Redo how larger batching is managed
> > o Batch TLB flushes when writable entries exist
> > 
> > When unmapping pages it is necessary to flush the TLB. If that page was
> > accessed by another CPU then an IPI is used to flush the remote CPU. That
> > is a lot of IPIs if kswapd is scanning and unmapping >100K pages per second.
> > 
> > There already is a window between when a page is unmapped and when it is
> > TLB flushed. This series simply increases the window so multiple pages
> > can be flushed using a single IPI. This *should* be safe or the kernel is
> > hosed already but I've cc'd the x86 maintainers and some of the Intel folk
> > for comment.
> > 
> > Patch 1 simply made the rest of the series easier to write as ftrace
> > 	could identify all the senders of TLB flush IPIs.
> > 
> > Patch 2 collects a list of PFNs and sends one IPI to flush them all
> > 
> > Patch 3 tracks when there potentially are writable TLB entries that
> > 	need to be batched differently
> > 
> > The performance impact is documented in the changelogs but in the optimistic
> > case on a 4-socket machine the full series reduces interrupts from 900K
> > interrupts/second to 60K interrupts/second.
> 
> Yeah, so I think batching writable flushes is useful I think, but I disagree with 
> one aspect of it: with the need to gather _pfns_ and batch them over to the remote 
> CPU.
> 

It's PFN-based for three reasons. The first is because the old code
is flushing on a per-page basis and the intent of the series was to
reduce the number of IPIs it requires. Moving away from that has an
unpredictable impact that depends on the workload and the exact CPU
used.  The second is that a struct page-based interface would require
percpu_flush_tlb_batch_pages to do a page->pfn lookup in the IPI call which
is more expensive than necessary. The final reason is that the TLB flush
API given to architectures at the moment includes single page primitives
and while it's not necessarily the best decision for x86, the same may
not be true for other architectures if they decide to activate the batching.

> As per my measurements the __flush_tlb_single() primitive (which you use in patch
> #2) is very expensive on most Intel and AMD CPUs. It barely makes sense for 2
> pages and gets exponentially worse. It's probably done in microcode and its 
> performance is horrible.
> 
> So have you explored the possibility to significantly simplify your patch-set by 
> only deferring the flushing, and doing a simple TLB flush on the remote CPU?

Yes. At one point I looked at range flushing but it is not a good idea.
The ranges that reach the end of the LRU are too large to be useful except
in the ideal case of a workload that sequentially accesses memory. Flushing
the full TLB has an unpredictable cost. Currently, the unmapping and flush
is of inactive pages, some of which may not be in the TLB at all and the
impact on the workload is limited to the IPI and flush cost.

With a full flush we clear entries we know were recently accessed and
may have to be looked up again and we do this every 32 mapped pages that
are reclaimed. In the ideal case of a sequential mapped reader it would
not matter as the entries are not needed so we would not see the cost at
all. Other workloads will have to do a refill that was not necessary before
this series. The cost of the refill will depend on the CPU and whether the
lookup information is still in the CPU cache or not. That means measuring
the full impact of your proposal is impossible as it depends heavily on
the workload, the timing of its interaction with kswapd in particular,
the state of the CPU cache and the cost of refills for the CPU.

I agree with you in that it would be a simpler series and the actual
flush would probably be faster but the downsides are too unpredictable
for a series that primarily is about reducing the number of IPIs.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09  8:47   ` Mel Gorman
@ 2015-06-09 10:32     ` Ingo Molnar
  2015-06-09 11:20       ` Mel Gorman
  0 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2015-06-09 10:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Mel Gorman <mgorman@suse.de> wrote:

> > So have you explored the possibility to significantly simplify your patch-set 
> > by only deferring the flushing, and doing a simple TLB flush on the remote 
> > CPU?
> 
> Yes. At one point I looked at range flushing but it is not a good idea.

My suggestion wasn't range-flushing, but a simple all-or-nothing batched flush of 
user-space TLBs.

> The ranges that reach the end of the LRU are too large to be useful except in 
> the ideal case of a workload that sequentially accesses memory. Flushing the 
> full TLB has an unpredictable cost. [...]

Why would it have unpredictable cost? We flush the TLB on every process context 
switch. Yes, it's somewhat workload dependent, but the performance profile is so 
different anyway with batching that it has to be re-measured anyway.

> With a full flush we clear entries we know were recently accessed and may have 
> to be looked up again and we do this every 32 mapped pages that are reclaimed. 
> In the ideal case of a sequential mapped reader it would not matter as the 
> entries are not needed so we would not see the cost at all. Other workloads will 
> have to do a refill that was not necessary before this series. The cost of the 
> refill will depend on the CPU and whether the lookup information is still in the 
> CPU cache or not. That means measuring the full impact of your proposal is 
> impossible as it depends heavily on the workload, the timing of its interaction 
> with kswapd in particular, the state of the CPU cache and the cost of refills 
> for the CPU.
>
> I agree with you in that it would be a simplier series and the actual flush 
> would probably be faster but the downsides are too unpredictable for a series 
> that primarily is about reducing the number of IPIs.

Sorry, I don't buy this, at all.

Please measure this, the code would become a lot simpler, as I'm not convinced 
that we need pfn (or struct page) or even range based flushing.

I.e. please first implement the simplest remote batching variant, then complicate 
it if the numbers warrant it. Not the other way around. It's not like the VM code 
needs the extra complexity!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped
  2015-06-08 22:38   ` Andrew Morton
@ 2015-06-09 11:07     ` Mel Gorman
  0 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-09 11:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Hugh Dickins, Minchan Kim, Dave Hansen, Andi Kleen,
	H Peter Anvin, Ingo Molnar, Linux-MM, LKML

On Mon, Jun 08, 2015 at 03:38:13PM -0700, Andrew Morton wrote:
> On Mon,  8 Jun 2015 13:50:53 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > An IPI is sent to flush remote TLBs when a page is unmapped that was
> > recently accessed by other CPUs. There are many circumstances where this
> > happens but the obvious one is kswapd reclaiming pages belonging to a
> > running process as kswapd and the task are likely running on separate CPUs.
> > 
> > On small machines, this is not a significant problem but as machine
> > gets larger with more cores and more memory, the cost of these IPIs can
> > be high. This patch uses a structure similar in principle to a pagevec
> > to collect a list of PFNs and CPUs that require flushing. It then sends
> > one IPI per CPU that was mapping any of those pages to flush the list of
> > PFNs. A new TLB flush helper is required for this and one is added for
> > x86. Other architectures will need to decide if batching like this is both
> > safe and worth the memory overhead. Specifically the requirement is;
> > 
> > 	If a clean page is unmapped and not immediately flushed, the
> > 	architecture must guarantee that a write to that page from a CPU
> > 	with a cached TLB entry will trap a page fault.
> > 
> > This is essentially what the kernel already depends on but the window is
> > much larger with this patch applied and is worth highlighting.
> > 
> > The impact of this patch depends on the workload as measuring any benefit
> > requires both mapped pages co-located on the LRU and memory pressure. The
> > case with the biggest impact is multiple processes reading mapped pages
> > taken from the vm-scalability test suite. The test case uses NR_CPU readers
> > of mapped files that consume 10*RAM.
> > 
> > vmscale on a 4-node machine with 64G RAM and 48 CPUs
> >            4.1.0-rc6     4.1.0-rc6
> >              vanilla batchunmap-v5
> > User          577.35        618.60
> > System       5927.06       4195.03
> > Elapsed       162.21        121.31
> > 
> > The workload completed 25% faster with 29% less CPU time.
> > 
> > This is showing that the readers completed 25% with 30% less CPU time. From
> > vmstats, it is known that the vanilla kernel was interrupted roughly 900K
> > times per second during the steady phase of the test and the patched kernel
> > was interrupts 180K times per second.
> > 
> > The impact is much lower on a small machine
> > 
> > vmscale on a 1-node machine with 8G RAM and 1 CPU
> >            4.1.0-rc6     4.1.0-rc6
> >              vanilla batchunmap-v5
> > User           59.14         58.86
> > System        109.15         83.78
> > Elapsed        27.32         23.14
> > 
> > It's still a noticeable improvement with vmstat showing interrupts went
> > from roughly 500K per second to 45K per second.
> 
> Looks nice.
> 
> > The patch will have no impact on workloads with no memory pressure or
> > have relatively few mapped pages.
> 
> What benefit can we expect to see to any real-world userspace?
> 

Only a small subset of workloads will see a benefit -- ones that mmap a
lot of data with working sets larger than the size of a node. Some
streaming media servers allegedly do this.

Some numerical processing applications may hit this. glibc uses mmap for
large buffer allocations, and if the application's footprint is larger than
a NUMA node, those mappings will need to be unmapped and flushed.
Python/NumPy uses large mappings for large buffers (based on the paper
"Doubling the Performance of Python/Numpy with less than 100 SLOC").
Whether users of NumPy hit this issue or not depends on whether kswapd
is active.

Anecdotally, I'm aware from IRC of one user experimenting with a large
HTTP cache and a load generator; the profile made it obvious that a lot
of time was being spent sending IPIs. I asked weeks ago that they post
the results here, which they promised to do but didn't. Unfortunately,
I don't know the person's real name to cc them. Rik might.

Anecdotally, I also believe that Intel hit this with an internal workload,
but I'm basing that on idle comments at LSF/MM. However, they were unwilling
or unable to describe exactly what the test does and what software it runs
against.

I know this is more vague than you'd like. By and large, I'm relying on
the assumption that if we are reclaiming mapped pages from kswapd context
then sending one IPI per page is stupid.

> > --- a/include/linux/init_task.h
> > +++ b/include/linux/init_task.h
> > @@ -175,6 +175,13 @@ extern struct task_group root_task_group;
> >  # define INIT_NUMA_BALANCING(tsk)
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
> > +	.tlb_ubc = NULL,
> > +#else
> > +# define INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)
> > +#endif
> > +
> >  #ifdef CONFIG_KASAN
> >  # define INIT_KASAN(tsk)						\
> >  	.kasan_depth = 1,
> > @@ -257,6 +264,7 @@ extern struct task_group root_task_group;
> >  	INIT_RT_MUTEXES(tsk)						\
> >  	INIT_VTIME(tsk)							\
> >  	INIT_NUMA_BALANCING(tsk)					\
> > +	INIT_TLBFLUSH_UNMAP_BATCH_CONTROL(tsk)				\
> >  	INIT_KASAN(tsk)							\
> >  }
> 
> We don't really need any of this - init_task starts up all-zero anyway.
> Maybe it's useful for documentation reasons (dubious), but I bet we've
> already missed fields.
> 

True. I'll look into removing it and make sure there is no adverse
impact.

> >
> > ...
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1289,6 +1289,16 @@ enum perf_event_task_context {
> >  	perf_nr_task_contexts,
> >  };
> >  
> > +/* Matches SWAP_CLUSTER_MAX but refined to limit header dependencies */
> > +#define BATCH_TLBFLUSH_SIZE 32UL
> > +
> > +/* Track pages that require TLB flushes */
> > +struct tlbflush_unmap_batch {
> > +	struct cpumask cpumask;
> > +	unsigned long nr_pages;
> > +	unsigned long pfns[BATCH_TLBFLUSH_SIZE];
> 
> Why are we storing pfn's rather than page*'s?
> 

Because a struct page would require a page_to_pfn() lookup within the IPI
handler. That would work but it's unnecessarily wasteful.

> I'm trying to get my head around what's actually in this structure.
> 
> Each thread has one of these, lazily allocated <under circumstances>. 
> The cpumask field contains a mask of all the CPUs which have done
> <something>.  The careful reader will find mm_struct.cpu_vm_mask_var
> and will wonder why it didn't need documenting, sigh.
> 
> Wanna fill in the blanks?  As usual, understanding the data structure
> is the key to understanding the design, so it's worth a couple of
> paragraphs.  With this knowledge, the reader may understand why
> try_to_unmap_flush() has preempt_disable() protection but
> set_tlb_ubc_flush_pending() doesn't need it!
> 

Is this any help?

struct tlbflush_unmap_batch {
        /*
         * Each bit set is a CPU that potentially has a TLB entry for one of
         * the PFNs being flushed. See set_tlb_ubc_flush_pending().
         */
        struct cpumask cpumask;

        /*
         * The number and list of pfns to be flushed. PFNs are tracked instead
         * of struct pages to avoid multiple page_to_pfn() lookups by each CPU
         * that receives an IPI in percpu_flush_tlb_batch_pages().
         */
        unsigned long nr_pages;
        unsigned long pfns[BATCH_TLBFLUSH_SIZE];

        /*
         * If true then the PTE was dirty when unmapped. The entry must be
         * flushed before IO is initiated or a stale TLB entry potentially
         * allows an update without redirtying the page.
         */
        bool writable;
};
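
For further illustration, here is a sketch of how the batch gets filled.
This is not the exact code from patch 2 -- the body below is an assumption
for illustration only -- but it shows why set_tlb_ubc_flush_pending() needs
no preempt protection: it never reads the current CPU id, it only copies
mm_cpumask() into the batch, whereas try_to_unmap_flush() reads the CPU id
to decide whether a local flush is needed.

static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
		struct page *page)
{
	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;

	if (!tlb_ubc)
		return;

	/* Remember every CPU that may be caching a TLB entry for this mm */
	cpumask_or(&tlb_ubc->cpumask, &tlb_ubc->cpumask, mm_cpumask(mm));

	/* Record the PFN; the IPI handler shifts it back into an address */
	tlb_ubc->pfns[tlb_ubc->nr_pages] = page_to_pfn(page);
	tlb_ubc->nr_pages++;
}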

> > +};
> > +
> >  struct task_struct {
> >  	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
> >  	void *stack;
> >
> > ...
> >
> > @@ -581,6 +583,90 @@ vma_address(struct page *page, struct vm_area_struct *vma)
> >  	return address;
> >  }
> >  
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +static void percpu_flush_tlb_batch_pages(void *data)
> > +{
> > +	struct tlbflush_unmap_batch *tlb_ubc = data;
> > +	int i;
> 
> Anally speaking, this should be unsigned long (in which case it
> shouldn't be called `i'!).  Or make tlbflush_unmap_batch.nr_pages int. 
> But int is signed, which is silly.  Whatever ;)
> 

I can make it unsigned int which is a micro-optimisation for loop iterators
anyway.

> 
> > +	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> > +	for (i = 0; i < tlb_ubc->nr_pages; i++)
> > +		flush_local_tlb_addr(tlb_ubc->pfns[i] << PAGE_SHIFT);
> > +}
> > +
> > +/*
> > + * Flush TLB entries for recently unmapped pages from remote CPUs. It is
> > + * important that if a PTE was dirty when it was unmapped that it's flushed
> 
> s/important that/important /
> 

Fixed.

> > + * before any IO is initiated on the page to prevent lost writes. Similarly,
> > + * it must be flushed before freeing to prevent data leakage.
> > + */
> > +void try_to_unmap_flush(void)
> > +{
> > +	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
> > +	int cpu;
> > +
> > +	if (!tlb_ubc || !tlb_ubc->nr_pages)
> > +		return;
> > +
> > +	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);
> > +
> > +	preempt_disable();
> > +	cpu = smp_processor_id();
> 
> get_cpu()
> 
> > +	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
> > +		percpu_flush_tlb_batch_pages(&tlb_ubc->cpumask);
> > +
> > +	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
> > +		smp_call_function_many(&tlb_ubc->cpumask,
> > +			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
> > +	}
> > +	cpumask_clear(&tlb_ubc->cpumask);
> > +	tlb_ubc->nr_pages = 0;
> > +	preempt_enable();
> 
> put_cpu()
> 

Gack. Fixed.
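
For reference, a sketch of how the hunk might read with the get_cpu() /
put_cpu() suggestion applied. Note it also passes tlb_ubc itself to the
local percpu_flush_tlb_batch_pages() call rather than &tlb_ubc->cpumask,
which appears to be what was intended given the handler's argument:

void try_to_unmap_flush(void)
{
	struct tlbflush_unmap_batch *tlb_ubc = current->tlb_ubc;
	int cpu;

	if (!tlb_ubc || !tlb_ubc->nr_pages)
		return;

	trace_tlb_flush(TLB_REMOTE_SHOOTDOWN, tlb_ubc->nr_pages);

	cpu = get_cpu();

	/* Flush the local TLB directly if this CPU is in the mask */
	if (cpumask_test_cpu(cpu, &tlb_ubc->cpumask))
		percpu_flush_tlb_batch_pages(tlb_ubc);

	/* One IPI flushes the whole batch on every other CPU in the mask */
	if (cpumask_any_but(&tlb_ubc->cpumask, cpu) < nr_cpu_ids) {
		smp_call_function_many(&tlb_ubc->cpumask,
			percpu_flush_tlb_batch_pages, (void *)tlb_ubc, true);
	}
	cpumask_clear(&tlb_ubc->cpumask);
	tlb_ubc->nr_pages = 0;
	put_cpu();
}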

> > +}
> >
> > ...
> >
> > +/*
> > + * Returns true if the TLB flush should be deferred to the end of a batch of
> > + * unmap operations to reduce IPIs.
> > + */
> > +static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
> > +{
> > +	bool should_defer = false;
> > +
> > +	if (!current->tlb_ubc || !(flags & TTU_BATCH_FLUSH))
> > +		return false;
> > +
> > +	/* If remote CPUs need to be flushed then defer batch the flush */
> > +	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
> > +		should_defer = true;
> > +	put_cpu();
> 
> What did the get_cpu() protect?
> 

To safely look up the current CPU number. smp_processor_id() is only safe
in specific circumstances (preemption disabled), and I did not think
raw_smp_processor_id() was justified here.

> > +	return should_defer;
> > +}
> > +#else
> > +static void set_tlb_ubc_flush_pending(struct mm_struct *mm,
> > +		struct page *page)
> > +{
> > +}
> > +
> >
> > ...
> >
> > +#ifdef CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH
> > +/*
> > + * Allocate the control structure for batch TLB flushing. An allocation
> > + * failure is harmless as the reclaimer will send IPIs where necessary.
> > + */
> > +static void alloc_tlb_ubc(void)
> > +{
> > +	if (!current->tlb_ubc)
> > +		current->tlb_ubc = kzalloc(sizeof(struct tlbflush_unmap_batch),
> > +						GFP_KERNEL | __GFP_NOWARN);
> 
> A GFP_KERNEL allocation from deep within page reclaim?  Seems imprudent
> if only from a stack-usage POV.
> 

When we call it, we are in PF_MEMALLOC context, either set in the page
allocator from direct reclaim or because kswapd always sets it, so the
allocation cannot recurse into reclaim and the stack depth is limited. It
would be comparable to the depth we reach during normal reclaim calling into
shrink_page_list and everything below it, so we should be safe. Granted, the
dependency on PF_MEMALLOC is not obvious at all.

> > +}
> > +#else
> > +static inline void alloc_tlb_ubc(void)
> > +{
> > +}
> > +#endif /* CONFIG_ARCH_SUPPORTS_LOCAL_TLB_PFN_FLUSH */
> >
> > ...
> >
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 10:32     ` Ingo Molnar
@ 2015-06-09 11:20       ` Mel Gorman
  2015-06-09 12:43         ` Ingo Molnar
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-09 11:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 09, 2015 at 12:32:31PM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > So have you explored the possibility to significantly simplify your patch-set 
> > > by only deferring the flushing, and doing a simple TLB flush on the remote 
> > > CPU?
> > 
> > Yes. At one point I looked at range flushing but it is not a good idea.
> 
> My suggestion wasn't range-flushing, but a simple all-or-nothing batched flush of 
> user-space TLBs.
> 

I'm aware of that. I had considered both range flushing and full flushing
as alternatives to PFN tracking and settled on PFN tracking as the least
risky change.

> > The ranges that reach the end of the LRU are too large to be useful except in 
> > the ideal case of a workload that sequentially accesses memory. Flushing the 
> > full TLB has an unpredictable cost. [...]
> 
> Why would it have unpredictable cost?

Because we have no means of knowing how many active TLB entries are flushed,
no way of knowing whether it matters, and we potentially do this every 32
(BATCH_TLBFLUSH_SIZE) pages that are reclaimed.

> We flush the TLB on every process context 
> switch. Yes, it's somewhat workload dependent, but the performance profile is so 
> different anyway with batching that it has to be re-measured anyway.
> 

With the per-page flush, there is a direct cost associated with the
operation -- the IPI and the TLB flushes. This is easy to measure. With
a full flush there is an indirect cost -- the TLB entries that have to be
refilled after the full flush. It also works against any notion of using
ASID or similar mechanisms that avoid full flushes on context switches.

It will be very easy to show the benefit in the direct case. The indirect
cost is both unpredictable and impossible to measure in full across all
cases.

> > With a full flush we clear entries we know were recently accessed and may have 
> > to be looked up again and we do this every 32 mapped pages that are reclaimed. 
> > In the ideal case of a sequential mapped reader it would not matter as the 
> > entries are not needed so we would not see the cost at all. Other workloads will 
> > have to do a refill that was not necessary before this series. The cost of the 
> > refill will depend on the CPU and whether the lookup information is still in the 
> > CPU cache or not. That means measuring the full impact of your proposal is 
> > impossible as it depends heavily on the workload, the timing of its interaction 
> > with kswapd in particular, the state of the CPU cache and the cost of refills 
> > for the CPU.
> >
> > I agree with you in that it would be a simplier series and the actual flush 
> > would probably be faster but the downsides are too unpredictable for a series 
> > that primarily is about reducing the number of IPIs.
> 
> Sorry, I don't buy this, at all.
> 
> Please measure this, the code would become a lot simpler, as I'm not convinced 
> that we need pfn (or struct page) or even range based flushing.
> 

The code will be simpler and the direct cost of reclaim will be lower, but
that shows nothing about the indirect cost. The mapped reader will benefit
as it is not reusing the TLB entries and will look artificially very good.
It'll be very difficult for even experienced users to determine that a
slowdown during kswapd activity is due to increased TLB misses incurred by
the full flush.

> I.e. please first implement the simplest remote batching variant, then complicate 
> it if the numbers warrant it. Not the other way around. It's not like the VM code 
> needs the extra complexity!
> 

The simplest remote batching variant is a much more drastic change from what
we do today and an unpredictable one. Taking that direction would go against
the notion of making incremental changes. Even if we ultimately ended up
with your proposal, it would make sense to separate it from this series by
at least one release for bisection purposes. That way we get:

Current:     Send one IPI per page to unmap, active TLB entries preserved
This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, active TLB entries preserved
Your proposal: Send one IPI, flush everything, active TLB entries must refill

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 11:20       ` Mel Gorman
@ 2015-06-09 12:43         ` Ingo Molnar
  2015-06-09 13:05           ` Mel Gorman
  2015-06-09 15:34           ` Dave Hansen
  0 siblings, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-09 12:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Mel Gorman <mgorman@suse.de> wrote:

> > Sorry, I don't buy this, at all.
> > 
> > Please measure this, the code would become a lot simpler, as I'm not convinced 
> > that we need pfn (or struct page) or even range based flushing.
> 
> The code will be simplier and the cost of reclaim will be lower and that is the 
> direct case but shows nothing about the indirect cost. The mapped reader will 
> benefit as it is not reusing the TLB entries and will look artifically very 
> good. It'll be very difficult for even experienced users to determine that a 
> slowdown during kswapd activity is due to increased TLB misses incurred by the 
> full flush.

If so then the converse is true just as much: if you were to introduce finegrained 
flushing today, you couldn't justify it because you claim it's very hard to 
measure!

Really, in such cases we should IMHO fall back to the simplest approach, and 
iterate from there.

I cited very real numbers about the direct costs of TLB flushes, and plausible 
speculation about why the indirect costs are low on the architecture you are trying 
to modify here.

I think since it is you who wants to introduce additional complexity into the x86 
MM code the burden is on you to provide proof that the complexity of pfn (or 
struct page) tracking is worth it.

> > I.e. please first implement the simplest remote batching variant, then 
> > complicate it if the numbers warrant it. Not the other way around. It's not 
> > like the VM code needs the extra complexity!
> 
> The simplest remote batching variant is a much more drastic change from what we 
> do today and an unpredictable one. If we were to take that direction, it goes 
> against the notion of making incremental changes. Even if we ultimately ended up 
> with your proposal, it would make sense to separte it from this series by at 
> least one release for bisection purposes. That way we get;
> 
> Current:     Send one IPI per page to unmap, active TLB entries preserved
> This series: Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, active TLB entries preserved
> Your proposal: Send one IPI, flush everything, active TLB entries must refill

Not quite, my take of it is:

  Current:      Simplest method: send one IPI per page to unmap, active TLB 
                entries preserved. Remote TLB flushing cost is so high that it 
                probably moots any secondary effects of TLB preservation.

  This series:  Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap, add complex
                tracking of pfns with expensive flushing, active TLB entries
                preserved. The cost of the more complex flushing is probably
                higher than the refill cost, based on the numbers I gave.

  My proposal:  Send one IPI per BATCH_TLBFLUSH_SIZE pages to unmap that flushes
                everything. TLB entries are not preserved but this is expected
                to be more than offset by the reduction in remote flushing costs
                and the simplicity of the flushing scheme. It can still be
                complicated into your proposed pfn tracking scheme later, based
                on the numbers.

Btw., have you measured the full TLB flush variant as well? If so, mind sharing 
the numbers?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 12:43         ` Ingo Molnar
@ 2015-06-09 13:05           ` Mel Gorman
  2015-06-10  8:51             ` Ingo Molnar
  2015-06-09 15:34           ` Dave Hansen
  1 sibling, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-09 13:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 09, 2015 at 02:43:28PM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > Sorry, I don't buy this, at all.
> > > 
> > > Please measure this, the code would become a lot simpler, as I'm not convinced 
> > > that we need pfn (or struct page) or even range based flushing.
> > 
> > The code will be simplier and the cost of reclaim will be lower and that is the 
> > direct case but shows nothing about the indirect cost. The mapped reader will 
> > benefit as it is not reusing the TLB entries and will look artifically very 
> > good. It'll be very difficult for even experienced users to determine that a 
> > slowdown during kswapd activity is due to increased TLB misses incurred by the 
> > full flush.
> 
> If so then the converse is true just as much: if you were to introduce finegrained 
> flushing today, you couldn't justify it because you claim it's very hard to 
> measure!
> 

I'm claiming the *INDIRECT COST* is impossible to measure as part of this
series because it depends on the workload and exact CPU used. The direct
cost is measurable and can be quantified.

> Really, in such cases we should IMHO fall back to the simplest approach, and 
> iterate from there.
> 
> I cited very real numbers about the direct costs of TLB flushes, and plausible 
> speculation about why the indirect costs are low on the achitecture you are trying 
> to modify here.
> 
> I think since it is you who wants to introduce additional complexity into the x86 
> MM code the burden is on you to provide proof that the complexity of pfn (or 
> struct page) tracking is worth it.
> 

I'm taking a situation whereby IPIs are sent like crazy with interrupt
storms and replacing it with something that is a lot more efficient and
minimises the number of potential surprises. I'm stating that the benefit
of PFN tracking is unknowable in the general case because it depends on the
workload, timing and the exact CPU used, so any example provided can be met
with a counter-example such as a trivial sequential reader that shows no
benefit. The series as posted is approximately in line with current
behaviour, minimising the chances of surprise regressions from excessive
TLB flushing.

You are actively blocking a measurable improvement and forcing it to be
replaced with something whose full impact is unquantifiable. Any regressions
in this area due to increased TLB misses could take several kernel releases
to track down as the issue will be so difficult to detect.

I'm going to implement the approach you are forcing because there is an
x86 part of the patch and you are the maintainer that could indefinitely
NAK it. However, I'm extremely pissed about being forced to introduce
these indirect unpredictable costs because I know the alternative is you
dragging this out for weeks with no satisfactory conclusion in an argument
that I cannot prove in the general case.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 12:43         ` Ingo Molnar
  2015-06-09 13:05           ` Mel Gorman
@ 2015-06-09 15:34           ` Dave Hansen
  2015-06-09 16:49             ` Dave Hansen
  2015-06-21 20:22             ` Kirill A. Shutemov
  1 sibling, 2 replies; 41+ messages in thread
From: Dave Hansen @ 2015-06-09 15:34 UTC (permalink / raw)
  To: Ingo Molnar, Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Andi Kleen, H Peter Anvin, Linux-MM, LKML, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner

On 06/09/2015 05:43 AM, Ingo Molnar wrote:
> I cited very real numbers about the direct costs of TLB flushes, and plausible 
> speculation about why the indirect costs are low on the achitecture you are trying 
> to modify here.

We should be careful about extrapolating the real-world cost of a TLB
flush from the cost of running a kernel function in a loop.

Let's look at what got measured:

> +static char tlb_flush_target[PAGE_SIZE] __aligned(4096);
> +static void fn_flush_tlb_one(void)
> +{
> +	unsigned long addr = (unsigned long)&tlb_flush_target;
> +
> +	tlb_flush_target[0]++;
> +	__flush_tlb_one(addr);
> +}

So we've got an increment of a variable in kernel memory (which is
almost surely in the L1), then we flush that memory location, and repeat
the increment.

I assume the increment is there so that __flush_tlb_one() has some "real"
work to do and is not just flushing an address which is not in the TLB.
This is almost certainly a departure from the workloads Mel is addressing,
where we (try to) flush pages used long ago that will hopefully *not* be
in the TLB.

But, that unfortunately means that we're measuring a TLB _miss_ here in
addition to the flush.  A TLB miss shouldn't be *that* expensive, right?
 The SDM says: "INVLPG also invalidates all entries in all
paging-structure caches ... regardless of the linear addresses to which
they correspond."  Ugh, so the TLB refill has to also refill the paging
structure caches.  At least the page tables will be in the L1.

Since "tlb_flush_target" is in kernel mapping, you might also be
shooting down the TLB entry for kernel text, or who knows what else.
The TLB entry might be 1G or 2M which might never be in the STLB
(second-level TLB), which could have *VERY* different behavior than a 4k
flush or a flush of an entry in the first-level TLB.

I'm not sure that these loop-style tests are particularly valuable, but
if we're going to do it, I think we should consider:
1. We need to separate the TLB fill portion from the flush and not
   measure any part of a fill along with the flush
2. We should measure flushing of ascending, adjacent virtual addresses
   mapped with 4k pages since that is the normal case.  Perhaps
   vmalloc(16MB) or something.
3. We should flush a mix of virtual addresses that are in and out of the
   TLB.
4. To measure instruction (as opposed to instruction+software)
   overhead, use __flush_tlb_single(), not __flush_tlb_one()
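
As a rough illustration of points 1 and 2, a sketch along these lines
(purely hypothetical code, not something posted in this thread) would
separate the fill from the flush and time ascending 4k-mapped addresses:

#include <linux/init.h>
#include <linux/vmalloc.h>
#include <linux/timex.h>
#include <asm/tlbflush.h>

static void __init time_tlb_single_flushes(void)
{
	unsigned int npages = 4096;	/* 16MB of 4k vmalloc mappings */
	char *buf = vmalloc(npages * PAGE_SIZE);
	cycles_t start, cycles;
	unsigned int i;

	if (!buf)
		return;

	/* Fill step: touch each page so its translation is (likely) cached */
	for (i = 0; i < npages; i++)
		READ_ONCE(buf[i * PAGE_SIZE]);

	/* Flush step: timed separately from the fill above */
	start = get_cycles();
	for (i = 0; i < npages; i++)
		__flush_tlb_single((unsigned long)(buf + i * PAGE_SIZE));
	cycles = get_cycles() - start;

	pr_info("~%llu cycles per __flush_tlb_single()\n",
		(unsigned long long)cycles / npages);
	vfree(buf);
}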

P.S.  I think I'm responsible for it, but we should probably also move
the count_vm_tlb_event() to outside the loop in flush_tlb_mm_range().
invlpg is not a "normal" instruction and could potentially increase the
overhead of incrementing the counter.  But, I guess the kernel mappings
_should_ stay in the TLB over an invlpg and shouldn't pay any cost to be
refilled into the TLB despite the paging-structure caches going away.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 15:34           ` Dave Hansen
@ 2015-06-09 16:49             ` Dave Hansen
  2015-06-09 21:14               ` Dave Hansen
  2015-06-21 20:22             ` Kirill A. Shutemov
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2015-06-09 16:49 UTC (permalink / raw)
  To: Ingo Molnar, Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Andi Kleen, H Peter Anvin, Linux-MM, LKML, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner

On 06/09/2015 08:34 AM, Dave Hansen wrote:
> 2. We should measure flushing of ascending, adjacent virtual addresses
>    mapped with 4k pages since that is the normal case.  Perhaps
>    vmalloc(16MB) or something.

Now that I think about this a bit more, we really have two different
patterns here:
1. flush_tlb_mm_range() style, which is contiguous virtual address
   flushing from an unmap-style operation.
2. Mel's new try_to_unmap_one() style, which are going to be relatively
   random virtual addresses.

So, the "ascending adjacent" case is still interesting, but it wouldn't
cover Mel's new case.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 16:49             ` Dave Hansen
@ 2015-06-09 21:14               ` Dave Hansen
  2015-06-09 21:54                 ` Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2015-06-09 21:14 UTC (permalink / raw)
  To: Ingo Molnar, Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Andi Kleen, H Peter Anvin, Linux-MM, LKML, Linus Torvalds,
	Peter Zijlstra, Thomas Gleixner

I did some of what I talked about earlier in the thread.

I think the mfence (via mb()) is potentially unfair since it removes
some of the CPU's ability to optimize things.  For this kind of test,
any ability that the CPU has to smear the overhead around is a bonus in
practice and should be taken into account.

Here's the horribly hacked-together patch so you can see precisely
what's going on:

	https://www.sr71.net/~dave/intel/measure-tlb-stuff.patch

Here's a Haswell Xeon:

> [    0.222090] x86/fpu:########  MM instructions:            ############################
> [    0.222168] x86/fpu: Cost of: __flush_tlb()               fn            :   124 cycles avg:   125
> [    0.222623] x86/fpu: Cost of: __flush_tlb_global()        fn            :   960 cycles avg:   968
> [    0.222744] x86/fpu: Cost of: __flush_tlb_single()        fn            :   216 cycles avg:   216
> [    0.222864] x86/fpu: Cost of: __flush_tlb_single() vmal   fn            :   216 cycles avg:   219
> [    0.222987] x86/fpu: Cost of: __flush_tlb_one() OLD       fn            :   216 cycles avg:   216
> [    0.223139] x86/fpu: Cost of: __flush_tlb_range()         fn            :   284 cycles avg:   287
> [    0.223272] x86/fpu: Cost of: tlb miss                    fn            :     0 cycles avg:     0

And a Westmere Xeon:

> [    1.057770] x86/fpu:########  MM instructions:            ############################
> [    1.065876] x86/fpu: Cost of: __flush_tlb()               fn            :   108 cycles avg:   109
> [    1.075188] x86/fpu: Cost of: __flush_tlb_global()        fn            :   828 cycles avg:   829
> [    1.084162] x86/fpu: Cost of: __flush_tlb_single()        fn            :   232 cycles avg:   237
> [    1.093175] x86/fpu: Cost of: __flush_tlb_single() vmal   fn            :   240 cycles avg:   240
> [    1.102214] x86/fpu: Cost of: __flush_tlb_one() OLD       fn            :   284 cycles avg:   286
> [    1.111299] x86/fpu: Cost of: __flush_tlb_range()         fn            :   472 cycles avg:   478
> [    1.120281] x86/fpu: Cost of: tlb miss                    fn            :     0 cycles avg:     0

I was rather surprised how close the three __flush_tlb_single/one()
variants were on Haswell.  I've looked at a few other CPUs and this was
the only one that acted like this.

The 0 cycle TLB miss was also interesting.  It goes back up to something
reasonable if I put the mb()/mfence's back.

I don't think this kind of thing is a realistic test unless we put
mfence's around all of our TLB flushes in practice. :)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 21:14               ` Dave Hansen
@ 2015-06-09 21:54                 ` Linus Torvalds
  2015-06-09 22:32                   ` Mel Gorman
  2015-06-10 13:13                   ` Andi Kleen
  0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2015-06-09 21:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Mel Gorman, Andrew Morton, Rik van Riel,
	Hugh Dickins, Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen <dave.hansen@intel.com> wrote:
>
> The 0 cycle TLB miss was also interesting.  It goes back up to something
> reasonable if I put the mb()/mfence's back.

So I've said it before, and I'll say it again: Intel does really well
on TLB fills.

The reason is partly historical, with Win95 doing a ton of TLB
invalidation (I think every single GDI call ended up invalidating the
TLB, so under some important Windows benchmarks of the time, you
literally had a TLB flush every 10k instructions!).

But partly it is because people are wrong in thinking that TLB fills
have to be slow. There's a lot of complete garbage RISC machines where
the TLB fill took forever, because in the name of simplicity it would
stop the pipeline and often be done in SW.

The zero-cycle TLB fill is obviously a bit optimistic, but at the same
time it's not completely insane. TLB fills can be prefetched, and the
table walker can hit the cache, if you do them right. And Intel mostly
does.

So the normal full (non-global) TLB fill really is fairly cheap. It's
been optimized for, and the TLB gets re-filled fairly efficiently. I
suspect that it's really the case that doing more than just a couple
of single-tlb flushes is a complete waste of time: the flushing takes
longer than re-filling the TLB well.

                         Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 21:54                 ` Linus Torvalds
@ 2015-06-09 22:32                   ` Mel Gorman
  2015-06-09 22:35                     ` Mel Gorman
  2015-06-10 13:13                   ` Andi Kleen
  1 sibling, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-09 22:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Ingo Molnar, Andrew Morton, Rik van Riel,
	Hugh Dickins, Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote:
> On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > The 0 cycle TLB miss was also interesting.  It goes back up to something
> > reasonable if I put the mb()/mfence's back.
> 
> So I've said it before, and I'll say it again: Intel does really well
> on TLB fills.
> 
> The reason is partly historical, with Win95 doing a ton of TLB
> invalidation (I think every single GDI call ended up invalidating the
> TLB, so under some important Windows benchmarks of the time, you
> literally had a TLB flush every 10k instructions!).
> 
> But partly it is because people are wrong in thinking that TLB fills
> have to be slow. There's a lot of complete garbage RISC machines where
> the TLB fill took forever, because in the name of simplicity it would
> stop the pipeline and often be done in SW.
> 
> The zero-cycle TLB fill is obviously a bit optimistic, but at the same
> time it's not completely insane. TLB fills can be prefetched, and the
> table walker can hit the cache, if you do them right. And Intel mostly
> does.
> 
> So the normal full (non-global) TLB fill really is fairly cheap. It's
> been optimized for, and the TLB gets re-filled fairly efficiently. I
> suspect that it's really the case that doing more than just a couple
> of single-tlb flushes is a complete waste of time: the flushing takes
> longer than re-filling the TLB well.
> 

I expect I'll do another revision of the series after 4.2-rc1 as it's way
too close to 4.1's release. When that happens, I'll drop patch 4 and leave
just the full non-global flush patch. In the event there is an architecture
that really cares about the refill cost or we find that there is a corner
case where the TLB refill hurts then patch 4 will be in the mail archives
to consider for rebase and testing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 22:32                   ` Mel Gorman
@ 2015-06-09 22:35                     ` Mel Gorman
  0 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-09 22:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Ingo Molnar, Andrew Morton, Rik van Riel,
	Hugh Dickins, Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 09, 2015 at 11:32:03PM +0100, Mel Gorman wrote:
> On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote:
> > On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > The 0 cycle TLB miss was also interesting.  It goes back up to something
> > > reasonable if I put the mb()/mfence's back.
> > 
> > So I've said it before, and I'll say it again: Intel does really well
> > on TLB fills.
> > 
> > The reason is partly historical, with Win95 doing a ton of TLB
> > invalidation (I think every single GDI call ended up invalidating the
> > TLB, so under some important Windows benchmarks of the time, you
> > literally had a TLB flush every 10k instructions!).
> > 
> > But partly it is because people are wrong in thinking that TLB fills
> > have to be slow. There's a lot of complete garbage RISC machines where
> > the TLB fill took forever, because in the name of simplicity it would
> > stop the pipeline and often be done in SW.
> > 
> > The zero-cycle TLB fill is obviously a bit optimistic, but at the same
> > time it's not completely insane. TLB fills can be prefetched, and the
> > table walker can hit the cache, if you do them right. And Intel mostly
> > does.
> > 
> > So the normal full (non-global) TLB fill really is fairly cheap. It's
> > been optimized for, and the TLB gets re-filled fairly efficiently. I
> > suspect that it's really the case that doing more than just a couple
> > of single-tlb flushes is a complete waste of time: the flushing takes
> > longer than re-filling the TLB well.
> > 
> 
> I expect I'll do another revision of the series after 4.2-rc1 as it's way
> too close to 4.1's release. When that happens, I'll drop patch 4 and leave
> just the full non-global flush patch. In the event there is an architecture
> that really cares about the refill cost or we find that there is a corner
> case where the TLB refill hurts then patch 4 will be in the mail archives
> to consider for rebase and testing.
> 

To be clear, I meant that patch 4 from v6 of the series will be dropped. In
that series, patch 2 does a full non-global flush and patch 4 does the
per-pfn tracking with multiple single-page-frame flushes.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 13:05           ` Mel Gorman
@ 2015-06-10  8:51             ` Ingo Molnar
  2015-06-10  9:08               ` Ingo Molnar
  2015-06-10  9:19               ` Mel Gorman
  0 siblings, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-10  8:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Mel Gorman <mgorman@suse.de> wrote:

> > I think since it is you who wants to introduce additional complexity into the 
> > x86 MM code the burden is on you to provide proof that the complexity of pfn 
> > (or struct page) tracking is worth it.
> 
> I'm taking a situation whereby IPIs are sent like crazy with interrupt storms 
> and replacing it with something that is a lot more efficient that minimises the 
> number of potential surprises. I'm stating that the benefit of PFN tracking is 
> unknowable in the general case because it depends on the workload, timing and 
> the exact CPU used so any example provided can be naked with a counter-example 
> such as a trivial sequential reader that shows no benefit. The series as posted 
> is approximately in line with current behaviour minimising the chances of 
> surprise regressions from excessive TLB flush.
> 
> You are actively blocking a measurable improvement and forcing it to be replaced 
> with something whose full impact is unquantifiable. Any regressions in this area 
> due to increased TLB misses could take several kernel releases as the issue will 
> be so difficult to detect.
> 
> I'm going to implement the approach you are forcing because there is an x86 part 
> of the patch and you are the maintainer that could indefinitely NAK it. However, 
> I'm extremely pissed about being forced to introduce these indirect 
> unpredictable costs because I know the alternative is you dragging this out for 
> weeks with no satisfactory conclusion in an argument that I cannot prove in the 
> general case.

Stop this crap.

I made a really clear and unambiguous chain of arguments:

 - I'm unconvinced about the benefits of INVLPG in general, and your patches add
   a whole new bunch of them. I cited measurements and went out on a limb to 
   explain my position, backed with numbers and logic. It's admittedly still a 
   speculative position and I might be wrong, but I think it's a well-grounded 
   position that you cannot just brush aside.

 - I suggested that you split this approach into steps: first do the simpler
   approach that will give us at least 95% of the benefits, then the more complex
   one on top of it. Your false claim that I'm blocking a clear improvement is
   pure demagogy!

 - I very clearly claimed that I am more than willing to be convinced by numbers.
   It's not _that_ hard to construct a memory-thrashing workload with a
   TLB-efficient iteration that uses say 80% of the TLB cache, to measure the
   worst-case overhead of full flushes.

I'm really sick of this partly deceptive, partly passive-aggressive discussion 
style that seems to frequently permeate VM discussions and which made sched/numa 
such a huge PITA in the past...

And I think the numbers in the v6 series you submitted today support my position, 
so you owe me an apology I think ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10  8:51             ` Ingo Molnar
@ 2015-06-10  9:08               ` Ingo Molnar
  2015-06-10 10:15                 ` Mel Gorman
  2015-06-10  9:19               ` Mel Gorman
  1 sibling, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2015-06-10  9:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Ingo Molnar <mingo@kernel.org> wrote:

> Stop this crap.
> 
> I made a really clear and unambiguous chain of arguments:
> 
>  - I'm unconvinced about the benefits of INVLPG in general, and your patches adds
>    a whole new bunch of them. [...]

... and note that your claim that 'we were doing them before, this is just an 
equivalent transformation' is utter bullsh*t technically: what we were doing 
previously was a hideously expensive IPI combined with an INVLPG.

The behavior was dominated by the huge overhead of the remote flushing IPI, which 
does not prove or disprove either your or my opinion!

Preserving that old INVLPG logic without measuring its benefits _again_ would be 
cargo cult programming.

So I think this should be measured, and I don't mind worst-case TLB-thrashing 
measurements, which would be relatively straightforward to construct and the 
results should be unambiguous.

The batching limit (which you set to 32) should then be tuned by comparing it to a 
working full-flushing batching logic, not by comparing it to the previous single 
IPI per single flush approach!

... and if the benefits of a complex algorithm are not measurable and if there are 
doubts about the cost/benefit tradeoff then frankly it should not exist in the 
kernel in the first place. It's not like the Linux TLB flushing code is too boring 
due to overwhelming simplicity.

and yes, it's my job as a maintainer to request measurements justifying complexity 
and your ad hominem attacks against me are disgusting - you should know better.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10  8:51             ` Ingo Molnar
  2015-06-10  9:08               ` Ingo Molnar
@ 2015-06-10  9:19               ` Mel Gorman
  1 sibling, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-10  9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 10:51:41AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > > I think since it is you who wants to introduce additional complexity into the 
> > > x86 MM code the burden is on you to provide proof that the complexity of pfn 
> > > (or struct page) tracking is worth it.
> > 
> > I'm taking a situation whereby IPIs are sent like crazy with interrupt storms 
> > and replacing it with something that is a lot more efficient that minimises the 
> > number of potential surprises. I'm stating that the benefit of PFN tracking is 
> > unknowable in the general case because it depends on the workload, timing and 
> > the exact CPU used so any example provided can be naked with a counter-example 
> > such as a trivial sequential reader that shows no benefit. The series as posted 
> > is approximately in line with current behaviour minimising the chances of 
> > surprise regressions from excessive TLB flush.
> > 
> > You are actively blocking a measurable improvement and forcing it to be replaced 
> > with something whose full impact is unquantifiable. Any regressions in this area 
> > due to increased TLB misses could take several kernel releases as the issue will 
> > be so difficult to detect.
> > 
> > I'm going to implement the approach you are forcing because there is an x86 part 
> > of the patch and you are the maintainer that could indefinitely NAK it. However, 
> > I'm extremely pissed about being forced to introduce these indirect 
> > unpredictable costs because I know the alternative is you dragging this out for 
> > weeks with no satisfactory conclusion in an argument that I cannot prove in the 
> > general case.
> 
> Stop this crap.
> 
> I made a really clear and unambiguous chain of arguments:
> 
>  - I'm unconvinced about the benefits of INVLPG in general, and your patches adds
>    a whole new bunch of them. I cited measurements and went out on a limb to 
>    explain my position, backed with numbers and logic. It's admittedly still a 
>    speculative position and I might be wrong, but I think it's well grounded 
>    position that you cannot just brush aside.
> 

And I explained my concerns with the use of a full flush and the difficulty
of measuring its impact in the general case. I also explained why I thought
starting with PFN tracking was an incremental approach. The argument looped
so I bailed.

>  - I suggested that you split this approach into steps that first does the simpler
>    approach that will give us at least 95% of the benefits, then the more complex
>    one on top of it. Your false claim that I'm blocking a clear improvement is
>    pure demagogy!
> 

The splitting was already done and released, and I followed up saying that
I'll be dropping patch 4 in the final merge request. If we find in a few
kernel releases' time that it was required, then there will be a realistic
bug report to use as a reference.

>  - I very clearly claimed that I am more than willing to be convinced by numbers.
>    It's not _that_ hard to construct a memory trashing workload with a
>    TLB-efficient iteration that uses say 80% of the TLB cache, to measure the
>    worst-case overhead of full flushes.
> 

And what I had said was that in the general case it will not show us any
proof. Any other workload will always behave differently, and two critical
components are the exact timing relative to when kswapd runs and the CPU
used. Even if it's demonstrated for a single workload on one CPU, we would
still be faced with the same problem and would still drop patch 4.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10  9:08               ` Ingo Molnar
@ 2015-06-10 10:15                 ` Mel Gorman
  2015-06-11 15:26                   ` Ingo Molnar
  0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-10 10:15 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 11:08:13AM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > Stop this crap.
> > 
> > I made a really clear and unambiguous chain of arguments:
> > 
> >  - I'm unconvinced about the benefits of INVLPG in general, and your patches adds
> >    a whole new bunch of them. [...]
> 
> ... and note that your claim that 'we were doing them before, this is just an 
> equivalent transformation' is utter bullsh*t technically: what we were doing 
> previously was a hideously expensive IPI combined with an INVLPG.
> 

And replacing it with an INVLPG without excessive IPI transmission is
changing one major variable. Going straight to a full TLB flush is changing
two major variables. I thought the refill cost was high, partially based
on the estimate of 22,000 cycles in https://lkml.org/lkml/2014/7/31/825.
I've been told in these discussions that I'm wrong and the cost is not
high. As it'll always be variable, we can never be sure, which is why
I do not see the value in building a complex test around it that will be
invalidated the instant we use a different CPU. When/if a workload shows
up that really cares about those refill costs, then there will be a stable
test case to work from.

> The behavior was dominated by the huge overhead of the remote flushing IPI, which 
> does not prove or disprove either your or my opinion!
> 
> Preserving that old INVLPG logic without measuring its benefits _again_ would be 
> cargo cult programming.
> 
> So I think this should be measured, and I don't mind worst-case TLB trashing 
> measurements, which would be relatively straightforward to construct and the 
> results should be unambiguous.
> 
> The batching limit (which you set to 32) should then be tuned by comparing it to a 
> working full-flushing batching logic, not by comparing it to the previous single 
> IPI per single flush approach!
> 

We can decrease it easily, but increasing it means we also have to change
SWAP_CLUSTER_MAX because otherwise not enough pages are unmapped per pass
to fill a larger batch, and it is a requirement that we flush before freeing
the pages. That changes another complex variable because, at the very least,
it alters LRU lock hold times.

> ... and if the benefits of a complex algorithm are not measurable and if there are 
> doubts about the cost/benefit tradeoff then frankly it should not exist in the 
> kernel in the first place. It's not like the Linux TLB flushing code is too boring 
> due to overwhelming simplicity.
> 
> and yes, it's my job as a maintainer to request measurements justifying complexity 
> and your ad hominem attacks against me are disgusting - you should know better.
> 

It was not intended as an ad hominem attack and my apologies for that.
I wanted to express my frustration that a series that adjusted one variable
with a known benefit would be rejected in favour of a series that adjusts
two major variables instead, with the second variable being very sensitive
to workload and CPU.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 21:54                 ` Linus Torvalds
  2015-06-09 22:32                   ` Mel Gorman
@ 2015-06-10 13:13                   ` Andi Kleen
  2015-06-10 16:17                     ` Linus Torvalds
  1 sibling, 1 reply; 41+ messages in thread
From: Andi Kleen @ 2015-06-10 13:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, Andi Kleen,
	H Peter Anvin, Linux-MM, LKML, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 09, 2015 at 02:54:01PM -0700, Linus Torvalds wrote:
> On Tue, Jun 9, 2015 at 2:14 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > The 0 cycle TLB miss was also interesting.  It goes back up to something
> > reasonable if I put the mb()/mfence's back.
> 
> So I've said it before, and I'll say it again: Intel does really well
> on TLB fills.

Assuming the page tables are cache-hot... And hot here does not mean
L3 cache, but higher. But a memory intensive workload can easily
violate that.

That's why I'm dubious of all these micro benchmarks. They won't be
clearing caches. They generate unrealistic conditions in the CPU
pipeline and overestimate the cost of the flushes.

The only good way to measure TLB costs is macro benchmarks.

-Andi


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 13:13                   ` Andi Kleen
@ 2015-06-10 16:17                     ` Linus Torvalds
  2015-06-10 16:42                       ` Linus Torvalds
  2015-06-10 17:07                       ` Mel Gorman
  0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2015-06-10 16:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Assuming the page tables are cache-hot... And hot here does not mean
> L3 cache, but higher. But a memory intensive workload can easily
> violate that.

In practice, no.

You'll spend all your time on the actual real data cache misses, the
TLB misses won't be any more noticeable.

And if your access patterns are even *remotely* cache-friendly (ie
_not_ spending all your time just waiting for regular data cache
misses), then a radix-tree-like page table like Intel's will have much
better locality in the page tables than in the actual data. So again,
the TLB misses won't be your big problem.

There may be pathological cases where you just look at one word per
page, but let's face it, we don't optimize for pathological or
unrealistic cases.

And the thing is, you need to look at the costs. Single-page
invalidation taking hundreds of cycles? Yeah, we definitely need to
take the downside of trying to be clever into account.

If the invalidation was really cheap, the rules might change. As it
is, I really don't think there is any question about this.

That's particularly true when the single-page invalidation approach
has lots of *software* overhead too - not just the complexity, but
even the "obvious" cost of feeding the list of pages to be invalidated
across CPUs. Think about it - there are cache misses there too, and
because we do those across CPUs those cache misses are *mandatory*.

So trying to avoid a few TLB misses by forcing mandatory cache misses
and extra complexity, and by doing lots of 200+ cycle operations?
Really? In what universe does that sound like a good idea?

Quite frankly, I can pretty much *guarantee* that you didn't actually
think about any real numbers, you've just been taught that fairy-tale
of "TLB misses are expensive". As if TLB entries were somehow sacred.

If somebody can show real numbers on a real workload, that's one
thing. But in the absence of those real numbers, we should go for the
simple and straightforward code, and not try to make up "but but but
it could be expensive" issues.

Trying to avoid IPI's by batching them up sounds like a very obvious
win. I absolutely believe that is worth it. I just don't believe it is
worth it trying to pass a lot of page detail data around. The number
of individual pages flushed should be *very* controlled.

Here's a suggestion: the absolute maximum number of TLB entries we
flush should be something we can describe in one single cacheline.
Strive to minimize the number of cachelines we have to transfer for
the IPI, instead of trying to minimize the number of TLB entries. So
maybe something along the lines of "one word of flags (all or nothing,
or a range), or up to seven individual addresses to be flushed". I
personally think even seven invlpg's is likely too much, but with the
"single cacheline" at least there's another basis for the limit too.

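Just to make that concrete (this is only a sketch of the idea, not code
from the series - the struct name and layout here are made up), on x86-64
with 64-byte cachelines it would be something like:

	struct tlb_flush_batch {
		unsigned long	flags;		/* full flush / flush range / nr of entries */
		unsigned long	addr[7];	/* up to seven individual addresses */
	};					/* 8 * 8 bytes = 64 bytes = one cacheline */
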
So anyway, I like the patch series. I just think that the final patch
- the one that actually saves the addresses, and limits things to
BATCH_TLBFLUSH_SIZE, should be limited.

                   Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 16:17                     ` Linus Torvalds
@ 2015-06-10 16:42                       ` Linus Torvalds
  2015-06-10 17:24                         ` Mel Gorman
  2015-06-10 18:08                         ` Josh Boyer
  2015-06-10 17:07                       ` Mel Gorman
  1 sibling, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2015-06-10 16:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So anyway, I like the patch series. I just think that the final patch
> - the one that actually saves the addresses, and limits things to
> BATCH_TLBFLUSH_SIZE, should be limited.

Oh, and another thing:

Mel, can you please make that "struct tlbflush_unmap_batch" be just
part of "struct task_struct" rather than a pointer?

If you are worried about the cpumask size, you could use

      cpumask_var_t cpumask;

and

        alloc_cpumask_var(..)
...
        free_cpumask_var(..)

for that.

That way, sane configurations never have the allocation cost.
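
Roughly, just to sketch the idea (not the actual patch; the member names,
including 'tlb_ubc', are made up for illustration):

	struct tlbflush_unmap_batch {
		cpumask_var_t	cpumask;	/* a pointer only if CONFIG_CPUMASK_OFFSTACK=y */
		unsigned int	nr_pages;
		bool		flush_required;
	};

	/* when the task is created: */
	if (!alloc_cpumask_var(&tsk->tlb_ubc.cpumask, GFP_KERNEL))
		return -ENOMEM;

	/* when the task exits; a no-op without CONFIG_CPUMASK_OFFSTACK: */
	free_cpumask_var(tsk->tlb_ubc.cpumask);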

(Of course, sad to say, sane configurations are probably few and far
between. At least Fedora seems to ship with a kernel where NR_CPUS is
1024 and thus CONFIG_CPUMASK_OFFSTACK=y. Oh well. What a waste of CPU
cycles that is..)

                  Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 16:17                     ` Linus Torvalds
  2015-06-10 16:42                       ` Linus Torvalds
@ 2015-06-10 17:07                       ` Mel Gorman
  1 sibling, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2015-06-10 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Dave Hansen, Ingo Molnar, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 09:17:15AM -0700, Linus Torvalds wrote:
> On Wed, Jun 10, 2015 at 6:13 AM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > Assuming the page tables are cache-hot... And hot here does not mean
> > L3 cache, but higher. But a memory intensive workload can easily
> > violate that.
> 
> In practice, no.
> 
> You'll spend all your time on the actual real data cache misses, the
> TLB misses won't be any more noticeable.
> 
> And if your access patterns are even *remotely* cache-friendly (ie
> _not_ spending all your time just waiting for regular data cache
> misses), then a radix-tree-like page table like Intel's will have much
> better locality in the page tables than in the actual data. So again,
> the TLB misses won't be your big problem.
> 
> There may be pathological cases where you just look at one word per
> page, but let's face it, we don't optimize for pathological or
> unrealistic cases.
> 

It's concerns like this that have me avoiding any micro-benchmarking approach
that tries to measure the indirect costs of refills. No matter what the
microbenchmark does, there will be other cases that render it irrelevant.

> And the thing is, you need to look at the costs. Single-page
> invalidation taking hundreds of cycles? Yeah, we definitely need to
> take the downside of trying to be clever into account.
> 
> If the invalidation was really cheap, the rules might change. As it
> is, I really don't think there is any question about this.
> 
> That's particularly true when the single-page invalidation approach
> has lots of *software* overhead too - not just the complexity, but
> even "obvious" costs feeding the list of pages to be invalidated
> across CPU's. Think about it - there are cache misses there too, and
> because we do those across CPU's those cache misses are *mandatory*.
> 
> So trying to avoid a few TLB misses by forcing mandatory cache misses
> and extra complexity, and by doing lots of 200+ cycle operations?
> Really? In what universe does that sound like a good idea?
> 
> Quite frankly, I can pretty much *guarantee* that you didn't actually
> think about any real numbers, you've just been taught that fairy-tale
> of "TLB misses are expensive". As if TLB entries were somehow sacred.
> 

Everyone has been taught that one. Papers I've read from the last two
years on TLB implementations or page reclaim management bring this up as
a supporting point for whatever they are proposing. It was partially why
I kept PFN tracking, and also why I put much of the cost on the reclaimer
to minimise interference on the recipient of the IPI. I still think it was
a rational concern but will assume that refills are cheaper than smart
invalidations until it can be proven otherwise.

> If somebody can show real numbers on a real workload, that's one
> thing.

The last adjustments made today to the series are at
http://git.kernel.org/cgit/linux/kernel/git/mel/linux-balancenuma.git/log/?h=mm-vmscan-lessipi-v7r5
I'll redo it on top of 4.2-rc1 whenever that happens so it gets a full round
in linux-next.  Patch 4 can be revisited if a real workload is found that
is not deliberately pathological running on a CPU that matters. The forward
port of patch 4 for testing will be trivial.

The series also separates out the dynamic allocation of the structure so that
it can be excluded if it is deemed an unnecessary complication.

> So anyway, I like the patch series. I just think that the final patch
> - the one that actually saves the addresses, and limits things to
> BATCH_TLBFLUSH_SIZE, should be limited.
> 

I see your logic but if it's limited then we send more IPIs and it's all
crappy tradeoffs. If a real workload complains, it'll be far easier to
work with.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 16:42                       ` Linus Torvalds
@ 2015-06-10 17:24                         ` Mel Gorman
  2015-06-10 17:31                           ` Linus Torvalds
  2015-06-10 18:08                         ` Josh Boyer
  1 sibling, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2015-06-10 17:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Dave Hansen, Ingo Molnar, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 09:42:34AM -0700, Linus Torvalds wrote:
> On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So anyway, I like the patch series. I just think that the final patch
> > - the one that actually saves the addresses, and limits things to
> > BATCH_TLBFLUSH_SIZE, should be limited.
> 
> Oh, and another thing:
> 
> Mel, can you please make that "struct tlbflush_unmap_batch" be just
> part of "struct task_struct" rather than a pointer?
> 

Yes, that was done earlier today based on Ingo's review so that the
allocation could be dealt with as a separate patch at the end of the series.

> If you are worried about the cpumask size, you could use
> 
>       cpumask_var_t cpumask;
> 
> and
> 
>         alloc_cpumask_var(..)
> ...
>         free_cpumask_var(..)
> 
> for that.
> 
> That way, sane configurations never have the allocation cost.
> 

Ok, good point.  Patch 3 in my git tree ("mm: Dynamically allocate TLB
batch unmap control structure") does not do this but I'll look into doing
it before the release based on 4.2-rc1.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 17:24                         ` Mel Gorman
@ 2015-06-10 17:31                           ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2015-06-10 17:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andi Kleen, Dave Hansen, Ingo Molnar, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 10:24 AM, Mel Gorman <mgorman@suse.de> wrote:
>
> Yes, that was done earlier today based on Ingo's review so that the
> allocation could be dealt with as a separate path at the end of the series.

Ahh, ok, never mind then.

> Ok, good point.  Patch 3 in my git tree ("mm: Dynamically allocate TLB
> batch unmap control structure") does not do this but I'll look into doing
> it before the release based on 4.2-rc1.

I'm not sure how size-sensitive this is. The 'struct task_struct' is
pretty big already, and if somebody builds a MAXSMP kernel, I really
don't think they worry too much about wasting a few bytes for each
process. Clearly they are either insane, or they actually *have* a big
machine, in which case the allocation is not going to be wasteful and
they'll likely trigger this code all the time anyway, so trying to be
any more dynamic about it is probably not worth it.

So my "maybe you could use 'cpumask_var_t' here" suggestion isn't
necessarily even worth it. Just statically allocating it is probably
perfectly fine.

Of course, again the counter-example for that may well be how distros
seem to make thousand-cpu configurations the default. That does seem
insane to me. But I guess the pain of multiple kernel configs is just
too much.

                Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 16:42                       ` Linus Torvalds
  2015-06-10 17:24                         ` Mel Gorman
@ 2015-06-10 18:08                         ` Josh Boyer
  1 sibling, 0 replies; 41+ messages in thread
From: Josh Boyer @ 2015-06-10 18:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Dave Hansen, Ingo Molnar, Mel Gorman, Andrew Morton,
	Rik van Riel, Hugh Dickins, Minchan Kim, H Peter Anvin, Linux-MM,
	LKML, Peter Zijlstra, Thomas Gleixner

On Wed, Jun 10, 2015 at 12:42 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Jun 10, 2015 at 9:17 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> So anyway, I like the patch series. I just think that the final patch
>> - the one that actually saves the addresses, and limits things to
>> BATCH_TLBFLUSH_SIZE, should be limited.
>
> Oh, and another thing:
>
> Mel, can you please make that "struct tlbflush_unmap_batch" be just
> part of "struct task_struct" rather than a pointer?
>
> If you are worried about the cpumask size, you could use
>
>       cpumask_var_t cpumask;
>
> and
>
>         alloc_cpumask_var(..)
> ...
>         free_cpumask_var(..)
>
> for that.
>
> That way, sane configurations never have the allocation cost.
>
> (Of course, sad to say, sane configurations are probably few and far
> between. At least Fedora seems to ship with a kernel where NR_CPUS is
> 1024 and thus CONFIG_CPUMASK_OFFSTACK=y. Oh well. What a waste of CPU
> cycles that is..)

The insane part being NR_CPUS = 1024?  Or that having said number
requires the cpumask to be dynamically allocated to avoid stack overflow?
(Or both, I guess.)

josh

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-10 10:15                 ` Mel Gorman
@ 2015-06-11 15:26                   ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-11 15:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Hugh Dickins, Minchan Kim,
	Dave Hansen, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner


* Mel Gorman <mgorman@suse.de> wrote:

> > > I made a really clear and unambiguous chain of arguments:
> > > 
> > >  - I'm unconvinced about the benefits of INVLPG in general, and your patches adds
> > >    a whole new bunch of them. [...]
> > 
> > ... and note that your claim that 'we were doing them before, this is just an 
> > equivalent transformation' is utter bullsh*t technically: what we were doing 
> > previously was a hideously expensive IPI combined with an INVLPG.
> 
> And replacing it with an INVLPG without excessive IPI transmission is changing 
> one major variable. Going straight to a full TLB flush is changing two major 
> variables. [...]

But this argument employs the fallacy that the transition to 'batching with PFN 
tracking' is not a major variable in itself. In reality it is a major variable: it 
adds extra complexity, such as the cross CPU data flow (the pfn tracking), and it 
also changes the distribution of the flushes and related caching patterns.

> > [...]
> > 
> > The batching limit (which you set to 32) should then be tuned by comparing it 
> > to a working full-flushing batching logic, not by comparing it to the previous 
> > single IPI per single flush approach!
> 
> We can decrease it easily but increasing it means we also have to change 
> SWAP_CLUSTER_MAX because otherwise enough pages are not unmapped for flushes and 
> it is a requirement that we flush before freeing the pages. That changes another 
> complex variable because at the very least, it alters LRU lock hold times.

('should then be' implied 'as a separate patch/series', obviously.)

I.e. all I wanted to observe is that I think the series did not explore the 
performance impact of the batching limit, because it was too focused on the INVLPG 
approach which has an internal API limit of 33.

Now that the TLB flushing side is essentially limit-less, a future enhancement 
would be to further increase batching.

My suspicion is that, say, doubling SWAP_CLUSTER_MAX would possibly further reduce
the IPI rate, at a negligible cost.

But this observation does not impact the current series in any case.

> > ... and if the benefits of a complex algorithm are not measurable and if there 
> > are doubts about the cost/benefit tradeoff then frankly it should not exist in 
> > the kernel in the first place. It's not like the Linux TLB flushing code is 
> > too boring due to overwhelming simplicity.
> > 
> > and yes, it's my job as a maintainer to request measurements justifying 
> > complexity and your ad hominem attacks against me are disgusting - you should 
> > know better.
> 
> It was not intended as an ad hominem attack and my apologies for that. I wanted 
> to express my frustration that a series that adjusted one variable with known 
> benefit will be rejected for a series that adjusts two major variables instead 
> with the second variable being very sensitive to workload and CPU.

... but that's not what it did: it adjusted multiple complex variables already, 
with a questionable rationale for more complexity.

And my argument was and continues to be: start with the simplest variant and 
iterate from there. Which you seem to have adapted in your latest series, so my 
concerns are addressed:

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-09 15:34           ` Dave Hansen
  2015-06-09 16:49             ` Dave Hansen
@ 2015-06-21 20:22             ` Kirill A. Shutemov
  2015-06-25 11:48               ` Ingo Molnar
  1 sibling, 1 reply; 41+ messages in thread
From: Kirill A. Shutemov @ 2015-06-21 20:22 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins,
	Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner

On Tue, Jun 09, 2015 at 08:34:35AM -0700, Dave Hansen wrote:
> On 06/09/2015 05:43 AM, Ingo Molnar wrote:
> > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096);
> > +static void fn_flush_tlb_one(void)
> > +{
> > +	unsigned long addr = (unsigned long)&tlb_flush_target;
> > +
> > +	tlb_flush_target[0]++;
> > +	__flush_tlb_one(addr);
> > +}
> 
> So we've got an increment of a variable in kernel memory (which is
> almost surely in the L1), then we flush that memory location, and repeat
> the increment.

BTW, Ingo, have you disabled the direct mapping of kernel memory with 2M/1G
pages for the test?

I'm just wondering whether there is a chance that the test is shooting down a
1G TLB entry. In that case we're measuring the wrong thing.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-21 20:22             ` Kirill A. Shutemov
@ 2015-06-25 11:48               ` Ingo Molnar
  2015-06-25 18:46                 ` Dave Hansen
       [not found]                 ` <CA+55aFykFDZBEP+fBeqF85jSVuhWVjL5SW_22FTCMrCeoihauw@mail.gmail.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-25 11:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Dave Hansen, Mel Gorman, Andrew Morton, Rik van Riel,
	Hugh Dickins, Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM,
	LKML, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov


* Kirill A. Shutemov <kirill@shutemov.name> wrote:

> On Tue, Jun 09, 2015 at 08:34:35AM -0700, Dave Hansen wrote:
> > On 06/09/2015 05:43 AM, Ingo Molnar wrote:
> > > +static char tlb_flush_target[PAGE_SIZE] __aligned(4096);
> > > +static void fn_flush_tlb_one(void)
> > > +{
> > > +	unsigned long addr = (unsigned long)&tlb_flush_target;
> > > +
> > > +	tlb_flush_target[0]++;
> > > +	__flush_tlb_one(addr);
> > > +}
> > 
> > So we've got an increment of a variable in kernel memory (which is
> > almost surely in the L1), then we flush that memory location, and repeat
> > the increment.
> 
> BTW, Ingo, have you disabled the direct mapping of kernel memory with 2M/1G
> pages for the test?

No, the kernel was using gbpages.

> I'm just wondering whether there is a chance that the test is shooting down a 1G TLB
> entry. In that case we're measuring the wrong thing.

Yeah, you are right, it's slightly wrong, but still gave us the right numbers to 
work with.

I've updated the benchmarks with 4K flushes as well. Changes to the previous 
measurement:

 - I've refined the testing methodology. (It now looks at the true distribution of
   delays, not just the minimum or an average: this is especially important for
   cache-cold numbers where perturbations (such as irq or SMI events) can generate
   outliers in both directions.)

 - Added tests to attempt to measure the TLB miss costs: I've split the INVLPG
   measurements into over a dozen separate variants, which should help measure the
   overall impact of TLB flushes and help quantify the 'hidden' costs
   associated with them as well.

Here are the numbers:

Old 64-bit CPU (AMD) using 2MB/4K pages:

[    8.122002] x86/bench: Running x86 benchmarks:              cache-    hot  /   cold cycles
[   12.269003] x86/bench:########  MM methods:        #######################################
[   12.369003] x86/bench: __flush_tlb()               fn            :    127  /    273
[   12.469003] x86/bench: __flush_tlb_global()        fn            :    589  /   1501
[   12.568003] x86/bench: access           2M page    fn            :      0  /    440
[   12.668003] x86/bench: invlpg           2M page    fn            :     93  /    512
[   12.768003] x86/bench: access+invlpg    2M page    fn            :    107  /    512
[   12.868003] x86/bench: invlpg+access    2M page    fn            :     94  /    535
[   12.972003] x86/bench: flush+access     2M page    fn            :    113  /   1060
[   13.070003] x86/bench: access        1x 4K page    fn            :      0  /      5
[   13.166003] x86/bench: access        2x 4K page    fn            :      1  /    150
[   13.264003] x86/bench: access        3x 4K page    fn            :      1  /    331
[   13.360003] x86/bench: access        4x 4K page    fn            :      1  /    360
[   13.463003] x86/bench: invlpg+access 1x 4K page    fn            :     95  /    247
[   13.566003] x86/bench: invlpg+access 2x 4K page    fn            :    188  /    543
[   13.668003] x86/bench: invlpg+access 3x 4K page    fn            :    281  /    836
[   13.774003] x86/bench: invlpg+access 4x 4K page    fn            :    374  /   1165
[   13.878003] x86/bench: flush+access  1x 4K page    fn            :    114  /    275
[   13.981003] x86/bench: flush+access  2x 4K page    fn            :    115  /    480
[   14.083003] x86/bench: flush+access  3x 4K page    fn            :    115  /    671
[   14.183003] x86/bench: flush+access  4x 4K page    fn            :    116  /    670
[   14.286003] x86/bench: flush            4K page    fn            :    108  /    247
[   14.385003] x86/bench: access+invlpg    4K page    fn            :     94  /    287
[   14.502003] x86/bench: __flush_tlb_range()         fn            :    280  /   5173

Newer 64-bit CPU (Intel) using gbpages/4K pages:

[            ] x86/bench: Running x86 benchmarks:              cache-    hot  /   cold cycles
[            ] x86/bench:########  MM methods:        #######################################
[  124.362251] x86/bench: __flush_tlb()               fn            :    152  /    168
[  125.287283] x86/bench: __flush_tlb_global()        fn            :    936  /   1312
[  126.227641] x86/bench: access           GB page    fn            :     12  /    304
[  127.152369] x86/bench: invlpg           GB page    fn            :    280  /    584
[  128.084376] x86/bench: access+invlpg    GB page    fn            :    280  /    592
[  129.018430] x86/bench: invlpg+access    GB page    fn            :    280  /    596
[  129.939640] x86/bench: flush+access     GB page    fn            :    152  /    396
[  130.849827] x86/bench: access        1x 4K page    fn            :      0  /      0
[  131.796836] x86/bench: access        2x 4K page    fn            :     12  /    432
[  132.738849] x86/bench: access        3x 4K page    fn            :     16  /    736
[  133.661315] x86/bench: access        4x 4K page    fn            :     12  /    736
[  134.575238] x86/bench: invlpg+access 1x 4K page    fn            :    280  /    356
[  135.506638] x86/bench: invlpg+access 2x 4K page    fn            :    500  /   1152
[  136.432645] x86/bench: invlpg+access 3x 4K page    fn            :    728  /   1920
[  137.355536] x86/bench: invlpg+access 4x 4K page    fn            :    956  /   3084
[  138.296174] x86/bench: flush+access  1x 4K page    fn            :    148  /    280
[  139.229264] x86/bench: flush+access  2x 4K page    fn            :    152  /    728
[  140.165103] x86/bench: flush+access  3x 4K page    fn            :    136  /    736
[  141.088691] x86/bench: flush+access  4x 4K page    fn            :    152  /    820
[  141.994946] x86/bench: flush            4K page    fn            :    280  /    328
[  142.902499] x86/bench: access+invlpg    4K page    fn            :    280  /    352
[  143.814408] x86/bench: __flush_tlb_range()         fn            :    424  /   2100

Notes:

 - 'access' means an ACCESS_ONCE() load.

 - 'flush'  means a CR3 based full flush

 - 'invlpg' means an __flush_tlb_one() INVLPG flush

 - the 4K pages are vmalloc()-ed pages. 

 - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the 
   first byte in each

 - PGE is turned off in the CR4 for these measurements, to make sure the
   vmalloc()-ed page(s) get flushed by the CR3 method as well.

 - there's a bit of jitter in the numbers, for example the 'flush+access 3x 4K'
   number is lower - but the jitter is in the 10-cycle range at most
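
For reference, the 'invlpg+access Nx 4K page' variants boil down to
something like the sketch below (the real harness, timing loop and exact
interleaving may differ, and vmalloc_pages[] is just an assumed array of
vmalloc()-ed test pages):

	static void fn_invlpg_access_4x_4k(void)
	{
		int i;

		for (i = 0; i < 4; i++) {
			unsigned long addr = (unsigned long)vmalloc_pages[i];

			__flush_tlb_one(addr);			/* INVLPG of one 4K page */
			(void)ACCESS_ONCE(*(char *)addr);	/* then pay the TLB refill */
		}
	}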

As you can see (assuming the numbers are correct), the 'full' CR3 based flushing 
is superior to any __flush_tlb_one() method:

[  134.575238] x86/bench: invlpg+access 1x 4K page    fn            :    280  /    356
[  135.506638] x86/bench: invlpg+access 2x 4K page    fn            :    500  /   1152
[  136.432645] x86/bench: invlpg+access 3x 4K page    fn            :    728  /   1920
[  137.355536] x86/bench: invlpg+access 4x 4K page    fn            :    956  /   3084

[  138.296174] x86/bench: flush+access  1x 4K page    fn            :    148  /    280
[  139.229264] x86/bench: flush+access  2x 4K page    fn            :    152  /    728
[  140.165103] x86/bench: flush+access  3x 4K page    fn            :    136  /    736
[  141.088691] x86/bench: flush+access  4x 4K page    fn            :    152  /    820

In the cache-hot case: with the CR3 based method the _full_ cost of flushing (i.e. 
the cost of the flush plus the cost of the access) is roughly constant. With 
INVLPG it's both higher, and increases linearly.
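
(Concretely: cache-hot invlpg+access goes 280 -> 500 -> 728 -> 956 cycles,
i.e. roughly +225 cycles per extra page, while cache-hot flush+access stays
essentially flat around 150 cycles regardless of how many pages are touched
afterwards.)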

In the cache-cold case: the costs are increasing linearly, but it's much higher in 
the INVLPG case. 

So in fact I'd argue that, somewhat counter-intuitively, we should consider doing 
a full flush even for single-page flushes, for example for page faults...

I'll do more measurements (and will publish the latest patches) to increase 
confidence in the numbers.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-25 11:48               ` Ingo Molnar
@ 2015-06-25 18:46                 ` Dave Hansen
  2015-06-26  9:08                   ` Ingo Molnar
       [not found]                 ` <CA+55aFykFDZBEP+fBeqF85jSVuhWVjL5SW_22FTCMrCeoihauw@mail.gmail.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Dave Hansen @ 2015-06-25 18:46 UTC (permalink / raw)
  To: Ingo Molnar, Kirill A. Shutemov
  Cc: Mel Gorman, Andrew Morton, Rik van Riel, Hugh Dickins,
	Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM, LKML,
	Linus Torvalds, Peter Zijlstra, Thomas Gleixner, Borislav Petkov

On 06/25/2015 04:48 AM, Ingo Molnar wrote:
> I've updated the benchmarks with 4K flushes as well. Changes to the previous 
> measurement:

Did you push these out somewhere?  The tip tmp.fpu branch hasn't seen
any updates.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
       [not found]                 ` <CA+55aFykFDZBEP+fBeqF85jSVuhWVjL5SW_22FTCMrCeoihauw@mail.gmail.com>
@ 2015-06-25 19:15                   ` Vlastimil Babka
  2015-06-25 22:04                     ` Linus Torvalds
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2015-06-25 19:15 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar
  Cc: Thomas Gleixner, linux-mm, Borislav Petkov, Hugh Dickins,
	Andrew Morton, Minchan Kim, Rik van Riel, Kirill A. Shutemov,
	H Peter Anvin, Andi Kleen, Peter Zijlstra, Dave Hansen,
	Mel Gorman, LKML

On 25.6.2015 20:36, Linus Torvalds wrote:
> 
> On Jun 25, 2015 04:48, "Ingo Molnar" <mingo@kernel.org> wrote:
>>
>>  - 1x, 2x, 3x, 4x means up to 4 adjacent 4K vmalloc()-ed pages are accessed, the
>>    first byte in each
> 
> So that test is a bit unfair. From previous timing of Intel TLB fills, I can
> tell you that Intel is particularly good at doing adjacent entries.
> 
> That's independent of the fact that page tables have very good locality (if they
> are the radix tree type - the hashed page tables that ppc uses are shit). So
> when filling adjacent entries, you take the cache misses for the page tables
> only once, but even aside from that, Intel seems to do particularly well at the
> "next page" TLB fill case

AFAIK that's because they also cache partial translations, so if the first 3
levels are the same (as they mostly are for the "next page" scenario) it will
only have to look at the last level of page tables. AMD does that too.

> Now, I think that's a reasonably common case, and I'm not saying that it's
> unfair to compare for that reason, but it does highlight the good case for TLB
> walking.
> 
> So I would suggest you highlight the bad case too: use invlpg to invalidate
> *one* TLB entry, and then walk four non-adjacent entries. And compare *that* to
> the full TLB flush.
> 
> Now, I happen to still believe in the full flush, but let's not pick benchmarks
> that might not show the advantages of the finer granularity.
> 
>         Linus
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-25 19:15                   ` Vlastimil Babka
@ 2015-06-25 22:04                     ` Linus Torvalds
  0 siblings, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2015-06-25 22:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Ingo Molnar, Thomas Gleixner, linux-mm, Borislav Petkov,
	Hugh Dickins, Andrew Morton, Minchan Kim, Rik van Riel,
	Kirill A. Shutemov, H Peter Anvin, Andi Kleen, Peter Zijlstra,
	Dave Hansen, Mel Gorman, LKML

On Thu, Jun 25, 2015 at 12:15 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>   [..]  even aside from that, Intel seems to do particularly well at the
>> "next page" TLB fill case
>
> AFAIK that's because they also cache partial translations, so if the first 3
> levels are the same (as they mostly are for the "next page" scenario) it will
> only have to look at the last level of pages tables. AMD does that too.

In the tests I did long ago, Intel seemed to do even more than that.

Now, it's admittedly somewhat hard to do a good job at TLB fill cost
measurement with any aggressively out-of-order core (lots of
asynchronous and potentially speculative stuff going on), so without
any real knowledge about what is going on it's kind of hard to say.
But I had an old TLB benchmark (long lost now, so I can't re-run it to
check on this) that implied that when Intel brought in a new TLB entry,
they brought it in 32 bytes at a time, and would then look up the
adjacent entries in that aligned half-cache-line in just a single
cycle.

(This was back in the P6/P4/Core Duo kinds of days - we're talking
10-15 years ago.  The tests I did mapped the constant zero-page in a
big area, so that the effects of the D$ would be minimized, and I'd
basically *only* see the TLB miss costs. I also did things like
measure the costs of the micro-fault that happens when marking a page
table entry dirty for the first time etc, which was absolutely
horrendously slow on a P4).

Now, it could have been other effects too - things like multiple
outstanding TLB misses in parallel due to prefetching TLB entries
could possibly cause similar patterns, although I distinctly remember
that effect of aligned 8-entry groups. So the simplest explanation
*seemed* to be a very special single-entry 32-byte "L0 cache" for TLB
fills (and 32 bytes would probably match the L1 fetch width into the
core).

That's in *addition* to the fact that the upper level entries are
cached in the TLB.

As mentioned earlier, from my Transmeta days I have some insight into
what old pre-NT Windows used to do. The GDI protection traversal
seemed to flush the whole TLB for every GDI kernel entry, and some of
the graphics benchmarks (remember: this was back in the days when
unaccelerated VGA graphics was still a big deal) would literally blow
the TLB away every 5-10k instructions. So it actually makes sense that
Intel and AMD spent a *lot* of effort on making TLB fills go fast,
because those GDI benchmarks were important back then. It was all the
basic old 2D window handling and font drawing etc benchmarking.

None of the RISC vendors ever cared, they had absolutely crap
hardware, and instead encouraged their SW partners (particularly
databases) to change the software to use large-pages and any trick
possible to avoid TLB misses. Their compilers would schedule loads
early and stores late, because the whole memory subsystem was a toy.
TLB misses would break the whole pipeline etc. Truly crap hardware,
and people still look up to it. Sad.

In the Windows world, that wasn't an option: when Microsoft said
"jump", Intel and AMD both said "how high?". As a result, x86 is a
*lot* more well-rounded than any RISC I have ever seen, because
Intel/AMD were simply forced to run any crap software, rather than
having software writers optimize for what they were good at.

                       Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/3] TLB flush multiple pages per IPI v5
  2015-06-25 18:46                 ` Dave Hansen
@ 2015-06-26  9:08                   ` Ingo Molnar
  0 siblings, 0 replies; 41+ messages in thread
From: Ingo Molnar @ 2015-06-26  9:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kirill A. Shutemov, Mel Gorman, Andrew Morton, Rik van Riel,
	Hugh Dickins, Minchan Kim, Andi Kleen, H Peter Anvin, Linux-MM,
	LKML, Linus Torvalds, Peter Zijlstra, Thomas Gleixner,
	Borislav Petkov


* Dave Hansen <dave.hansen@intel.com> wrote:

> On 06/25/2015 04:48 AM, Ingo Molnar wrote:
>
> > I've updated the benchmarks with 4K flushes as well. Changes to the previous 
> > measurement:
> 
> Did you push these out somewhere?  The tip tmp.fpu branch hasn't seen any 
> updates.

Not yet, I still don't trust the numbers.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2015-06-26  9:09 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-08 12:50 [PATCH 0/3] TLB flush multiple pages per IPI v5 Mel Gorman
2015-06-08 12:50 ` [PATCH 1/3] x86, mm: Trace when an IPI is about to be sent Mel Gorman
2015-06-08 12:50 ` [PATCH 2/3] mm: Send one IPI per CPU to TLB flush multiple pages that were recently unmapped Mel Gorman
2015-06-08 22:38   ` Andrew Morton
2015-06-09 11:07     ` Mel Gorman
2015-06-08 12:50 ` [PATCH 3/3] mm: Defer flush of writable TLB entries Mel Gorman
2015-06-08 17:45 ` [PATCH 0/3] TLB flush multiple pages per IPI v5 Ingo Molnar
2015-06-08 18:21   ` Dave Hansen
2015-06-08 19:52     ` Ingo Molnar
2015-06-08 20:03       ` Ingo Molnar
2015-06-08 21:07       ` Dave Hansen
2015-06-08 21:50         ` Ingo Molnar
2015-06-09  8:47   ` Mel Gorman
2015-06-09 10:32     ` Ingo Molnar
2015-06-09 11:20       ` Mel Gorman
2015-06-09 12:43         ` Ingo Molnar
2015-06-09 13:05           ` Mel Gorman
2015-06-10  8:51             ` Ingo Molnar
2015-06-10  9:08               ` Ingo Molnar
2015-06-10 10:15                 ` Mel Gorman
2015-06-11 15:26                   ` Ingo Molnar
2015-06-10  9:19               ` Mel Gorman
2015-06-09 15:34           ` Dave Hansen
2015-06-09 16:49             ` Dave Hansen
2015-06-09 21:14               ` Dave Hansen
2015-06-09 21:54                 ` Linus Torvalds
2015-06-09 22:32                   ` Mel Gorman
2015-06-09 22:35                     ` Mel Gorman
2015-06-10 13:13                   ` Andi Kleen
2015-06-10 16:17                     ` Linus Torvalds
2015-06-10 16:42                       ` Linus Torvalds
2015-06-10 17:24                         ` Mel Gorman
2015-06-10 17:31                           ` Linus Torvalds
2015-06-10 18:08                         ` Josh Boyer
2015-06-10 17:07                       ` Mel Gorman
2015-06-21 20:22             ` Kirill A. Shutemov
2015-06-25 11:48               ` Ingo Molnar
2015-06-25 18:46                 ` Dave Hansen
2015-06-26  9:08                   ` Ingo Molnar
     [not found]                 ` <CA+55aFykFDZBEP+fBeqF85jSVuhWVjL5SW_22FTCMrCeoihauw@mail.gmail.com>
2015-06-25 19:15                   ` Vlastimil Babka
2015-06-25 22:04                     ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).